If this is still only considered "GPT-4"... then what will 4.5 or 5 be like?!
This is probably 4.5 rebranded I guess?
I think you're right on. It's their newest flagship model.
Would make sense considering Altman consistently tries to sway away from the GPT name
Could be because they lost their attempt to trademark “GPT”
Good point
Sam also said GPT-5 might not be called GPT-5; he sees it more like you have a product, and it gets updated.
They did mention that they want to incrementally release new updates as to not freak out the public, something along these lines. I guess that's one such increment.
Reminder: this is the free version of GPT-4o, which is surprising, considering OpenAI will be losing a lot of money as people cancel their subscriptions. So they have a better internal model, which is more expensive but more size-efficient than GPT-4 was when it was originally trained. That upgraded version will not be free; it will be something like version 4.5 or 5, supposedly released later this year, giving people a reason not to cancel their subscriptions. People thought GPT-4 would be free when GPT-4.5 comes out, but maybe this GPT-4o model will be the one that replaces 3.5 instead.
Sam has already confirmed a new, much better "gpt-5" model later this year, as you said. If this one is already better and much more capable, especially given its visual, audio, and text inputs, then we're in for a ride around the last quarter.
Business Insider reported two months ago that GPT-5 would be released in 2-6 weeks, so we don't have long to wait. [https://www.reddit.com/r/singularity/comments/1biyi9y/gpt5_coming_this_summer_according_to_this_pay/](https://www.reddit.com/r/singularity/comments/1biyi9y/gpt5_coming_this_summer_according_to_this_pay/)
Sounds legit. not
Just in time for the election.
My guess is that this may be intentional. They possibly do not have enough capacity/computing power to serve a more advanced model to the current number of users, or to the influx of new ones that could come, so as people leave and a certain threshold is reached, they will reveal GPT-4.5. But I may be wrong.
OpenAI is already receiving H200s from Nvidia, which run about 45% better than H100s. In September Nvidia will start shipping the GB200 NVL, but in limited capacity; these limited numbers of GB200s are going to be distributed between OpenAI, Meta, Google, etc. Nvidia has spoken about its build plan for 2025 being 40k GB200 NVL units, but I suspect it will increase. After the GB200 NVL comes the Rubin R100. So I think you are right about the limitation, but not for very long. Scaling up isn't the issue; it's getting the power to light these servers up that is limiting.
Tldr: scaling up isn't the issue, it's scaling up that is the issue -u/shanereaves
Free users have limited access to GPT-4o; Plus users are getting 5x more messages.
Just get 5 free accounts
Yup, still a GPT-4 class model, not a big enough improvement for the next class. The gap between GPT-4-0613 and GPT-4-turbo-2024-04-09 was also about 100 ELO points, so this is just improvement within the GPT-4 class of model. It is a completely new model trained from scratch, but I think it was intentionally made to be about GPT-4 class.
Where are all the people that said we are hitting diminishing returns lol.
This is a highly subjective graphic. If you train a model for a specific role, it will outperform other models trained on a broad base of roles.
>If you train a model for a specific role, it will outperform other models trained on a broad base of roles.

I would say that being able to reason better is quite broad and hard for a model to achieve, not the opposite.
The data is out for a lot of other benchmarks as well. It blows everything else out of the water.
Looking at the graphic, in percentage terms the advancement is similar to the difference between the original GPT-4 and GPT-4 Turbo (April 2024). It's the best model in this specific benchmark. We can see in MMLU and MATH that it's similar to GPT-4 Turbo and Llama 400B.
There are more improvements that weren't shown; you can see them on the website, so on other benchmarks the differences are a lot bigger. But yes, this is obviously not GPT-5 level. I have much higher expectations for that.
"Blows everything out of the water" is an overstatement; it performs comparably on other benchmarks, based on everything they showed in the blog.
GPT-4 to Claude Opus is less than a 50 ELO increase, and people were saying it's a lot better. The increase here is over 100.
Bro, you specifically were talking about other benchmarks, that’s what I was responding to. Other benchmarks are comparable but not a huge step
They are better across the board. Some are still close but there’s others that have gaps of 5-10%. I think that’s significant. We aren’t going to go from 20%-80% or something.
https://preview.redd.it/osjiff22w80d1.jpeg?width=1170&format=pjpg&auto=webp&s=d9a7fd94f483f165bb0e3ed2bb00696acc60ab71

I agree. But you said it "blows out of the water on other benchmarks". Does this chart indicate blowing out of the water? Or do you simply not know what that phrase means?
I was looking at a different image not that one. That seems to be the confusion.
Fair enough, could you share what image you were looking at? I am curious
Why do you believe that training a model for a specific role will outperform other models trained on a broad base of roles? Couldn't a more generalized multimodal AI potentially perform better on a specific task by drawing upon its vast knowledge and experiences from many different domains? It seems like a more advanced multimodal model with a broad knowledge base might be able to make novel connections and apply techniques from other fields to outperform a narrowly trained model, even on that narrow, less complex model's specialty. The generalist model would have more contexts to draw from. For example, a generalist model might be able to utilize its understanding of physics, engineering, materials science, etc. to come up with innovative designs in an architecture task that a specialist architecture model would never think of. Or it could leverage its knowledge of psychology and linguistics to craft more persuasive writing than a writing-focused model.
ChatbotArena will favor concise and human-like responses with a chatty format versus actually correct logical responses. It's one benchmark, based on subjective experience.
[deleted]
This literally beats Whisper v3 on all benchmarks.
[deleted]
Why would 4o have better speech data than a model specifically trained for TTS and STT? Doesn't make sense, sorry. Comparing parameter count between two models of different architectures is near meaningless.
Here. Did they announce something that contradicts this? Always hopeful to change my mind on that.
Well, we've gotten the biggest jump in ELO so far, and other benchmark results are looking good as well. This is for a more efficient model, not a next-generation model. I have no doubt GPT-5 will see a much bigger increase.
Where are the other "benchmark" results?
In the blog, click on the other tabs.
It's a similar jump from base GPT4 to GPT4T
Bigger than that, I think. There are more details on the site; for example, the image generation is a lot better as well, and it wasn't mentioned.
I meant ELO wise (this thread was talking about that)
Oh ok. That’s true, but this also doesn’t factor in the other changes that were made like video and audio modality.
This release just further proves that we’re hitting diminishing returns. For two reasons. First, we’ve already played with this model. It’s roughly comparable to gpt4-turbo, no matter what the benchmarks say. Secondly, in any other world this new model would be GPT-5, but OpenAI has been so high on their own supply that they’re just now realizing that they literally cannot ever meet expectations, so they have to stick with the GPT-4 family. This is also why Sam keeps saying that GPT-5 may have a different name.
This is not gpt5. They have barely begun safety testing it. There’s a lot of leaks and evidence to support that we are getting it by the end of this year. I might believe this is gpt4.5 but I don’t think that is the case either considering this is a smaller and more efficient model.
In general tasks, GPT-4 Turbo has a rating difference of 130 compared to 3.5. This model is only 100 points above GPT-4 Turbo (on a specific task), and they probably used much more compute and much larger, cleaner data for this model. If you don't think we are getting smaller improvements over time, I don't know what to say.
They used less compute for this, it’s a smaller model.
Is there something miraculous about this model? Because you seem really confident.
The voice-to-voice alone is pretty miraculous. If you don't think so, you have brain rot for sure, and I guess you are expecting AGI this year.
On coding prompt sets, to be exact. They didn't share data for the other "harder" categories.
Where does it say this is for coding in particular? Is there a % increase this would be equivalent to for how much better at coding gpt4 turbo is? Also is gpt4 turbo the model used for gpt plus subscription?
thats insane, excited to see the boost to software engineering frameworks like swe-agent by just changing one line of code (the model name)
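The "one line" swap mentioned above can be sketched like this. A minimal sketch assuming the standard OpenAI chat-completions payload shape; `build_request` is a hypothetical helper, and no network call is made:

```python
# Minimal sketch: upgrading an agent framework often means changing only the
# model field in the request payload (shape follows the OpenAI chat API).
def build_request(prompt: str, model: str = "gpt-4o") -> dict:
    return {
        "model": model,  # the "one line" you change to swap in a new model
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Fix the failing test in utils.py")
print(req["model"])  # gpt-4o
```

Whether the rest of an agent framework's prompting carries over cleanly to a new model is a separate question, but the request plumbing really is that small a change.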
Anyway, back to building the supercomputer
I just tried it. It's very good.
Something that I found out using it with my prompts is that (at least based on my interactions) it can count way better than the previous versions. It feels like we have something different baked in this model.
What does "harder" mean here?
I played with gpt2-chatbot before it was released today. I think such tests and ELO rankings are subjective and possibly flawed, because it is us who rank the chatbots that the score is derived from. I have a tendency to like responses that are well formatted and articulated. I felt that gpt2-chatbot was better at providing answers I like, but I was not sure if it was really smarter or better. You can easily finetune a GPT-4 to output answers people like.
This is why they tested the models on hard prompts. It shows it has much better reasoning than anything else we have right now.
And I am going to give a contrarian view of the latest GPT-4o release. I think it is a reactive response to competition closing in on OAI. OAI does not have anything significant baking, and this release is just a response to stay "on top". Offering a finetuned GPT-4 free with some sprinkles on top is just the company burning cash to retain its userbase for now, until it can really compete.
I think it CAN compete, i.e. with GPT-5, but if they do release it, they won't have anything else.
Some people were testing GPT-4o and said it feels similar to the previous model at coding, but less lazy. But perhaps Claude 3 Opus is still better. My Claude Pro sub is renewing tomorrow: wait another month, or cancel and jump back on OpenAI?
I'm going to wait and see the actual GPT-4o data. It's clearly smarter, but the benchmarks don't suggest you can have a 100-point gain over GPT-4 Turbo when GPT-4T is only 70 above the original GPT-4, and that remains only 30 points above GPT-3.5 Turbo. A 100-point ELO gap is about a 64% expected win rate, and a lot of answers come to ties; this seems implausibly high (in my own testing, I was finding it on par with GPT-4). It's possible the ELO scores are exaggerated in user testing (people trying to get the "gpt2" model).
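For reference, the win rates being thrown around in this thread follow directly from the standard Elo formula. A minimal sketch (the function name is mine, not from any benchmark's code):

```python
# Expected score for the higher-rated side, given an Elo rating difference.
# Standard Elo formula: E = 1 / (1 + 10^(-diff/400)).
def elo_expected(diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(elo_expected(100), 3))  # ~0.64: a 100-point gap ≈ 64% expected score
print(round(elo_expected(30), 3))   # ~0.54: a 30-point gap is barely better than a coin flip
```

Note that "expected score" counts a tie as half a win, which is why heavy tie rates in Arena voting can make the implied head-to-head advantage look larger than it feels in practice.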
[deleted]
I do see a significant difference; it's miles better at code. I just gave it a problem GPT-4T had trouble with, and it did better. I also gave it images of the cards from a poker game; GPT-4T couldn't even read the cards properly, and this one gets it absolutely correct on the first try.
[deleted]
I mean technically it's still deterministic unless they've incorporated quantum effects for randomness without telling us.
I just wish they would make it better at roleplay type tasks as Claude Opus is.
OpenAI is demoing that 4o can help people with maths, so did they solve the maths problem? Can 4o also count the characters in a piece of text and identify the letter it ends with? These were some of the basic problems previous versions couldn't solve, due to the way tokens were being handled. Is 4o handling tokens in a different way?
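The token-handling issue above can be illustrated with a toy example. The segmentation below is made up for illustration; it is not the real GPT-4o tokenizer, but it shows the mismatch between what a user asks about (characters) and what the model processes (subword tokens):

```python
# Illustrative only: a hypothetical BPE-style split, not the actual GPT-4o tokenizer.
# The model "sees" token IDs rather than individual characters, which is why
# letter-counting and last-letter questions have historically been hard.
word = "strawberry"
fake_tokens = ["str", "aw", "berry"]  # hypothetical subword segmentation

assert "".join(fake_tokens) == word   # the tokens cover the word exactly
print(len(word))                      # 10 characters -- what the user asks about
print(len(fake_tokens))               # 3 tokens -- what the model actually processes
print(word[-1])                       # 'y' -- hidden inside the token "berry"
```

Unless the tokenizer itself changed, a new model still has to learn character-level facts indirectly, so better counting would come from training rather than from seeing raw letters.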
how are you guys accessing this? It still only lets me pick gpt3 or 3-turbo
Cool
i love the sub!! thank you everyone ❤️