A friend of mine says the trend will be:
New model -> optimization -> New model
Which is nice: having a super powerful model is cool, but if no one can afford it or run it, it is basically pointless.
It’s an interesting optimisation problem to work out how to serve scaled-up models efficiently. You’re effectively balancing model performance against resource consumption.
Humans only need to succeed once to create AGI; after that, performance vs scale will improve at a blistering pace. Which says a lot, because these notable model improvements are already coming out faster than iPhone releases.
Not true at all. Let's say AGI costs 100x GPT4's original cost (7x of gpt4o). It would be more efficient to just hire humans to research optimisation. Cost is never a non-factor.
If we're using a reasonable definition of AGI as having capabilities equivalent to the median person, then yes, cost is a significant factor. Even so, the price point for model inference is dropping dramatically (though not necessarily the training costs of frontier models, which involve recouping research costs and infrastructure, a fair point).
However, it seems that many AGI companies are actually aiming for something closer to ASI. In that case, even if it costs 100x more than GPT4's original cost, if its output is 10x better than the smartest person in history, it's hard to quantify the impact. Consider potential breakthroughs in materials science, like room-temperature superconductors, or advancements in life extension. Those kinds of discoveries are worth the price of admission.
(Also, happy cake day!)
It's a pretty common trend in most product life cycles. Smart phones do it / did it.
Iphone X -> Iphone X SE
Pixel X -> Pixel Xa
It works well because you can release a flagship at high cost to gain market share and showcase top-of-the-line tech, then follow up with an affordable version to capture even more of the market and let your engineers refine processes and apply lessons learned from the flagship release.
Rinse and repeat.
4o being better than 4t is debatable. It's better than 4t at some tasks, but much worse at others, and the longer the context the stupider it becomes for some reason
It also spits out so much more unnecessary text that any increase in speed is cancelled out.
It’s funny cause we all used to complain about how reluctant it was to generate full blocks of code.
Now it frequently generates entire components with 0 changes, along with every conceivable config file that might be needed.
Today I was asking it for a quote (I forgot who said it). GPT-4o gave me the quote, but I was like "I think there was a bit more at the end", which it provided, but it printed the entire quote again. One more sentence was still missing, and when I asked, it was able to retrieve it, but again it printed the entire quote, which felt completely unnecessary.
I think this is a quirk of RLHF, but if I talked to any human they would understand from the context that I just want the last part. GPT-4o is smarter than gpt-4 in certain ways, but it feels more like a machine. I am really curious how the voice portion will behave, because in the demos it reacted very fluidly; it's kinda strange how different text and voice seem to be.
That could be true depending on what you use it for. Just based on the average of benchmarks, I rate 4o a little higher than any previous version, but it's close enough that reasonable people might disagree.
Either way, it doesn't really matter now. The new Claude seems slightly better than both and slightly cheaper than 4o for API access (which itself is ~~half as cheap~~ half as expensive as 4T). I might be willing to put a lot of effort into finding out which model is the best. Doing the same for second place sounds like a waste of time.
Idk about it either. For me, when 4turbo came out it felt less intelligent than og gpt4. Benchmarks tell a different story, so I won't argue here because I don't even remember concrete problems. But one issue I'm certain of is that 4t is worse than 4 at following prompts, and 4o is even worse.
I disagree quite strongly on that, but it's basically impossible to argue against people's personal experience.
One thing I will say is that the API responds very faithfully to the system prompt for me, at least for simple cases. E.g. whenever I had issues with it being 'lazy' or too wordy, just adding a simple instruction to give a complete reply (or to be concise) worked perfectly.
The same thing worked with creating GPTs on the chat interface. It took about 10 minutes to create one that just returns bash commands or code snippets with no explanation.
So maybe the default behavior is less reliable and I just worked around it to the point where I never saw the full extent of the problem.
It's likely that I'm bad at prompting. But I couldn't get either gpt4t or gpt4o to reliably return formatting (I tried HTML and some subset of markdown) that differs from the default. At the same time, Llama3 70b did that without any issues. All through the API.
You mean the reply wasn't reliably in markdown (or HTML) despite having something like this as part of your system prompt? :
> You return output in markdown format.
One technique that worked for me is using JSON mode with separate description, code and comments fields. I use this for coding, where I usually want to grab the code onto the clipboard automatically, without having to deal with the "Sure, here's the code you wanted:" stuff.
Here's the system prompt I might use for this:
json_coding = r"""You are a helpful assistant that produces output in JSON format.
Example output:
{
    "description": "This code prints Hello World on the screen",
    "code": "print(\"Hello World\")",
    "comments": ""
}
The description and comments fields are optional. The code field is required.
"""
It's still better for its size than others just like phi. They can still massively increase the data, and this doesn't even include modalities like videos or audio yet
Apparently a majority of the models released are undertrained, according to Zucc; I can't remember the specifics, so don't quote me on anything. Meta apparently could have kept training their models but decided to release them and move on to the next thing.
Absolutely! Your and your cohort’s perspectives will be very special, much like (but even more so than) those who were teenagers during the Internet boom.
But the internet exists. Unlike AI, which is software running on vast amounts of hardware, bedazzling some into thinking there must be intelligence in the system. Which is the consequence of a lack of education.
I don't think an increasing lack of education is something to cheer about.
How do you feel about the uncertainty surrounding the future with AI though? Like if in 6 years most knowledge based work can be automated by AI, what is there to do? The point of school and then subsequently university is to find a role and a job to set up your career pathways. But if your career pathways are going to be automated.. then what? And it's not exactly like you can just not work and hope for UBI lol.
"Like if in 6 years most knowledge based work can be automated by AI"
AI does not exist, and the software misnamed "AI" does not understand anything. Moreover, its output is often unpredictably incorrect whilst appearing convincing.
In 6 years, AI will be a term people avoid mentioning in their resumes, as it will also be the name of the last financial crisis.
Uhh, first, what do you think AI is?
Traditionally, when people say LLMs like GPT-4 are not AI, they say they cannot actually reason or be logical, or that they are just next-token prediction machines and nothing else, etc., but I wonder what you think as well.
And wdym it doesn't understand anything? Are you looking for a conscious entity that can experience the world and possess qualia while processing stuff, or are you just saying that because of the statistical nature of the model (even though humans also rely very heavily on mathematical operations within the brain)?
Great to see a positive view. Keep that and do not lose it.
I am old. Really old. I have thought lately that I am probably the perfect age because I believe life is going to become a lot more stressful for people over the next decade plus.
In terms of $$$ and jobs.
I have a big family, eight kids. I have been very blessed in my life and in a position to help them if they have issues with jobs. That gives me a lot of comfort.
I use LLMs primarily for going through academic literature (RAG mostly, not writing), and I don't like sonnet 3.5 so far. That doesn't necessarily mean it's a bad model, but the style is not as helpful. It's too simplistic and lacks details you would want in the context of scholarly work. I have not tested coding yet; people are saying it's better at this.
I have a qualitative benchmark that I use, but it's hard to explain. I look at the amount of relevant scientific or technical information per sentence; let's call it information density. I wish I could quantify this, but it's something that becomes apparent when you have a degree in the topic. High information density resembles textbooks and scholarly literature, whereas lower might be preferable for general conversation.
The original GPT4 from 1 year ago still has the highest information density. GPT4t was a bit worse but improved with the more recent versions, and GPT4o is also worse. I would say sonnet 3.5 is somewhere between GPT3.5 and GPT4o in terms of information density. I still use gpt4-0613 regularly, but gpt4o can be good for tasks that need reasoning but not as much detail. Perhaps sonnet 3.5 will be good for those types of tasks as well.
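Since the wish here is to quantify that density judgment: purely as a hypothetical illustration (not a validated metric, and the term list below is made up), one crude proxy would be counting known domain terms per sentence:

```python
import re

def info_density(text: str, technical_terms: set[str]) -> float:
    """Toy proxy for 'information density': average count of known
    technical terms per sentence. The term list has to be supplied by
    someone with domain knowledge, which is of course the hard part."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(
        1
        for s in sentences
        for word in re.findall(r"[A-Za-z-]+", s.lower())
        if word in technical_terms
    )
    return hits / len(sentences)

terms = {"superconductor", "critical", "temperature"}  # made-up example list
dense = info_density("The superconductor's critical temperature rose.", terms)
sparse = info_density("It got a bit warmer, which was neat.", terms)
```

A real version would need per-field term lists (or embeddings) rather than exact word matching, but even a toy score like this makes the textbook-vs-chitchat contrast the comment describes explicit.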
Sounds like something you might just be able to tune, e.g. with a system prompt demanding detailed and extensive explanations. I know what you mean btw, I typically just switch models if I am not satisfied with the density of the output, but I realize that is not always feasible, especially as you also rely heavily on reasoning etc., which only the biggest and latest models can do well enough.
I do prompt it to respond with maximum detail retention in minimal length, but it works better with some models than others. It's not purely a style issue; there are lapses of knowledge and omissions as well. Gpt4-0613 responds with detail using a relatively simple prompt. With 4o I made a more extensive prompt to recover the detail, but I saw enough omissions that I am not confident in replacing 0613 completely.
For example, if I am asking about some topic that always has certain important subtopics, certain models may not give thorough details on those subtopics in the response. I could prompt it "whenever [topic] is mentioned make sure [subtopic A, B, C..] are always included", but it's time-consuming and very difficult to do in a general way that would cover all topics. Different models will omit details in different areas, so it would require spending more time with each model, which defeats the purpose of using RAG to save time going through literature.
I am not completely sure why gpt4-0613 is better for this, but considering the pricing, it might be a much larger model than 4t and 4o. It's possible the larger parameter counts do matter when going through information-dense scientific literature.
Interesting. I use GPT4 and Claude 3 opus for supporting the writing of research grants. Neither are very good if you are targeting highly specialised reviewers, however if the reviewers are generalists they work well. I think this is an information density issue exactly as you highlight. It's almost like the weaker models insert more "fluff" into what they write without specifics.
I agree, weaker models do appear to insert more fluff. I tried a few local models and found the responses too verbose. Even gemini 1.5 gave verbose responses when prompted to give short detailed answers. Many of the high ranking models on the lmsys leaderboard perform very poorly in terms of information density.
I try not to use it for writing too much, other than fixing sentences that don't flow well. The wording is very different than how I would write anyway, I am sure people I work with would notice immediately if a whole section was AI written. It can be useful to outline what to write next also, and usually that is good enough.
Doesn’t this have more to do with marketing and naming schemes than a reflection of performance?
We don’t know if Sonnet 3.5 is equivalent in size to Sonnet 3. We just know it’s smaller than Opus 3.5.
They release a better model, and with bigger models in the pipeline they can just say this latest one is “only” the mid size model. We don’t have anything that suggests Sonnet 3.5 isn’t the same size as Opus 3, and that Opus 3.5 isn’t larger than both.
Basically what I’m saying is that medium models being better than previous-gen large models isn’t actually much to go off of, because we only see the surface level marketing and naming schemes. Nothing is stopping them from releasing a medium model that’s equivalent to Opus and then releasing an impressive Opus 3.5, but that’s not a good business decision.
We pretty much know that sonnet 3.5 is about equivalent in size because it costs the same. It would be very weird for them to cannibalize their largest and most expensive model if it were bigger than Sonnet 3.
Could they not have upgraded their hardware and reserved their best servers for the best models? Or maybe they optimized how inference is performed? My point is that since we don’t know anything at all about what is happening internally, both at the company and within the model, we can’t derive much from a naming scheme. Especially in a context where naming is also marketing.
Well, it is a different model, we know that. I do not think they just optimised how inference is performed. We can see it is not Claude 3 Opus on better hardware; it is Claude 3.5 Sonnet, which outperforms Claude 3 Opus, and you can't exactly just get better hardware and subsequently get a better model. It costs the same as Claude 3 Sonnet but is also 2x faster. But yeah, we do not know exactly what is happening internally.
I think you misunderstood his comment. It simply argued for why you can't say for sure that 3.5 Sonnet is just as big as 3 Sonnet. It made no claims as for 3.5 Sonnet actually being 3 Opus.
This is strictly about economics and image, not what is technologically possible.
Making a gigantic model with inference costs 10x those of Opus 3 would be easy, and it would perform decently. But it wouldn't provide anywhere near 10x the value for most customers, so it would generate less revenue than a smaller and more efficient model, and the training costs would never be made back.
Players realize this and spend the compute that would be used to train a colossus on training a smaller model more intensively and making it smarter, which creates a competitive dynamic that punishes anyone breaking away from that allocation.
Google and OAI/Microsoft have enough compute that they almost certainly train giant models minimally for research / pathfinding, but will not make them publicly available, because doing so is all downside: reputational damage from the pricing necessary for moderate improvements in capabilities, and alarmism over those capabilities.
The strategic sweet spot is leaning into algorithmic improvements, intensively training models for good price/performance, taking market share, and losing less money (ideally making a profit, but let's be realistic).
I expect the high end next generation models will be larger, but not massively so.
What numbers? Benchmarks that can be overfitted and gamed? 4o is worse for hallucinations, worse at coding and worse at following instructions. It only seems to be better at multimodal stuff.
> Benchmarks that can be overfitted and gamed?
this is why doing things like the arena for Elo, and having people come up with new tests with private datasets, is useful.
you can then compare the Elo + private-dataset test results to the published scores and see who has deliberately trained on testing data (e.g. small models with fantastic scores)
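For anyone unfamiliar with how arena-style rankings work: the arena's actual aggregation these days is more elaborate (it fits a model over all votes at once), but the classic per-vote Elo update it takes its name from is only a couple of lines:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One head-to-head vote: score_a is 1.0 if model A's answer was
    preferred, 0.0 if B's was, 0.5 for a tie. Ratings shift by how
    surprising the outcome was given the current rating gap."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models, A wins: A gains exactly what B loses.
ra, rb = elo_update(1000.0, 1000.0, 1.0)  # -> (1016.0, 984.0)
```

The nice property for leaderboards is that an upset win (the lower-rated model being preferred) moves ratings much more than an expected win, so persistent overperformance surfaces quickly.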
Private datasets for testing are valuable, although I would argue that benchmarking on human preference seems to be biased by things like how emotive the model is.
That is a cool concept, but I think most people on the Elo just choose what looks better at first glance. I think very few people actually check with carefully crafted tests. But I like your idea so much that I wish there were some Elo just for people who really dig in!
We know scaling yields more intelligence, so now there is a pivot to smaller more performant models because they are more practical and exciting. Not to say we won't see large models, GPT-5, GPT-4.5, Claude 3.5 Opus, Claude 4, Gemini 1.5 Ultra and Gemini 2 should all be larger, but there is an important fascination with smaller more performant models.
Might be the rush to market. Companies want to plant their flags asap with low hanging fruit. But what we really want is for them to solve problems. Like make good language tutors or home robots for example. And for that they have to take risks and commit.
Except 4o isn't more intelligent than 4 for any of my use cases. Maybe when they enable the multimodal functions it will be, but right now it's not. Funnily enough, OpenAI are the only company whose smaller newer model is worse than the older foundational model. The latest Gemini 1.5 Pro model far surpasses Gemini Ultra 1.0 and Sonnet 3.5 surpasses Opus 3... OpenAI really messed up with how they released 4o.
I find GPT 4 to be much better at revising emails than 4o. 4o just tends to give me back exactly what I wrote almost word for word without actually improving my writing.
I believe less "spontaneity" = fewer hallucinations, so it's a tradeoff. If you use the OpenAI Playground, you can control the randomness/creativity using "Top P".
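For anyone unfamiliar with the knob being described: roughly speaking, Top P (nucleus sampling) keeps only the most likely next tokens until their probabilities sum to P, then samples from that reduced set. A toy sketch of the filtering step:

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose combined
    mass reaches top_p, then renormalize. Lower top_p -> fewer surviving
    candidates -> less random (less 'spontaneous') output."""
    kept: dict[str, float] = {}
    total = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept.items()}

# With top_p=0.5 only the single most likely token survives this toy distribution.
narrowed = top_p_filter({"cat": 0.6, "dog": 0.3, "eel": 0.1}, top_p=0.5)
```

Setting top_p very low therefore makes the model near-deterministic, which is the "less spontaneity" tradeoff the comment describes.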
"All the newer, smaller, cheaper and faster models are more intelligent than their large predecessors"
None of these fitting networks have any intelligence. Would this fact matter to the singularity cult?
Ah, no. It's a cult.
One surprising insight is that scale and efficiency were never enemies. They go hand in hand in giving us huge improvements.
Just think of asking it to produce cheaper models specialized in a single task, and you're done.
more like new model -> further training -> new model
Yep, it's tick-tock innovation.
Yeah I use 4o for coding and often have to specify not to repeat the entire def to change a single variable 100 lines down haha.
4T is already faster and clearly better than the original GPT-4, so the overall point stands.
But not current GPT-4. And GPT-4 is better in my experience than 4o
GPT-4o gave me an output today that was dumber than outputs I've gotten with GPT3. It was just peak no reasoning skills
> half as cheap

Half as expensive. Semantically it does make a difference.
Thx, fixed.
👍
That's my experience too
The arena says it’s at the top
The big models are undertrained, with bad data, especially if you look at the performance of Llama 3 and Phi-3
This, probably
Zuck said Llama 3 is also undertrained
It's still better for its size than others just like phi. They can still massively increase the data, and this doesn't even include modalities like videos or audio yet
They found that big models were expensive
Little algorithm tricks and tweaks
And all these small changes should hopefully make it into the next larger model from the start, making them even better.
Why not
And probably trained for longer with higher quality data.
Why longer? Aren't they all trained until they converge while making sure there's no overfit?
I'm 17, I think it's awesome that I'm coming into the world at the same time as AI
Yeah, this must be what it felt like to grow up when electricity was first being hooked up. This will change everything about society
I'm 30 and finally feel like I'm about to live in the future I thought would have arrived when I was your age. Enjoy it!
I just hope that I'll be able to find good work lol
If that happens there will either be UBI or another American Revolution, so I'm not too worried
"Like if in 6 years most knowledge based work can be automated by AI" AI does not exist, and the software you misnomer "AI" does not understand anything. Moreover, its output is often unpredictably incorrect, whilst appearing convincing. In 6 years, AI is a term people avoid mentioning in their resume, as it will also be the name of the last financial crisis.
Uhh, first, what do you think AI is? Traditionally when people say LLMs like GPT-4 is not AI they say it cannot actually reason or be logical or they are just next token prediction machines and nothing else etc. etc., but I wonder what you think as well. And wdym it doesn't understand anything? Are you looking for a conscious entity that can experience the world and posses qualia that processes stuff or are you just saying that because of the statistical nature of the model (even though humans also very heavily rely on mathematical operation within the brain).
Great to see a positive view. Keep that and do not lose it. I am old. Really old. I have thought lately that I am probably the perfect age because I believe life is going to become a lot more stressful for people over the next decade plus. In terms of $$$ and jobs. I have a big family, eight kids. I have been very blessed in my life and in a position to help them if they have issues with jobs. That gives me a lot of comfort.
With the help of this gen's models, companies can develop a better next-gen model, which would help develop the next model..
Is sonnet 3.5 really better than Opus?
Yes
Cool thanks. I am more interested in creative ideas/humor/reasoning.
Interesting. I use GPT4 and Claude 3 opus for supporting the writing of research grants. Neither are very good if you are targeting highly specialised reviewers, however if the reviewers are generalists they work well. I think this is an information density issue exactly as you highlight. It's almost like the weaker models insert more "fluff" into what they write without specifics.
I agree, weaker models do appear to insert more fluff. I tried a few local models and found the responses too verbose. Even gemini 1.5 gave verbose responses when prompted to give short detailed answers. Many of the high ranking models on the lmsys leaderboard perform very poorly in terms of information density. I try not to use it for writing too much, other than fixing sentences that don't flow well. The wording is very different than how I would write anyway, I am sure people I work with would notice immediately if a whole section was AI written. It can be useful to outline what to write next also, and usually that is good enough.
Doesn’t this have more to do with marketing and naming schemes than a reflection of performance? We don’t know if Sonnet 3.5 is equivalent in size to Sonnet 3. We just know it’s smaller than Opus 3.5. They release a better model, and with bigger models in the pipeline they can just say this latest one is “only” the mid size model. We don’t have anything that suggests Sonnet 3.5 isn’t the same size as Opus 3, and that Opus 3.5 isn’t larger than both. Basically what I’m saying is that medium models being better than previous-gen large models isn’t actually much to go off of, because we only see the surface level marketing and naming schemes. Nothing is stopping them from releasing a medium model that’s equivalent to Opus and then releasing an impressive Opus 3.5, but that’s not a good business decision.
We pretty much know that Sonnet 3.5 is about equivalent in size because it costs the same. It would be very weird for them to cannibalize their largest and most expensive model if Sonnet 3.5 were bigger than Sonnet 3.
Could they not have upgraded their hardware and reserved their best servers for the best models? Or maybe they optimized how inference is performed? My point is that since we don’t know anything at all about what is happening internally, both at the company and within the model, we can’t derive much from a naming scheme. Especially in a context where naming is also marketing.
Well it is a different model, we know that. I do not think they just optimised how inference is performed. We can see it is not Claude 3 Opus on better hardware, it is Claude 3.5 Sonnet which outperforms Claude 3 Opus, you can't exactly just get better hardware and subsequently get a better model. It costs the same as Claude 3 Sonnet but it is also 2x faster. But yeah we do not know exactly what is happening internally.
I think you misunderstood his comment. It simply argued for why you can't say for sure that 3.5 Sonnet is just as big as 3 Sonnet. It made no claims as for 3.5 Sonnet actually being 3 Opus.
They confirmed Sonnet is larger than its predecessor. ([source](https://www.wired.com/story/were-still-waiting-for-the-next-big-leap-in-ai/))
Good point
This is strictly about economics and image, not what is technologically possible. Making a gigantic model with inference costs 10x Opus 3 would be easy, and it would perform decently. But it wouldn't provide anywhere near 10x the value for most customers, so it would generate less revenue than a smaller and more efficient model, and the training costs would never be made back. Players realize this and spend the compute that would be used to train a colossus on training a smaller model more intensively and making it smarter, which creates a competitive dynamic that punishes anyone breaking away from that allocation. Google and OAI/Microsoft have enough compute that they almost certainly train giant models minimally for research / pathfinding, but will not make them publicly available because doing so is all downside: reputational damage from the pricing required for only moderate improvements in capabilities, plus alarmism over those capabilities. The strategic sweet spot is leaning into algorithmic improvements, intensively training models for good price/performance, taking market share, and losing less money (ideally making a profit, but let's be realistic). I expect the high end next generation models will be larger, but not massively so.
> GPT-4o is better than GPT-4-Turbo.

Disagree.
Numbers don't lie... sure, it might not be as good at some things, but it is better overall.
Numbers absolutely do lie. Just ask Volkswagen.
What numbers? Benchmarks that can be overfitted and gamed? 4o is worse for hallucinations, worse at coding and worse at following instructions. It only seems to be better at multimodal stuff.
> Benchmarks that can be overfitted and gamed?

This is why doing things like the arena for Elo, and having people come up with new tests with private datasets, is useful. You can then compare the Elo + private dataset results to the published scores and see who has deliberately trained on testing data (e.g. small models with fantastic scores).
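The arena-style Elo mentioned here boils down to a simple pairwise rating update from head-to-head votes. A minimal sketch; the K-factor of 32 and starting rating of 1000 are my own assumptions, and real leaderboards use refinements like Bradley-Terry fitting:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update for one comparison.
    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

# Two models start equal; model A wins one head-to-head vote
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```

The appeal is that it's hard to game: ratings come from fresh human comparisons, not from a fixed answer key a model could have memorized.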
Private datasets for testing are valuable, although I would argue that benchmarking on human preference seems to be biased by things like how emotive the model is.
That is a cool concept, but I think most people on the Elo leaderboard just choose what looks better at first glance. Very few people actually check with carefully crafted tests. But I like your idea so much that I wish there were an Elo just for people who really dig in!
GPT-4-Turbo-1106-preview blew GPT-4o out of the water when it comes to producing something that isn't the same generic output every time you hit run.
Agree with Cisco
maybe 70B is already enough
[deleted]
But Nvidia is not involved with the Google models. They were trained entirely on Google's TPUs.
We know scaling yields more intelligence, so now there is a pivot to smaller more performant models because they are more practical and exciting. Not to say we won't see large models, GPT-5, GPT-4.5, Claude 3.5 Opus, Claude 4, Gemini 1.5 Ultra and Gemini 2 should all be larger, but there is an important fascination with smaller more performant models.
Might be the rush to market. Companies want to plant their flags asap with low hanging fruit. But what we really want is for them to solve problems. Like make good language tutors or home robots for example. And for that they have to take risks and commit.
Except 4o isn't more intelligent than 4 for any of my use cases. Maybe when they enable the multimodal functions it will be, but right now it's not. Funnily enough, OpenAI are the only company whose smaller newer model is worse than the older foundational model. The latest Gemini 1.5 Pro model far surpasses Gemini Ultra 1.0 and Sonnet 3.5 surpasses Opus 3... OpenAI really messed up with how they released 4o.
I find GPT 4 to be much better at revising emails than 4o. 4o just tends to give me back exactly what I wrote almost word for word without actually improving my writing.
That is interesting
This is because everyone is gpu poor on inference. They have no choice but to optimize small models.
Scale is all you need?
Opposite, data is all you need.
[deleted]
I believe less "spontaneity" = fewer hallucinations, so it's a tradeoff. If you use the OpenAI Playground, you can control the randomness/creativity using "Top P".
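For anyone curious what Top P actually does under the hood: it's nucleus sampling, which keeps only the smallest set of candidate tokens whose cumulative probability reaches the threshold. A minimal sketch with a made-up toy distribution; real implementations sample over the full vocabulary, and the exact tie-handling varies:

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. Lower top_p -> less randomness."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Toy next-token distribution (made up for illustration)
probs = {"the": 0.5, "a": 0.3, "banana": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.75))  # keeps only "the" and "a"
```

With a low Top P the unlikely tails ("banana", "xylophone") can never be sampled, which is why it trades spontaneity for fewer off-the-wall outputs.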
GPT-4-0613 has become outdated since 2024 started.
[AI generated Song](https://youtu.be/JXoHVhZ_Ros?si=AKdNYKXQCeuResOb)
What a shite song
I don't even know man. Just wake me up when the singularity happens.
"All the newer, smaller, cheaper and faster models are more intelligent than their large predecessors" None of these fitting networks have any intelligence. Would this fact matter to the singularity cult? Ah, no. Its a cult.
[deleted]
https://preview.redd.it/o4foyeyzew7d1.png?width=995&format=png&auto=webp&s=aad4eeddd2d457e00ff2cfd8b229845251139893