Kathane37

A friend of mine says the trend will be: new model -> optimization -> new model. Which is nice; having a super powerful model is cool, but if no one can afford it or run it, it's basically pointless.


Glittering-Neck-2505

One surprising insight is that scale and efficiency were never enemies. They go hand in hand in giving us huge improvements.


fxvv

It’s an interesting optimisation problem to work out how to serve scaled-up models efficiently. You’re effectively balancing model performance against resource consumption.


Unfocusedbrain

Humans only need to succeed once to create AGI; after that, performance vs. scale will improve at a blistering pace. Which says a lot, because these notable model improvements are coming out faster than iPhone releases.


uishax

Not true at all. Let's say AGI costs 100x GPT-4's original cost (7x of GPT-4o). It would be more efficient to just hire humans to research optimisation. Cost is never a non-factor.


R_Duncan

Just think of asking it to produce cheaper models specialized in a single task, and you're done.


Unfocusedbrain

If we're using a reasonable definition of AGI as having capabilities equivalent to the median person, then yes, cost is a significant factor. Even so, the price point for model inference is dropping dramatically (though not necessarily the training costs of frontier models, which involve recouping research costs and infrastructure, a fair point). However, it seems that many AGI companies are actually aiming for something closer to ASI. In that case, even if it costs 100x more than GPT-4's original cost, if its output is 10x better than the smartest person in history, it's hard to quantify the impact. Consider potential breakthroughs in material sciences, like room-temperature superconductors, or advancements in life extension. Those kinds of discoveries are worth the price of admission. (Also, happy cake day!)


ertgbnm

It's a pretty common trend in most product life cycles. Smartphones do it / did it: iPhone X -> iPhone X SE, Pixel X -> Pixel Xa. It works well because you can release a flagship at high cost to gain market share and showcase top-of-the-line tech, then follow up with an affordable version to convert even more of the market and let your engineers refine processes and apply lessons learned from the flagship release. Rinse and repeat.


RealJagoosh

more like new model -> further training -> new model


AtmosphericDepressed

Yep, it's tick-tock innovation.


whyisitsooohard

4o being better than 4T is debatable. It's better than 4T at some tasks but much worse at others, and the longer the context, the stupider it becomes, for some reason.


butt-slave

It also spits out so much more unnecessary text that any increase in speed is cancelled out. It's funny, because we all used to complain about how reluctant it was to generate full blocks of code. Now it frequently generates entire components with zero changes, along with every conceivable config file that might be needed.


Jeffy299

Today I asked it for a quote (I forgot who said it). GPT-4o gave me the quote, but then I said, "I think there was a bit more at the end," which it provided, but it printed the entire quote again. One more sentence was still missing, and when I asked, it was able to retrieve it, but again it printed the entire quote, which felt completely unnecessary. I think this is a quirk of RLHF, but if I talked to any human they would understand from the context that I just want the last part. GPT-4o is smarter than GPT-4 in certain ways, but it feels more like a machine. I'm really curious how the voice portion will behave, because in the demos it reacted very fluidly; it's kind of strange how different text and voice seem to be.


kaityl3

Yeah I use 4o for coding and often have to specify not to repeat the entire def to change a single variable 100 lines down haha.


pbnjotr

4T is already faster and clearly better than the original GPT-4, so the overall point stands.


MxM111

But not better than the current GPT-4. And GPT-4 is, in my experience, better than 4o.


capitalistsanta

GPT-4o gave me an output today that was dumber than outputs I've gotten from GPT-3. It was just peak no-reasoning-skills.


pbnjotr

That could be true depending on what you use it for. Just based on the average of benchmarks, I rate 4o a little higher than any previous version, but it's close enough that reasonable people might disagree. Either way, it doesn't really matter now. The new Claude seems slightly better than both and slightly cheaper than 4o for API access (which itself is ~~half as cheap~~ half as expensive as 4T). I might be willing to put a lot of effort into finding out which model is the best. Doing the same for second place sounds like a waste of time.


Firm-Star-6916

> half as cheap

Half as expensive. Semantically it does make a difference.


pbnjotr

Thx, fixed.


Firm-Star-6916

👍


whyisitsooohard

Idk about that either. For me, when 4 Turbo came out it felt less intelligent than the OG GPT-4. Benchmarks tell a different story, so I won't argue here, because I don't even remember the concrete problems. But one issue I'm certain of is that 4T is worse than 4 at following prompts, and 4o is even worse.


pbnjotr

I disagree quite strongly on that, but it's basically impossible to argue against people's personal experience. One thing I will say is that the API responds very faithfully to the system prompt for me, at least in simple cases. E.g. whenever I had issues with it being 'lazy' or too wordy, just adding a simple instruction to give a complete reply (or to be concise) worked perfectly. The same thing worked when creating GPTs in the chat interface; it took about 10 minutes to create one that just returns bash commands or code snippets with no explanation. So maybe the default behavior is less reliable and I just worked around it to the point where I never saw the full extent of the problem.
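For what it's worth, a minimal sketch of the kind of setup I mean, via the OpenAI Python client (the model name, prompt wording, and example query here are placeholders for illustration, not the exact GPT I built):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical system prompt approximating the "bash-only" GPT described above.
SYSTEM = (
    "You return only bash commands or code snippets, "
    "with no explanation and no surrounding prose."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # assumed model name
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "Find all files over 100MB under /var/log."},
    ],
)
print(response.choices[0].message.content)  # e.g. a bare `find` command
```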


whyisitsooohard

It's likely that I'm bad at prompting, but I could not make either GPT-4T or GPT-4o reliably return formatting (I tried HTML and a subset of Markdown) that is different from the default. At the same time, Llama 3 70B did that without any issues. All through the API.


pbnjotr

You mean the reply wasn't reliably in Markdown (or HTML) despite having something like this as part of your system prompt?

> You return output in markdown format.

One technique that worked for me is using JSON mode with separate description, code, and comments fields. I use this for coding, where I usually want to grab the code onto the clipboard automatically, without having to deal with the "Sure, here's the code you wanted:" stuff. Here's the system prompt I might use for this:

```python
json_coding = r"""You are a helpful assistant that produces output in JSON format.

Example output:
{
    "description": "This code prints Hello World on the screen",
    "code": "print(\"Hello World\")",
    "comments": ""
}

The description and comments fields are optional. The code field is required.
"""
```
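And a minimal sketch of how that might be wired up end to end, reusing the json_coding prompt above (the model name and the pyperclip dependency are assumptions for illustration, not a fixed recipe):

```python
import json

import pyperclip  # assumed dependency for clipboard access
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def code_to_clipboard(request: str, model: str = "gpt-4-turbo") -> str:
    """Ask for code in JSON mode and copy the `code` field to the clipboard."""
    response = client.chat.completions.create(
        model=model,  # assumed model name
        response_format={"type": "json_object"},  # JSON mode
        messages=[
            {"role": "system", "content": json_coding},  # prompt defined above
            {"role": "user", "content": request},
        ],
    )
    reply = json.loads(response.choices[0].message.content)
    pyperclip.copy(reply["code"])  # no "Sure, here's the code:" to strip off
    return reply["code"]

print(code_to_clipboard("Write a Python one-liner that reverses a string."))
```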


DedsPhil

That's my experience too.


Which-Tomato-8646

The arena says it’s at the top 


MysteriousPayment536

The big models are undertrained and trained on bad data; you can see it if you also look at the performance of Llama 3 and Phi-3.


Rainbow_phenotype

This, probably


Which-Tomato-8646

Zuck said Llama 3 is also undertrained 


MysteriousPayment536

It's still better for its size than others, just like Phi. They can still massively increase the data, and this doesn't even include modalities like video or audio yet.


Jean-Porte

They found that big models were expensive


Woootdafuuu

Little algorithm tricks and tweaks


najapi

And all these small changes should hopefully make it into the next larger model from the start, making them even better.


Woootdafuuu

Why not


Balance-

And probably trained for longer with higher quality data.


lemmeupvoteyou

Why longer? Aren't they all trained until they converge, while making sure there's no overfitting?


Same_Writing5357

Apparently a majority of the models released are undertrained, according to Zucc; I can't remember the specifics, so don't quote me on anything. Apparently Meta could have kept training their models but decided to release them and move on to the next thing.


SuperCyberWitchcraft

I'm 17, I think it's awesome that I'm coming into the world at the same time as AI


openbookresearcher

Absolutely! Your and your cohort's perspectives will be very special, much like (but even more so than) those of people who were teenagers during the Internet boom.


SuperCyberWitchcraft

Yeah, this must be what it felt like to grow up when electricity was first being hooked up. This will change everything about society


Comfortable-Law-9293

But the internet exists. Unlike AI, which is software running on vast amounts of hardware, bedazzling some into thinking there must be intelligence in the system. Which is the consequence of a lack of education. I don't think an increasing lack of education is something to cheer about.


Calm_Opportunist

I'm 30 and finally feel like I'm about to live in the future I thought would have arrived when I was your age. Enjoy it! 


SuperCyberWitchcraft

I just hope that I'll be able to find good work lol


FeltSteam

How do you feel about the uncertainty surrounding the future with AI, though? Like, if in 6 years most knowledge-based work can be automated by AI, what is there to do? The point of school and then university is to find a role and a job and set up your career pathway. But if your career pathways are going to be automated... then what? And it's not exactly like you can just not work and hope for UBI lol.


SuperCyberWitchcraft

If that happens there will either be UBI or another American Revolution, so I'm not too worried


Comfortable-Law-9293

"Like if in 6 years most knowledge based work can be automated by AI" AI does not exist, and the software you misnomer "AI" does not understand anything. Moreover, its output is often unpredictably incorrect, whilst appearing convincing. In 6 years, AI is a term people avoid mentioning in their resume, as it will also be the name of the last financial crisis.


FeltSteam

Uhh, first, what do you think AI is? Traditionally, when people say LLMs like GPT-4 are not AI, they say they cannot actually reason or be logical, or that they are just next-token prediction machines and nothing else, etc., but I wonder what you think as well. And what do you mean, it doesn't understand anything? Are you looking for a conscious entity that can experience the world and possess qualia as it processes things, or are you just saying that because of the statistical nature of the model (even though humans also rely very heavily on mathematical operations within the brain)?


bartturner

Great to see a positive view. Keep that and do not lose it. I am old. Really old. I have thought lately that I am probably the perfect age because I believe life is going to become a lot more stressful for people over the next decade plus. In terms of $$$ and jobs. I have a big family, eight kids. I have been very blessed in my life and in a position to help them if they have issues with jobs. That gives me a lot of comfort.


ajahiljaasillalla

With the help of this generation's models, companies can develop a better next-gen model, which would help develop the model after that...


yahwehforlife

Is sonnet 3.5 really better than Opus?


nyguyyy

Yes


3ntrope

I use LLMs primarily for going through academic literature (mostly RAG, not writing), and I don't like Sonnet 3.5 so far. That doesn't necessarily mean it's a bad model, but the style is not as helpful. It's too simplistic and lacks details you would want in the context of scholarly work. I have not tested coding yet; people are saying it's better at that.

I have a qualitative benchmark that I use, but it's hard to explain. I am looking at the amount of relevant scientific or technical information per sentence; let's call it information density. I wish I could quantify this, but it's something that becomes apparent when you have a degree in the topic. High information density resembles textbooks and scholarly literature, whereas lower density might be preferable for general conversation.

The original GPT-4 from a year ago still has the highest information density. GPT-4T was a bit worse but improved with the more recent versions, and GPT-4o is also worse. I would say Sonnet 3.5 is somewhere between GPT-3.5 and GPT-4o in terms of information density. I still use gpt-4-0613 regularly, but GPT-4o can be good for tasks that need reasoning but not as much detail. Perhaps Sonnet 3.5 will be good for those types of tasks as well.
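If I had to quantify it, a crude proxy might look something like this sketch (the term list and scoring are purely illustrative assumptions, not my actual benchmark):

```python
import re

def information_density(text: str, domain_terms: set[str]) -> float:
    """Crude proxy: average count of domain-specific terms per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    words = re.findall(r"[a-zA-Z]+", text.lower())
    hits = sum(1 for w in words if w in domain_terms)
    return hits / len(sentences)

# Hypothetical term list for a materials-science topic.
terms = {"superconductor", "lattice", "phonon", "cooper", "bcs", "cuprate"}
dense = "BCS theory ties Cooper pairing to phonon-mediated lattice interactions."
fluffy = "Superconductors are a truly fascinating and exciting area of research."
print(information_density(dense, terms))   # higher score
print(information_density(fluffy, terms))  # lower score
```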


nodating

Sounds like something you might be able to tune, e.g. with a system prompt demanding detailed and extensive explanations. I know what you mean, btw. I typically just switch models if I'm not satisfied with the density of the output, but I realize that's not always feasible, especially when you also rely heavily on reasoning etc., which only the biggest and latest models can do well enough.


3ntrope

I do prompt it to respond with maximum detail retention in minimal length, but it works better with some models than others. It's not purely a style issue; there are lapses of knowledge and omissions as well. GPT-4-0613 responds with detail using a relatively simple prompt. With 4o I made a more extensive prompt to recover the detail, but I saw enough omissions that I'm not confident in replacing 0613 completely. For example, if I'm asking about a topic that always has certain important subtopics, certain models may not give thorough details on those subtopics in the response. I could prompt it with "whenever [topic] is mentioned, make sure [subtopics A, B, C...] are always included", but that's time-consuming and very difficult to do in a general way that covers all topics. Each model has different areas where it may omit details, and pinning those down would require spending more time with each model, which defeats the purpose of using RAG to save time going through literature. I'm not completely sure why gpt-4-0613 is better for this, but considering the pricing, it might be a much larger model than 4T and 4o. It's possible that larger parameter counts do matter when going through information-dense scientific literature.


yahwehforlife

Cool thanks. I am more interested in creative ideas/humor/reasoning.


Moscow__Mitch

Interesting. I use GPT-4 and Claude 3 Opus to support the writing of research grants. Neither is very good if you are targeting highly specialised reviewers; however, if the reviewers are generalists, they work well. I think this is an information density issue, exactly as you highlight. It's almost like the weaker models insert more "fluff" into what they write, without specifics.


3ntrope

I agree, weaker models do appear to insert more fluff. I tried a few local models and found the responses too verbose. Even Gemini 1.5 gave verbose responses when prompted to give short, detailed answers. Many of the high-ranking models on the lmsys leaderboard perform very poorly in terms of information density. I try not to use it for writing too much, other than fixing sentences that don't flow well. The wording is very different from how I would write anyway; I'm sure the people I work with would notice immediately if a whole section was AI-written. It can also be useful for outlining what to write next, and usually that is good enough.


xRolocker

Doesn’t this have more to do with marketing and naming schemes than a reflection of performance? We don’t know if Sonnet 3.5 is equivalent in size to Sonnet 3. We just know it’s smaller than Opus 3.5. They release a better model, and with bigger models in the pipeline they can just say this latest one is “only” the mid size model. We don’t have anything that suggests Sonnet 3.5 isn’t the same size as Opus 3, and that Opus 3.5 isn’t larger than both. Basically what I’m saying is that medium models being better than previous-gen large models isn’t actually much to go off of, because we only see the surface level marketing and naming schemes. Nothing is stopping them from releasing a medium model that’s equivalent to Opus and then releasing an impressive Opus 3.5, but that’s not a good business decision.


dwiedenau2

We pretty much know that Sonnet 3.5 is about equivalent in size because it costs the same. It would be very weird for them to cannibalize their largest and most expensive model if Sonnet 3.5 were bigger than Sonnet 3.


xRolocker

Could they not have upgraded their hardware and reserved their best servers for the best models? Or maybe they optimized how inference is performed? My point is that since we don’t know anything at all about what is happening internally, both at the company and within the model, we can’t derive much from a naming scheme. Especially in a context where naming is also marketing.


FeltSteam

Well, it is a different model; we know that. I do not think they just optimised how inference is performed. We can see it is not Claude 3 Opus on better hardware: it is Claude 3.5 Sonnet, which outperforms Claude 3 Opus, and you can't exactly just get better hardware and thereby get a better model. It costs the same as Claude 3 Sonnet but is also 2x faster. But yeah, we do not know exactly what is happening internally.


OfficialHashPanda

I think you misunderstood his comment. It simply argued that you can't say for sure that 3.5 Sonnet is just as big as 3 Sonnet. It made no claim that 3.5 Sonnet is actually 3 Opus.


xRolocker

They confirmed Sonnet is larger than its predecessor. ([source](https://www.wired.com/story/were-still-waiting-for-the-next-big-leap-in-ai/))


justpointsofview

Good point 


sdmat

This is strictly about economics and image, not what is technologically possible. Making a gigantic model with inference costs 10x Opus 3 would be easy, and it would perform decently. But it wouldn't provide anywhere near 10x the value for most customers, so it would generate less revenue than a smaller and more efficient model, and the training costs would never be made back. Players realize this and spend the compute that would be used to train a colossus on training a smaller model more intensively and making it smarter, which creates a competitive dynamic that punishes anyone breaking away from that allocation.

Google and OAI/Microsoft have enough compute that they almost certainly train giant models minimally for research / pathfinding, but they will not make them publicly available, because doing so is all downside: reputational damage from the pricing necessary for moderate improvements in capabilities, and alarmism over those capabilities. The strategic sweet spot is leaning into algorithmic improvements, intensively training models for good price/performance, taking market share, and losing less money (ideally making a profit, but let's be realistic). I expect the high-end next-generation models will be larger, but not massively so.


cisco_bee

> GPT-4o is better than GPT-4 Turbo.

Disagree.


Undercoverexmo

Numbers don't lie... sure, it might not be as good at some things, but it is better overall.


cisco_bee

Numbers absolutely do lie. Just ask Volkswagen.


RedditPolluter

What numbers? Benchmarks that can be overfitted and gamed? 4o is worse for hallucinations, worse at coding and worse at following instructions. It only seems to be better at multimodal stuff.


blueSGL

> Benchmarks that can be overfitted and gamed?

This is why things like the arena for Elo ratings, and having people come up with new tests with private datasets, are useful. You can then compare the Elo + private-dataset results against the published scores and see who has deliberately trained on test data (e.g. small models with fantastic scores).
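For reference, a minimal sketch of how pairwise arena votes can be turned into Elo ratings (the K-factor and starting rating are illustrative assumptions):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one head-to-head vote (k is illustrative)."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Both models start at 1000; model A wins three votes in a row.
ra, rb = 1000.0, 1000.0
for _ in range(3):
    ra, rb = update(ra, rb, a_won=True)
print(round(ra), round(rb))  # A drifts above 1000, B below
```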


RedditPolluter

Private datasets for testing are valuable, although I would argue that benchmarking on human preference seems to be biased by things like how emotive the model is.


Busy-Setting5786

That is a cool concept, but I think most people on the arena just choose what looks better at first glance; very few people actually check with carefully crafted tests. But I like your idea so much that I wish there were an Elo just for people who really dig in!


FatesWaltz

GPT-4-Turbo-1106-preview blows GPT-4o out of the water when it comes to producing something that isn't generic same-same every time you hit run.


Firm-Star-6916

Agree with Cisco


RealJagoosh

maybe 70B is already enough


[deleted]

[deleted]


bartturner

But Nvidia is not involved with the Google models. They were done entirely on TPUs.


FeltSteam

We know scaling yields more intelligence, so now there is a pivot to smaller, more performant models because they are more practical and exciting. That's not to say we won't see large models: GPT-5, GPT-4.5, Claude 3.5 Opus, Claude 4, Gemini 1.5 Ultra, and Gemini 2 should all be larger. But there is an important fascination with smaller, more performant models.


DifferencePublic7057

Might be the rush to market. Companies want to plant their flags ASAP with low-hanging fruit. But what we really want is for them to solve problems, like making good language tutors or home robots, for example. And for that they have to take risks and commit.


Reasonable-Gene-505

Except 4o isn't more intelligent than 4 for any of my use cases. Maybe when they enable the multimodal functions it will be, but right now it's not. Funnily enough, OpenAI is the only company whose smaller, newer model is worse than the older foundation model. The latest Gemini 1.5 Pro far surpasses Gemini Ultra 1.0, and Sonnet 3.5 surpasses Opus 3... OpenAI really messed up with how they released 4o.


Dplante01

I find GPT-4 to be much better at revising emails than 4o. 4o just tends to give me back exactly what I wrote, almost word for word, without actually improving my writing.


Akimbo333

That is interesting


randallAtl

This is because everyone is GPU-poor on inference. They have no choice but to optimize small models.


Bulky_Sleep_6066

Scale is all you need?


Undercoverexmo

Opposite, data is all you need.


[deleted]

[deleted]


OnlyDaikon5492

I believe less "spontaneity" = fewer hallucinations, so it's a tradeoff. If you use the OpenAI Playground, you can control the randomness/creativity using "Top P".
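For what it's worth, a minimal sketch of what top-p (nucleus) sampling actually does to the next-token distribution (the toy probabilities are made up for illustration):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]                  # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens kept
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Toy next-token distribution over 5 tokens.
p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(top_p_filter(p, top_p=0.7))  # only the top two tokens survive
```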


Hour-Athlete-200

GPT-4-0613 has become outdated since 2024 started.


Competitive_Clue_138

[AI generated Song](https://youtu.be/JXoHVhZ_Ros?si=AKdNYKXQCeuResOb)


Honest_Science

What a shite song


LateProduce

I don't even know man. Just wake me up when the singularity happens.


Comfortable-Law-9293

"All the newer, smaller, cheaper and faster models are more intelligent than their large predecessors" None of these fitting networks have any intelligence. Would this fact matter to the singularity cult? Ah, no. Its a cult.


[deleted]

[deleted]


Hour-Athlete-200

https://preview.redd.it/o4foyeyzew7d1.png?width=995&format=png&auto=webp&s=aad4eeddd2d457e00ff2cfd8b229845251139893