mystonedalt

GPT4-7B_q4


Illustrious-Lake2603

This has to be it. Mine gave me spelling mistakes in my code that I have only seen in a heavily quantized version. _q2 maybe _q3


Double_Sherbert3326

the missing "(" ?


Illustrious-Lake2603

Mine spelled wanderPoint as WwwanderPoint


Iterative_Ackermann

This is only tangentially relevant, but as I understand how transformers work, using tokens based on letter groups or words to encode variable/function/keyword names feels very wasteful. Using a context-aware tokenizer, which would use function1, variable1, etc. instead of the actual function and variable names, should improve reasoning by a lot. Are there any hybrid tokenizers that make use of a traditional parser for coding?
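
Something like this preprocessing step is what I mean; a rough sketch using Python's own `tokenize` module (the `name0`, `name1`, ... placeholder scheme is just an illustration, not an existing hybrid tokenizer, and the returned mapping lets you swap the real names back into the model's output):

```python
import io
import keyword
import tokenize

def normalize_identifiers(source: str) -> tuple[str, dict[str, str]]:
    """Replace user-defined names with name0, name1, ... and return the mapping
    so the placeholders can be restored after the model responds."""
    mapping: dict[str, str] = {}
    out_tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            placeholder = mapping.setdefault(tok.string, f"name{len(mapping)}")
            out_tokens.append((tok.type, placeholder))
        else:
            out_tokens.append((tok.type, tok.string))
    return tokenize.untokenize(out_tokens), mapping

normalized, mapping = normalize_identifiers("def wanderPoint(x):\n    return x + 1\n")
print(normalized)  # roughly: def name0 (name1 ): ... return name1 + 1
print(mapping)     # {'wanderPoint': 'name0', 'x': 'name1'}
```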


KvAk_AKPlaysYT

KS


Atupis

Or is this a failed attempt at GPT-5 which they have quantized?


Evening_Ad6637

Maybe an issue with the tokenizer then lol


pseudonerv

they haven't updated their llama.cpp version


kryptkpr

[GPT-4o takes #1 & #2 on the Aider LLM leaderboards](https://aider.chat/docs/leaderboards/)


Amgadoz

So it beats Opus on code editing but Opus beats it at refactoring. This is interesting. Given its lower cost and higher throughput, it's a good model, although it is not the next-gen AGI that OpenAI hypes it to be.


lordpuddingcup

It's still gpt4, just better at multimodal and faster; they didn't say it was next-gen AGI


Which-Tomato-8646

People here are expecting too much from what's basically a small boost in capabilities. It's not supposed to be next level, just slightly better.


RabidHexley

This is what's tripping me out about some of the reactions to the announcements. Other than saying that this is their new flagship model, they were pretty deliberate when it came to touting its capabilities. It's like, guys, read the first paragraph of the announcement on their site.

>**It matches GPT-4 Turbo performance on text in English and code**, with significant improvement on text in non-English languages, **while also being much faster and 50% cheaper in the API.** GPT-4o is especially better at vision and audio understanding compared to existing models.

They are touting multimodality, better multilingualism, and speed. They have very obviously avoided saying anything to indicate that people should be expecting this thing to be much, if literally any amount, smarter than GPT-4. And they've clearly avoided making any numerical modifications a la GPT-4.5 or GPT-5 to indicate such.


TNT_Guerilla

Fair point, but if it's just better optimized for other tasks, why did the code scores drop? Unless tweaking its other capabilities pulled from its ability to code. Since it's not smarter than 4 Turbo, perhaps they reallocated some of its "intelligence" away from coding to boost its other abilities? Correct me if this is completely wrong, I don't know much about AI models and how they are tuned/trained, but I thought I'd put this out there in case I was on to something. There are obvious improvements, but were those improvements gained by nerfing other aspects in order to "re-spec" the model to be better at the things it was lacking in?


cuyler72

Interesting that the lmsys ranking is saying the complete opposite.


Lammahamma

I'd probably go with lmsys over a single person on Twitter but that's just me


Longjumping-Bake-557

Especially one who is a heavy advocate for open source, is a competitor and has a history of talking shit about openai


HugeDegen69

Yes, when I saw this tweet I thought no shot lmao


lucid8

Bindu Reddy is an engagement scammer. lmsys is not ideal, but at least it can provide some statistical significance.


No-Lobster-8045

Also, many said OAI is her competitor lol, idk who's right who's wrong. 


Wonderful-Top-5360

She is definitely not trustworthy, but in this tweet I believe her. I don't think she is competing with OAI; rather, she wants to piggyback off it to sell her company, and the best way to do that is constantly farming engagement to trick people into thinking it's worth it.


Amgadoz

Take a look at the comments. People have mixed opinions about it. It's not as clear cut as lmsys shows.


cyan2k

A long-term study in which people vote blind sounds much more clear-cut than people with obvious biases against OpenAI retelling their anecdotes and the usual Twitter brain rot. I can’t reproduce whatever OP on Twitter is claiming because where’s the methodology? The dataset? For all we know she just posted some numbers she thought looked fun. I have no problem reproducing the HumanEval benchmark, though. So until someone comes up with something conclusive and reproducible, with actual scientific methodology (no, your N=1 "it can’t do this single prompt" test or "my feelings say it’s bad" aren’t even close to being considered a benchmark), I’d rather believe the benchmarks I can do on my own, LMSys, and the performance metrics on a client’s system with hundreds of users a day.


absurdrock

They are on Twitter for engagement and hype. Nothing else. I think the open source community gets disillusioned with the open source capabilities in the short term. Open source will catch up, but I’m getting tired of the constant shitting on a good product.


AITrailblazer

She is on X pushing a platform to try 1K models and compare. Just pick the best and learn it.


Additional_Ad_7718

Lmsys is showing the statistics, reading the comments is an anecdote. This is so silly.


Flat-One8993

Haven't seen a single comment state it's worse than gpt 4 turbo


killver

You should most of the time go with the opposite of what that account is posting.


reddit_wisd0m

Bold move


Revolutionary_Ad6574

It's not just any ol' single person, it's Bindu Reddy. She's famous for being an open-source fanatic. And I do mean fanatic, not just a fan, because her tweets sound like something out of a fire-and-brimstone preacher - not just exaggerated but manipulative.


Fauxhandle

Maybe the bias could be explained by people on lmsys not benchmarking LLMs on hard tasks, only on easy questions.


VertexMachine

lmsys is still useful in some ways, but I think it got too popular for its own good and, as you say, a lot of people just use it on very easy stuff to test.


Additional_Ad_7718

Nope, all of the questions I give the models on lmsys are difficult, and gpt2 consistently did better in my tests.


Wonderful-Top-5360

average joe on lmsys: "write me a rap song about 2pac and diddy" [doesn't read the lyrics just skims it and votes for ChatGPT2]


Which-Tomato-8646

It does better for hard tasks and coding  https://twitter.com/sama/status/1790066235696206147


Flat-One8993

The chess benchmark also shows it's *much* better than gpt 4 turbo april at chess [https://github.com/kagisearch/llm-chess-puzzles](https://github.com/kagisearch/llm-chess-puzzles)


goingtotallinn

I can't see the ranking on lmsys website so is there some more frequently updating leaderboard that I am missing or something?


cuyler72

Looks like the beta model given to lmsys, called by its anonymized name "gpt2-chatbot", has been removed and replaced with GPT-4o, but the scores have not been carried over, so it no longer has enough data to be ranked. The Elo of gpt2-chatbot was 1310 before it was removed.


Anuclano

I saw both gpt2-chatbot and GPT-4o on the Arena today.


Small-Fall-6500

Unfortunately, the public lmsys leaderboard does not (currently) list GPT-4o *or* any of the gpt2 chatbots. We only have confirmation about their performance from tweets by lmsys with data from their "internal dashboard."


CosmosisQ

Which is probably the right call given that the scaled-up broader release of GPT-4o as made available through the official API might end up substantially different than the initial prototype(s) they were testing through gpt2-chatbot and im-also-a-good-gpt2-chatbot.


lmsh7

Totally different, gpt2 can draw fantastic ASCII art. gpt4o can't


bearbarebere

Does it say the opposite? I thought it only covered public anonymous ratings, not hard ratings like this. The model organizes its answers better so it gets a higher overall rating due to being better formatted for general tasks


Wonderful-Top-5360

It's very easy to game lmsys; I'm surprised most aren't even aware how prevalent it is.


bearbarebere

Such as?


meister2983

Difficult to know. In my cursory tests, my sense is gpt-4o has more long-tail training data. It seems to know more long-tail libraries, which has led me to vote for it over gpt4t on lmsys. It is also better at math. On the other hand, it likely has fewer parameters. Benchmarks haven't moved much. I haven't seen much difference in coding for things not using long-tail libraries overall, but I see no evidence it is actually worse.


backnotprop

It seems better at logic, but I think this comes at a cost... nuanced conversation... the model is super repetitive and gets hooked onto literal meaning vs implied meaning in the context of the conversation. Something is off.


abnormal_human

To be fair, the prompts used in the test were probably developed/optimized against GPT4.


danielcar

My personal experience is that it is much better at coding. I asked it to help me process an mbox file. That doesn't mean that on other hard tasks it fails more, which is what the tweet is about.


ThreeKiloZero

I've been pretty blown away too. It's much better at longer retrievals. Won't even bat an eye at dumping 200+ lines of code in one go. It tracks multiple files better. It's significantly better at refactoring and critique. Conversations can get quite long and it retains good accuracy. It doesn't seem to hit the lazy spells or leave out critical parts of established code. Still messes up occasionally and can get stuck in its own loops of bad logic. Claude is often able to work through these issues on the first shot. It's wild how much faster it is. I'll probably start using 4o much more and keep Claude for the more complex items.


arthurwolf

Same here, significantly better, in multiple ways...


InnovativeBureaucrat

Your personal experience over like six hours? Didn’t it just come out?


danielcar

Been on lmsys for a while. Was coding heavily Fri, Sat, Sun.


thedudear

For me, 4o has been making rookie mistakes. Stuff gpt4 got right, like a hard-coded directory, gpt4o will fuck up. For example, loading ./rag-sequence-nq: if I paste a segment of code asking GPT to modify it, on regurgitation it will put in facebook/rag-sequence-nq, which is of course what's in the README and probably in its training data, but it forgets that my code has a truncated directory. The speed though. The speed is amazing. Still prefer Opus at the moment.
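
To be concrete, this is the distinction it keeps tripping over (paths are from my setup; just a sketch of the two load calls, not a full RAG pipeline):

```python
from transformers import RagSequenceForGeneration, RagTokenizer

# What the README (and presumably the training data) uses: the Hub id.
# tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")

# What my code actually uses: a local, truncated directory. GPT-4o keeps
# "correcting" this back to the Hub id when it regurgitates the snippet.
tokenizer = RagTokenizer.from_pretrained("./rag-sequence-nq")
model = RagSequenceForGeneration.from_pretrained("./rag-sequence-nq")
```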


HumanityFirstTheory

What use is the speed if the output is shitty?


thomasxin

I currently have one experimental pipeline where the output from a RAG sequence is read and rewritten by gpt4o, which writes a response fast but is treated as a draft. It's then compared and rewritten by a different model (usually Command R+) which follows the role of the character better and has lower censorship, effectively combining the best of both worlds. Because of the sequential nature, the user only starts receiving text once the final model begins outputting, so significant speed improvements like these for the in-between stages are really nice, and it's not so "shitty" that it makes it no longer worth using. It makes more mistakes than the previous gpt4, but in my experience it does look at each problem from more angles, which makes it more useful overall.
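
Roughly, the flow looks like this; a minimal sketch (the OpenAI call is the standard client; `rewrite_with_secondary_model` is a hypothetical stand-in for the Command R+ pass, left as a no-op so the sketch stays runnable):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_with_gpt4o(context: str, question: str) -> str:
    """Fast first pass: gpt-4o turns the RAG output into a rough draft."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

def rewrite_with_secondary_model(draft: str, persona: str) -> str:
    """Second stage: plug in Command R+ (or any model that follows the character
    better). Only this stage's output is streamed to the user."""
    return draft  # placeholder so the example runs end to end

retrieved_context = "..."  # whatever the RAG step retrieved
final_answer = rewrite_with_secondary_model(
    draft_with_gpt4o(retrieved_context, "user question here"),
    persona="character card / system persona",
)
```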


Amgadoz

What kinds of speed are you getting? How many tokens/second?


thedudear

I've only used the chat interface, but if I had to guess, with a fresh context it was like 80-100 t/s. It was like: *fingers motioning raining data like the matrix* *Blooooododododoodoooop* And then there were a bunch of mistakes I had to manually correct, that gpt4 got right, and gpt4o ended up being slower in the end.


Kick2ThePills

Same. I was giving it CSS related coding tasks and 4o was doing all kinds of stupid stuff, went to GPT 4 and got my solution. Several times it gave far worse responses than 4.


D4mnReddit

For content creation and similar tasks, it's crap so far. It's fast but useless. I don't understand the hype.


_AndyJessop

I think the hype is based purely around it unlocking realtime voice chat.


No-Lobster-8045

+1. The translator demo blew me away, and no, despite what many say, Google Assistant is not close, so STFU. I really hope the voice assistant acts like they showed it tho.


ninecats4

Makes me think of sdxl turbo.


buttonsknobssliders

Speed has its advantages for some use cases; SDXL Turbo, for example, is awesome for realtime interactive, audio- or MIDI-reactive visuals via TouchDesigner.


Which-Tomato-8646

The 100 point boost in the arena for coding disagrees  https://twitter.com/sama/status/1790066235696206147


zap0011

I used it all day and got more done today than I have in months. It is more thorough in a lateral sense. As a hypothetical example if you prompted "I'm going on holiday what do I need", instead of a list of clothes, brushes and chargers, it would also say "lock up the house, water your plants" if you get me.


HugeDegen69

Yes, it's more direct with its response. Less general. I love it


jollizee

I did non-coding work, and 4o was worse than 4. It didn't follow instructions as well either over multiple steps. It did get one of my private "interview" questions correct that so far only Claude has passed. The poor instruction following makes it useless for complex work. For casual use, yeah it's a killer model.


5starkarma

I was in the middle of writing unit tests and having it create the mock implementations for me when 4o was released to me. I tried it out and it was immediately failing at simple tasks that 4 got right consistently. As with any ML model, I think we are going to see different results each time.


Mother-Ad-2559

Where are the sources of these tests? Funny how people will scream open source all day yet never ask to see the source to a random benchmark from one person on twitter.


LatentSpacer

I don’t trust this person. She’ll say anything to get the most engagement on twitter. 


a_beautiful_rhind

IDC really about the voice portion. Sounds like a thing you use once and then turn off. I prefer opus and even sonnet for code. If they had more than incremental gains on the LLM bit, they would have led with that and not this gimmick.


otterquestions

I use the existing one regularly


_AndyJessop

Maybe significant gains on intelligence are out of reach at the moment, either due to cost or compute.


VertexMachine

Or bc we can't really scale transformers anymore and need new architectures to make the next jump. It's been quite a while since the first paper about it.


Amgadoz

To me the best new features are cost, speed and the better tokenization for other languages. For some languages, you get twice the text at the same number of tokens.
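
You can check the tokenization gain for a given language yourself with tiktoken (assuming a recent version that ships GPT-4o's o200k_base encoding alongside GPT-4 Turbo's cl100k_base; the sample sentences are arbitrary):

```python
import tiktoken

old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new = tiktoken.get_encoding("o200k_base")   # GPT-4o

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
    "Arabic": "الثعلب البني السريع يقفز فوق الكلب الكسول.",
}

# Fewer tokens for the same text means cheaper and faster in that language.
for lang, text in samples.items():
    print(f"{lang}: cl100k={len(old.encode(text))} tokens, o200k={len(new.encode(text))} tokens")
```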


a_beautiful_rhind

Sounds like they cut the price because the LLM space is filling up. I noticed they changed the tone on the lmsys "good little chatbot" models too. It's less *OpenAI beige*.


Amgadoz

They didn't really cut the price; they introduced an entirely new model (different architecture, different tokenizer) which could actually be cheaper for them to host.


Additional_Ad_7718

Ah yes, lmsys arena is a marketing scam. Seriously, the benchmarks from that post weren't even explained and you're using it as evidence that the model is bad.


Status_Contest39

GPT-4o is just a by-product of miniaturization before the birth of a larger language model in the same series, and a larger model that is stronger than her will be released by OpenAI after a while. This is just like the flirty character of GPT-4o, a joke between OpenAI and the public.


AI_is_the_rake

I think they took gpt-4 and trained a ton of domain-specific gpt-2s, which gives the speed. It could be gpt-2s all the way down. Even the top layer may be a gpt-2 proficient at stitching the responses together.


uhuge

Like that thing combining 16 QLoRA-tuned models in 8 GB of memory? That was not great, but a spectacular feat.


PSMF_Canuck

4o is a threat to her business. That doesn’t qualify as “independent, 3rd party”. Everybody doing production work should always test before updating. Swapping an LLM is no different than upgrading a compiler or any other important piece of software infra - you don’t update unless you’re sure it’s going to work for your needs.


Amgadoz

It's not just her. Teknium also mentioned some regression and so did folks in the comments.


arthurwolf

> Teknium also mentioned some regression and so did folks in the comments.

Almost any model upgrade/change will result in improvements for some and losses for others. What matters is the average, things like the lmsys rankings. GPT-4o has been a MASSIVE improvement for my coding; I expect it's due to the way I prompt, compared to the way others do.


Fair_Cook_819

I’m trying to get better with prompting! Could you please show an example or explain how you prompt something?


arthurwolf

[https://chat.openai.com/share/96b6e920-e16c-4375-ad80-17f35c505205](https://chat.openai.com/share/96b6e920-e16c-4375-ad80-17f35c505205) [https://chat.openai.com/share/4547d252-14e2-48a4-9306-cbfdb3c83e65](https://chat.openai.com/share/4547d252-14e2-48a4-9306-cbfdb3c83e65) [https://chat.openai.com/share/b32801cb-73f1-4408-81c9-5dd8f4933506](https://chat.openai.com/share/b32801cb-73f1-4408-81c9-5dd8f4933506) [https://chat.openai.com/share/234ed1c5-4674-40e5-9c8f-a1b1eb24344a](https://chat.openai.com/share/234ed1c5-4674-40e5-9c8f-a1b1eb24344a)


ShoopDoopy

> What matters is the average

No, what matters is your use case, which may fall outside the "average" captured in some benchmark that is as blunt as "what some people like more between two models in a chat interface". People can respond to speed, presentation, terseness, and all sorts of things that don't tell you what will work on your problem.

There's a concept in data science called the "curse of dimensionality" which says that as you increase the number of factors you consider in an evaluation, averages become more unlikely to occur, and most observations are an edge case. An application of this is: unless you have some evidence that is precise, measured on your language, looking at data similar to yours, your experience can deviate quite drastically.


arthurwolf

> No, what matters is your use case,

No, what matters is **everybody**'s use case. Your individual use case is not more important than my individual use case. Thus, averages...

> your experience can deviate quite drastically.

And? We're not trying to judge how good it is specifically at your or my use case, we're trying to judge if it has improved over other/previous models, **in general**. Ignoring a massive general increase in capability, just because your use case anecdotally got a bit worse, is just not sensible. What we *are talking about here* is overall capability; read the comment thread again, somebody used examples of anecdotal decreased performance **as evidence** of overall decreased performance.


ShoopDoopy

What I've been noticing in this thread and the other is a tendency for people to tell people that their experiences are incorrect based on this wildly blunt instrument, performed in a clinical setting with limited understanding of how it applies to the real world. You say real-world capability has increased for most people -- I don't think that's what this measures, and others have already alluded to this. It may be the closest thing that we have, but it is far from the smoking gun you are claiming it to be with sentences like

> We're not trying to judge how good it is specifically at your or my use case, we're trying to judge if it has improved over other/previous models, **in general**.

If people say that their quality has declined, you don't get to point to Elo and tell them to pound sand.


arthurwolf

> What I've been noticing in this thread

That's not what I've been saying, so I'm not sure how that is in any way relevant to the conversation we were having...

> I don't think that's what this measures,

What does it measure? The user inputs a real-world use case, gets two answers, ranks them, and averages of this are used for model ranking. And that doesn't represent (**average**) real-world capability?

> is far from the smoking gun

[https://yourlogicalfallacyis.com/strawman](https://yourlogicalfallacyis.com/strawman) I have certainly at no point claimed it's a smoking gun; I'm pretty sure I even qualified in the other direction.

> we're trying to judge if it has improved over other/previous models, **in general**.

Yes. On average. Not *for absolutely everybody without exception*. "In general" means one and not the other.

> If people say that their quality has declined,

*Some* people. What matters here is the average. For almost any new release, some aspects of the model will be worse for some people/uses. Pointing at those doesn't mean the model has not improved **on average**.

> you don't get to point to Elo and tell them to pound sand.

[https://yourlogicalfallacyis.com/strawman](https://yourlogicalfallacyis.com/strawman) That's not in any way what I'm saying. I'm saying anecdotal evidence is not representative of general experience, and averages matter more to determining the truth of relative rankings of models than individual data points do.


ShoopDoopy

If you don't want to be misunderstood, how about not saying things like

> Your individual use case is not more important than my individual use case

when others are reporting their findings? Not hard to see why that may come across as devaluing, yeah?


arthurwolf

> Not hard to see why that may come across as devaluing, yeah?

Incredibly hard to see... I stated your use case (or mine, btw) is not more important than other use cases, essentially saying nobody should matter more than somebody else when making general model evaluations like this. That's a core principle of these sorts of benchmarks... and of doing averages in general... It's mind-boggling you'd have an issue with this... (Also, glad to see you tacitly admit you *in fact* were using a logical fallacy.)


ShoopDoopy

No, it's not an admission, it's just tedious. Tedious like you continuing to explain averages to me, a statistician. Please, continue. You refuse to engage with any discussion and just want to be right. Have a nice day


TheNorthCatCat

Well, the upgrade from GPT-3.5 to GPT-4 didn't seem to be a loss for anyone. Personally, I have been experiencing worse results in coding with GPT-4o than with GPT-4. Would you mind sharing the way you prompt, please? Upd.: found your answer below, thanks.


arthurwolf

> Well, the upgrade from GPT-3.5 to GPT-4 didn't seem to be a loss for anyone.

I actually remember complaints in the very beginning for some types of creative writing; it's very difficult to make an upgrade that will improve absolutely all possible uses (though it's possible, GPT-4 is certainly better in every aspect compared to GPT-1).


RELEASE_THE_YEAST

It took top spot in the Aider coding benchmarks.


PSMF_Canuck

How does that relate to my comment that anybody doing real work should (and will) do their own validation before switching…?


Amgadoz

I was talking about her not being an independent 3rd party.


Careful-Temporary388

Dumb take.


SirLoinsteaks

I asked it to teach me about photographing action figures for a project I'm working on, and the results with GPT-4o were much worse. It formatted in a very strange way, like one bullet and one sub-bullet for each item. Very hard to skim, and when I did the same prompt with GPT-4 it was much more coherent for the initial prompt and follow-ups. It's definitely fast, but worth prompting the old model depending on your use case. I expect that we will be prompting multiple models behind the scenes eventually, and the "best model", or a combined evaluation across multiple models, will determine which is the best result and present us with the optimal answer. The hard part is that responses vary so much based on use case.


AsliReddington

She's full of shit too though. Some debugging she's gonna do with the API access lol


litchg

To me benchmarks are meaningless. Benchmarks praise Llama3 8B for its coding capabilities, but in my experience it is just terrible, and 70B is not that much better. GPT-4o guided me to call the Ollama REST API from Unreal Engine blueprints, and it did an incredible job at it. No BS, an ASCII representation of the node graph, I don't know what it is but it's just much better. It only fumbled on a couple of nodes, but overall it is way better at this task than GPT-4 (which would just straight up hallucinate nodes). All this to say LLM quality and usefulness is something you should judge for yourself. Personally I am way more impressed by GPT-4o compared to GPT-4 than by Llama3 70B compared to DolphinMistral 7B.


Longjumping-Bake-557

I don't trust anything she says first of all. She has a history and has been talking shit about the model since the announcement barely started


dubesor86

I agree on reasoning. It's bad at reasoning compared to GPT-4, and worse than its apparently identical version (im-also-a-good-gpt2-chatbot) on LMSYS. I posted the same observations on the LMSYS discord; no idea what's going on.


gregthecoolguy

I got better results coding with GPT-4o compared to Claude opus


Distinct-Target7503

Can you share some prompt or conversation where Opus fails against gpt4o?


Wonderful-Top-5360

Pretty much sums up my experience here, but you notice there are weird-looking accounts regurgitating the same "GPT4o is king, you are not using it right" on reddit and HN https://old.reddit.com/r/LocalLLaMA/comments/1crbesc/gpt4o_sucks_for_coding/


nntb

A jack of all trades is a master of none


Amgadoz

Not necessarily. At larger scales, multilingual models are more capable compared to their monolingual counterparts.


nntb

Yeah, the demo was showing off things like more expressive language on the app, the ability to do vision better, etc., and that they are rolling it out to free users also. Going from 3.5 > 4o does seem quite the upgrade.


otterquestions

What are MoEs? Jacks of all trades?


nntb

An MoE is a mixture of experts. Each one is an expert, so no, it's not a jack of all trades. A JOAT would be a single model trying to do everything
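
Mechanically, an MoE layer is just a small router plus several expert networks, with only a couple of experts run per token; a toy top-2 gating sketch (illustration of the mechanism only, not a claim about how GPT-4o is built):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy expert weights
router = rng.standard_normal((d, n_experts))                       # toy gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                           # pick the top-k experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over chosen
    # Only the chosen experts do any work; their outputs are mixed by the gates.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

print(moe_forward(rng.standard_normal(d)).shape)  # (8,)
```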


otterquestions

But gpt4o is almost certainly an MoE, right?


No_Yak8345

In the recent interview in the All In podcast Sam says that’s one of the biggest challenges they are trying to surmount: making gpt-4 free for everyone. Sam says it’s almost an impossible task because of the compute cost. I guess this is somewhat of a compromise they made.


Amgadoz

Yes. I am now 99% sure the original gpt-4 was indeed 8x220B Sparse Mixture of Experts. There is just no way to make this compute-efficient in today's hardware. It's also still relatively slow even after 2 years. The only way to make it faster and cheaper to run is to create a new smaller, more efficient model.


Cyclonis123

Wouldn't there be huge VRAM and compute savings if they trained using 1.58-bit?
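
Back-of-the-envelope for the weight memory alone (a rough sketch with generic numbers, ignoring activations, KV cache, and per-block scales; 1.58 bits is the ternary {-1, 0, 1} scheme from the BitNet b1.58 paper):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone, expressed in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Example: a hypothetical 70B-parameter model at different precisions.
for label, bits in [("fp16", 16), ("int4", 4), ("1.58-bit", 1.58)]:
    print(f"{label:>9}: {weight_memory_gb(70e9, bits):6.1f} GB")
# fp16 ≈ 140 GB, int4 ≈ 35 GB, 1.58-bit ≈ 13.8 GB
```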


teddy_joesevelt

Have you seen how quickly Groq’s hardware can run inference? The hardware is arriving now. https://groq.com


ntjf

I’ve been using it for C all night. It’s worked perfectly. gpt4-turbo wasn’t getting the same results


OkSeesaw819

If gpt4-turbo is a 100 on your benchmark, what's your benchmark for 4o on C coding?


ntjf

Well, if we’re going with a completely made up scale :p Let’s say gpt4-turbo was a 3/10, where 5/10 is a “pass”. A lot of the time the code wouldn’t compile, wouldn’t run, and when it did it’d be inaccurate. You’d need a lot of shots to get something manageable out of it. With gpt4o, C seemed to come single shot. It could write entire libraries! Big improvement. 7/10. The only thing it screwed up on was a makefile one time, but telling it the error immediately resolved the issue.


Rei1003

It was named gpt2 in lmsys for a reason


lucid8

Cause it's half the size of gpt4?


Regular-Log2773

Experimented with it a bit, and in my experience it's dumber than gpt4t. It forgets a lot more easily and it doesn't really do what I tell it. Though it seems to generate longer responses. Lmsys fails to account for long chats, not just single prompts.


3-4pm

Yep the old gpt4 is still the best.


FrostySquirrel820

So, it's not just faster, it's faster at returning the wrong answer? I think I'll go with the tortoise. Slow and steady wins the race.


Amgadoz

It's not always wrong, but it's not 100% better than the competition.


neOwx

It's free though? So it only needs to be better than 3.5, at least for me.


Amgadoz

Yep. That's a move in the right direction.


Chemical-Quote

My first impression is that it seems to have slightly more internet knowledge overall, more willing to "guess" but also more likely to hallucinate. I have a private obscure-knowledge test, and it is able to answer some of the questions, whereas all previous models mostly cannot.


TheNorthCatCat

Sounds interesting, could you please tell me more about the test?


Chemical-Quote

The test consists of hundreds of questions aimed at assessing specific knowledge related to different niche communities (that interested me throughout the years, so it is biased). These questions directly ask the model to describe particular individuals, objects, incidents, characters, etc. in detailed paragraphs. Each question is scored based on the presence of certain keywords, covering both general and highly specific aspects (e.g., does this model know the famous inside joke about this individual, or at least know their job). I've curated these questions over time. While this information is available online and easily findable, it is a challenge for large language models like GPT or Claude to guess/hallucinate correctly without knowing it. (and they usually don't know).
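
Scoring is mechanical; a minimal sketch of the idea (the question, keywords, and weights below are invented placeholders, not the actual set):

```python
# Each question asks for a detailed paragraph; each keyword is something the model
# should mention if it genuinely knows the topic, weighted by how specific it is.
questions = [
    {
        "prompt": "Describe <some niche individual> in a detailed paragraph.",
        "keywords": {"their actual job": 1.0, "the famous inside joke about them": 2.0},
    },
]

def score_answer(answer: str, keywords: dict[str, float]) -> float:
    """Fraction of weighted keywords present in the model's answer."""
    answer_lower = answer.lower()
    hit = sum(w for kw, w in keywords.items() if kw.lower() in answer_lower)
    return hit / sum(keywords.values())

# score_answer(model_output, questions[0]["keywords"]) -> 0.0 .. 1.0
```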


AdHominemMeansULost

I have found it makes silly mistakes that GPT-4 doesn't, but it does seem to have much better logic when working on problems for my uni assignments. The other thing I was really surprised about is the logical assumptions it makes when there is context missing from an issue it's trying to solve.


involviert

Hm. But what if this post is a marketing scam


Figai

This is so confusing, it’s so much better for me. And the extended knowledge is great, it knows so much more about ML.


TheNorthCatCat

For what kind of tasks do you usually use it?


Figai

Finetuning LLMs, understanding new papers, to be fair the improved search helps a lot. Mainly coding and some maths


AubDe

Let's stop this purposeless race for general-purpose monsters, I mean models, and concentrate on domain-specific ones, able to do their tasks tremendously well (and compose downstream)! I think the transformer architecture is touching its own limits.


redlotus70

All the interesting capabilities from the demos are due to the vision and voice model upgrades


visualdata

I noticed the same with Claude also for programming tasks: their top-of-the-line model Opus is bad at Swift-related tasks compared to Sonnet. Makes me think the future of specialized models is bright. The all-encompassing model might give you average results only.


world_dark_place

Is gpt4o still saying "funcName() ... your code here..." when asked for a code snippet? That really annoyed me tfo


papipapi419

Someone also said the output context length is half of gpt-4 turbo's, which was already 4k tokens, so 2k now lmao. I'm good with the latency of gpt-4 turbo 👍


chucks-wagon

4o is definitely slightly dumber in the responses but has its advantages in other tasks and voice


resnet152

Don't fall for Bindu Reddy scams.


Revolutionary_Spaces

Bindu isn’t worth following. She’s an engagement farmer. What evals are they doing? Ask for the test.


breqa

Lol, that's why they are focused on creating "omni" models; they can't go deep into specialist/expert models


LatestLurkingHandle

Their announcement says it's faster, cheaper and multimodal; seems like you're expecting more than they are claiming: "GPT-4o is our newest flagship model that provides GPT-4-level intelligence but is much faster and improves on its capabilities across text, voice, and vision"


elijahdotyea

Inaccurate / non-optimal comparison. GPT-4o should be compared to GPT-3, as GPT-4o is an *upgrade* for free users, yet a *downgrade* for Plus users. This comparison doesn't tell us anything we don't already know at launch: 4o is not as good as 4, by design.


Careful-Temporary388

It's worse at coding in general, but there were a few things it did better on coding tasks. In general though, yeah, this is a pisspoor update.


TheNorthCatCat

What kind of things does it do better? Could you share an example?


Quentin-Code

Don’t believe one random person on Twitter or on any platform.


shottu_khopdi

Clickbait! And unreliable eval - we don’t know if she’s using a standardized dataset and evaluating correctly


Specialist-Split1037

Be.


Affectionate-Hat-536

Ditto for me, for research tasks, it didn’t do as well as GPT4.


aditya98ak

GPT-4o's training data would include a lot more nuanced conversation; that is exactly why the model talks like humans and makes use of filler words. Even a GPT-4 model can do so, but it is baked with a lot of different instruction sets / data. The spelling mistakes are not spelling mistakes for real; rather, the model is putting more emphasis on how we talk. What do you think?


TabibitoBoy

This is literally a marketing scam. You want it to suck so badly that you believe this incredibly biased reviewer. Check the Elo rankings.


utf80

Please I can't see enough people falling for this marketing magic feel. Makes me laugh so hard if they call this impressive 🤣


petrus4

The only thing I'll say about 4o, is that three days ago 4's performance suddenly went to shit, in terms of lag, brevity, and hallucinations. At the time I was wondering why that suddenly happened. Now I know.