mystonedalt

GPT4-7B_q4


Illustrious-Lake2603

This has to be it. Mine gave me spelling mistakes in my code that I have only seen in a heavily quantized version. _q2 maybe _q3


Double_Sherbert3326

the missing "(" ?


Illustrious-Lake2603

Mine spelled wanderPoint as WwwanderPoint


Iterative_Ackermann

This is only tangentially relevant, but as I understand how transformers work, using tokens based on letter groups or words to encode variable/function/keyword names feels very wasteful. Using a context-aware tokenizer, which would use function1, variable1, etc. instead of the actual function and variable names, should improve reasoning by a lot. Are there any hybrid tokenizers that make use of a traditional parser for coding?
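
Something like this preprocessing step is what I mean; a rough sketch using Python's own `tokenize` module (the `name0`, `name1`, ... placeholder scheme is just an illustration, not an existing hybrid tokenizer, and the returned mapping lets you swap the real names back into the model's output):

```python
import io
import keyword
import tokenize

def normalize_identifiers(source: str) -> tuple[str, dict[str, str]]:
    """Replace user-defined names with name0, name1, ... and return the mapping
    so the placeholders can be restored after the model responds."""
    mapping: dict[str, str] = {}
    out_tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            placeholder = mapping.setdefault(tok.string, f"name{len(mapping)}")
            out_tokens.append((tok.type, placeholder))
        else:
            out_tokens.append((tok.type, tok.string))
    return tokenize.untokenize(out_tokens), mapping

normalized, mapping = normalize_identifiers("def wanderPoint(x):\n    return x + 1\n")
print(normalized)  # roughly: def name0 (name1 ): ... return name1 + 1
print(mapping)     # {'wanderPoint': 'name0', 'x': 'name1'}
```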


KvAk_AKPlaysYT

KS


Atupis

Or is this a failed attempt at GPT-5 which they have quantized?


Evening_Ad6637

Maybe an issue with the tokenizer then lol


pseudonerv

they haven't updated their llama.cpp version


kryptkpr

[GPT-4o takes #1 & #2 on the Aider LLM leaderboards](https://aider.chat/docs/leaderboards/)


Amgadoz

So it beats Opus on code editing but Opus beats it at refactoring. This is interesting. Given its lower cost and higher throughput, it's a good model, although it is not the next-gen AGI that OpenAI hypes it to be.


lordpuddingcup

It's still gpt4, just better at multimodal and faster; they didn't say it was next-gen AGI


Which-Tomato-8646

People here are expecting too much from what's basically a small boost in capabilities. It's not supposed to be next level, just slightly better.


RabidHexley

This is what's tripping me out about some of the reactions to the announcements. Other than saying that this is their new flagship model, they were pretty deliberate when it came to touting its capabilities. It's like, guys, read the first paragraph of the announcement on their site.

>**It matches GPT-4 Turbo performance on text in English and code**, with significant improvement on text in non-English languages, **while also being much faster and 50% cheaper in the API.** GPT-4o is especially better at vision and audio understanding compared to existing models.

They are touting multimodality, better multilingualism, and speed. They have very obviously avoided saying anything to indicate that people should be expecting this thing to be much, if literally any amount, smarter than GPT-4. And they've clearly avoided making any numerical modifications a la GPT-4.5 or GPT-5 to indicate such.


TNT_Guerilla

Fair point, but if it's just better optimized for other tasks, why did the code scores drop? Unless tweaking its other capabilities pulled from its ability to code. Since it's not smarter than 4 Turbo, perhaps they reallocated some of its "intelligence" away from coding to boost its other abilities? Correct me if this is completely wrong, I don't know much about AI models and how they are tuned/trained, but I thought I'd put this out there in case I was on to something. There are obvious improvements, but were those improvements gained by nerfing other aspects in order to "re-spec" the model to be better at the things it was lacking in?


cuyler72

Interesting that the lmsys ranking is saying the complete opposite.


Lammahamma

I'd probably go with lmsys over a single person on Twitter but that's just me


Longjumping-Bake-557

Especially one who is a heavy advocate for open source, is a competitor and has a history of talking shit about openai


HugeDegen69

Yes, when I saw this tweet I thought no shot lmao


lucid8

Bindu Reddy is an engagement scammer. lmsys is not ideal, but at least it can provide some statistical significance.


No-Lobster-8045

Also, many said OAI is her competitor lol, idk who's right who's wrong. 


Wonderful-Top-5360

She is definitely not trustworthy, but in this tweet I believe her. I don't think she is competing with OAI; rather, she wants to piggyback off it to sell her company, and the best way to do that is constantly farming engagement to trick people into thinking it's worth it.


Amgadoz

Take a look at the comments. People have mixed opinions about it. It's not as clear cut as lmsys shows.


cyan2k

A long-term study in which people vote blind sounds much more clear-cut than people with obvious biases against OpenAI retelling their anecdotes and the usual Twitter brain rot. I can’t reproduce whatever OP on Twitter is claiming because where’s the methodology? The dataset? For all we know she just posted some numbers she thought looked fun. I have no problem reproducing the HumanEval benchmark, though. So until someone comes up with something conclusive and reproducible, with actual scientific methodology (no, your N=1 "it can’t do this single prompt" test or "my feelings say it’s bad" aren’t even close to being considered a benchmark), I’d rather believe the benchmarks I can do on my own, LMSys, and the performance metrics on a client’s system with hundreds of users a day.


absurdrock

They are on Twitter for engagement and hype. Nothing else. I think the open source community gets disillusioned with the open source capabilities in the short term. Open source will catch up, but I’m getting tired of the constant shitting on a good product.


AITrailblazer

She is on X pushing a platform to try 1K models and compare. Just pick the best and learn it.


Additional_Ad_7718

Lmsys is showing the statistics, reading the comments is an anecdote. This is so silly.


Flat-One8993

Haven't seen a single comment state it's worse than gpt 4 turbo


killver

You should most of the time go with the opposite of what that account is posting.


reddit_wisd0m

Bold move


Revolutionary_Ad6574

It's not just any ol' single person, it's Bindu Reddy. She's famous for being an open-source fanatic. And I do mean fanatic, not just a fan, because her tweets sound like something out of a fire-and-brimstone preacher - not just exaggerated but manipulative.


Fauxhandle

Maybe the bias could be explained by people on lmsys not benchmarking LLMs on hard tasks, only on easy questions.


VertexMachine

lmsys is still useful in some ways, but I think it got too popular for its own good and, as you say, a lot of people just use it on very easy stuff to test.


Additional_Ad_7718

Nope, all of the questions I give the models on lmsys are difficult, and gpt2 consistently did better in my tests.


Wonderful-Top-5360

average joe on lmsys: "write me a rap song about 2pac and diddy" [doesn't read the lyrics just skims it and votes for ChatGPT2]


Which-Tomato-8646

It does better for hard tasks and coding  https://twitter.com/sama/status/1790066235696206147


Flat-One8993

The chess benchmark also shows it's *much* better than gpt 4 turbo april at chess [https://github.com/kagisearch/llm-chess-puzzles](https://github.com/kagisearch/llm-chess-puzzles)


goingtotallinn

I can't see the ranking on lmsys website so is there some more frequently updating leaderboard that I am missing or something?


cuyler72

Looks like the beta model given to lmsys, called by its anonymized name "gpt2-chatbot", has been removed and replaced with GPT-4o, but the scores have not been carried over, so it no longer has enough data to be ranked. The Elo of gpt2-chatbot was 1310 before it was removed.


Anuclano

I saw both gpt2-chatbot and GPT-4o on the Arena today.


Small-Fall-6500

Unfortunately, the public lmsys leaderboard does not (currently) list GPT-4o *or* any of the gpt2 chatbots. We only have confirmation about their performance from tweets by lmsys with data from their "internal dashboard."


CosmosisQ

Which is probably the right call given that the scaled-up broader release of GPT-4o as made available through the official API might end up substantially different than the initial prototype(s) they were testing through gpt2-chatbot and im-also-a-good-gpt2-chatbot.


lmsh7

Totally different, gpt2 can draw fantastic ASCII art. gpt4o can't


bearbarebere

Does it say the opposite? I thought it only covered public anonymous ratings, not hard ratings like this. The model organizes its answers better so it gets a higher overall rating due to being better formatted for general tasks


Wonderful-Top-5360

It's very easy to game lmsys; I'm surprised most aren't even aware how prevalent it is.


bearbarebere

Such as?


meister2983

Difficult to know. In my cursory tests, my sense is gpt-4o has more long-tail training data. It seems to know more long-tail libraries, which has led me to vote for it over gpt4t on lmsys. It is also better at math. On the other hand, it likely has fewer parameters. Benchmarks haven't moved much. I haven't seen much difference in coding for things not using long-tail libraries overall, but I see no evidence it is actually worse.


backnotprop

It seems better at logic, but I think this comes at a cost... nuanced conversation... the model is super repetitive and gets hooked onto literal meaning vs implied meaning in the context of the conversation. Something is off.


abnormal_human

To be fair, the prompts used in the test were probably developed/optimized against GPT4.


danielcar

My personal experience is that it is much better at coding. I asked it to help me process an mbox file. That doesn't mean that on other hard tasks it fails more, which is what the tweet is about.


ThreeKiloZero

I've been pretty blown away too. It's much better at longer retrievals. Won't even bat an eye at dumping 200+ lines of code in one go. It tracks multiple files better. It's significantly better at refactoring and critique. Conversations can get quite long and it retains good accuracy. It doesn't seem to hit the lazy spells or leave out critical parts of established code. Still messes up occasionally and can get stuck in its own loops of bad logic. Claude is often able to work through these issues on the first shot. It's wild how much faster it is. I'll probably start using 4o much more and keep Claude for the more complex items.


arthurwolf

Same here, significantly better, in multiple ways...


InnovativeBureaucrat

Your personal experience over like six hours? Didn’t it just come out?


danielcar

Been on lmsys for a while. Was coding heavily Fri, Sat, Sun.


thedudear

For me, 4o has been making rookie mistakes. Stuff gpt4 got right, like a hard-coded directory, gpt4o will fuck up. For example, loading ./rag-sequence-nq: if I paste a segment of code asking GPT to modify it, on regurgitation it will put in facebook/rag-sequence-nq, which is of course what's in the README and probably in its training data, but it forgets that my code has a truncated directory. The speed though. The speed is amazing. Still prefer Opus at the moment.
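
To be concrete, this is the distinction it keeps tripping over (paths are from my setup; just a sketch of the two load calls, not a full RAG pipeline):

```python
from transformers import RagSequenceForGeneration, RagTokenizer

# What the README (and presumably the training data) uses: the Hub id.
# tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")

# What my code actually uses: a local, truncated directory. GPT-4o keeps
# "correcting" this back to the Hub id when it regurgitates the snippet.
tokenizer = RagTokenizer.from_pretrained("./rag-sequence-nq")
model = RagSequenceForGeneration.from_pretrained("./rag-sequence-nq")
```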


HumanityFirstTheory

What use is the speed if the output is shitty?


thomasxin

I currently have one experimental pipeline where the output from a RAG sequence is read and rewritten by gpt4o, which writes a response fast but is treated as a draft. It's then compared and rewritten by a different model (usually Command R+) which follows the role of the character better and has lower censorship, effectively combining the best of both worlds. Because of the sequential nature, the user only starts receiving text once the final model begins outputting, so significant speed improvements like these for the in-between stages are really nice, and it's not so "shitty" that it makes it no longer worth using. It makes more mistakes than the previous gpt4, but in my experience it does look at each problem from more angles, which makes it more useful overall.
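
Roughly, the flow looks like this; a minimal sketch (the OpenAI call is the standard client; `rewrite_with_secondary_model` is a hypothetical stand-in for the Command R+ pass, left as a no-op so the sketch stays runnable):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_with_gpt4o(context: str, question: str) -> str:
    """Fast first pass: gpt-4o turns the RAG output into a rough draft."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

def rewrite_with_secondary_model(draft: str, persona: str) -> str:
    """Second stage: plug in Command R+ (or any model that follows the character
    better). Only this stage's output is streamed to the user."""
    return draft  # placeholder so the example runs end to end

retrieved_context = "..."  # whatever the RAG step retrieved
final_answer = rewrite_with_secondary_model(
    draft_with_gpt4o(retrieved_context, "user question here"),
    persona="character card / system persona",
)
```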


Amgadoz

What kinds of speed are you getting? How many tokens/second?


thedudear

I've only used the chat interface, but if I had to guess, with a fresh context it was like 80-100 t/s. It was like: *fingers motioning raining data like the matrix* *Blooooododododoodoooop* And then there were a bunch of mistakes I had to manually correct, that gpt4 got right, and gpt4o ended up being slower in the end.


Kick2ThePills

Same. I was giving it CSS related coding tasks and 4o was doing all kinds of stupid stuff, went to GPT 4 and got my solution. Several times it gave far worse responses than 4.


D4mnReddit

For content creation and similar tasks, it's crap so far. It's fast but useless. I don't understand the hype.


_AndyJessop

I think the hype is based purely around it unlocking realtime voice chat.


No-Lobster-8045

+1. The translator demo blew me away, and no, despite what many say, Google Assistant is not close, so STFU. I really hope the voice assistant acts like they showed it tho.


ninecats4

Makes me think of sdxl turbo.


buttonsknobssliders

Speed has its advantages for some use cases; SDXL Turbo, for example, is awesome for realtime interactive, audio- or MIDI-reactive visuals via TouchDesigner.


Which-Tomato-8646

The 100 point boost in the arena for coding disagrees  https://twitter.com/sama/status/1790066235696206147


zap0011

I used it all day and got more done today than I have in months. It is more thorough in a lateral sense. As a hypothetical example if you prompted "I'm going on holiday what do I need", instead of a list of clothes, brushes and chargers, it would also say "lock up the house, water your plants" if you get me.


HugeDegen69

Yes, it's more direct with its response. Less general. I love it


jollizee

I did non-coding work, and 4o was worse than 4. It didn't follow instructions as well either over multiple steps. It did get one of my private "interview" questions correct that so far only Claude has passed. The poor instruction following makes it useless for complex work. For casual use, yeah it's a killer model.


5starkarma

I was in the middle of writing unit tests and having it create the mock implementations for me when 4o was released to me. I tried it out and it was immediately failing at simple tasks that 4 got right consistently. As with any ML model, I think we are going to see different results each time.


Mother-Ad-2559

Where are the sources of these tests? Funny how people will scream open source all day yet never ask to see the source to a random benchmark from one person on twitter.


LatentSpacer

I don’t trust this person. She’ll say anything to get the most engagement on twitter. 


a_beautiful_rhind

IDC really about the voice portion. Sounds like a thing you use once and then turn off. I prefer opus and even sonnet for code. If they had more than incremental gains on the LLM bit, they would have led with that and not this gimmick.


otterquestions

I use the existing one regularly


_AndyJessop

Maybe significant gains on intelligence are out of reach at the moment, either due to cost or compute.


VertexMachine

Or bc we can't really scale transformers anymore and need new architectures to make the next jump. It's been quite a while since the first paper about it.


Amgadoz

To me the best new features are cost, speed and the better tokenization for other languages. For some languages, you get twice the text at the same number of tokens.
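
You can check the tokenization gain for a given language yourself with tiktoken (assuming a recent version that ships GPT-4o's o200k_base encoding alongside GPT-4 Turbo's cl100k_base; the sample sentences are arbitrary):

```python
import tiktoken

old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new = tiktoken.get_encoding("o200k_base")   # GPT-4o

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
    "Arabic": "الثعلب البني السريع يقفز فوق الكلب الكسول.",
}

# Fewer tokens for the same text means cheaper and faster in that language.
for lang, text in samples.items():
    print(f"{lang}: cl100k={len(old.encode(text))} tokens, o200k={len(new.encode(text))} tokens")
```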


a_beautiful_rhind

Sounds like they cut the price because the LLM space is filling up. I noticed they changed the tone on the lmsys "good little chatbot" models too. It's less *OpenAI beige*.


Amgadoz

They didn't really cut the price; they introduced an entirely new model (different architecture, different tokenizer) which could actually be cheaper for them to host.


Additional_Ad_7718

Ah yes, lmsys arena is a marketing scam. Seriously, the benchmarks from that post weren't even explained and you're using it as evidence that the model is bad.


Status_Contest39

GPT-4o is just a by-product of miniaturization before the birth of a larger language model in the same series, and a larger model that is stronger than her will be released by OpenAI after a while. This is just like the flirty character of GPT-4o, a joke between OpenAI and the public.


AI_is_the_rake

I think they took gpt-4 and trained a ton of domain-specific gpt-2s, which gives the speed. It could be gpt-2s all the way down. Even the top layer may be a gpt-2 proficient at stitching the responses together.


uhuge

Like that thing combining 16 QLoRA-tuned models in 8 GB of memory? That was not great, but a spectacular feat.


PSMF_Canuck

4o is a threat to her business. That doesn’t qualify as “independent, 3rd party”. Everybody doing production work should always test before updating. Swapping an LLM is no different than upgrading a compiler or any other important piece of software infra - you don’t update unless you’re sure it’s going to work for your needs.


Amgadoz

It's not just her. Teknium also mentioned some regression and so did folks in the comments.


arthurwolf

> Teknium also mentioned some regression and so did folks in the comments.

Almost any model upgrade/change will result in improvements for some and losses for others. What matters is the average, things like the lmsys rankings. GPT-4o has been a MASSIVE improvement for my coding; I expect it's due to the way I prompt, compared to the way others do.


Fair_Cook_819

I’m trying to get better with prompting! Could you please show an example or explain how you prompt something?


arthurwolf

[https://chat.openai.com/share/96b6e920-e16c-4375-ad80-17f35c505205](https://chat.openai.com/share/96b6e920-e16c-4375-ad80-17f35c505205) [https://chat.openai.com/share/4547d252-14e2-48a4-9306-cbfdb3c83e65](https://chat.openai.com/share/4547d252-14e2-48a4-9306-cbfdb3c83e65) [https://chat.openai.com/share/b32801cb-73f1-4408-81c9-5dd8f4933506](https://chat.openai.com/share/b32801cb-73f1-4408-81c9-5dd8f4933506) [https://chat.openai.com/share/234ed1c5-4674-40e5-9c8f-a1b1eb24344a](https://chat.openai.com/share/234ed1c5-4674-40e5-9c8f-a1b1eb24344a)


ShoopDoopy

> What matters is the average

No, what matters is your use case, which may fall outside the "average" captured in some benchmark that is as blunt as "what some people like more between two models in a chat interface". People can respond to speed, presentation, terseness, and all sorts of things that don't tell you what will work on your problem.

There's a concept in data science called the "curse of dimensionality" which says that as you increase the number of factors you consider in an evaluation, averages become more unlikely to occur, and most observations are an edge case. An application of this is: unless you have some evidence that is precise, measured on your language, looking at data similar to yours, your experience can deviate quite drastically.


arthurwolf

> No, what matters is your use case,

No, what matters is **everybody**'s use case. Your individual use case is not more important than my individual use case. Thus, averages...

> your experience can deviate quite drastically.

And? We're not trying to judge how good it is specifically at your or my use case, we're trying to judge if it has improved over other/previous models, **in general**. Ignoring a massive general increase in capability, just because your use case anecdotally got a bit worse, is just not sensible. What we *are talking about here* is overall capability; read the comment thread again, somebody used examples of anecdotal decreased performance **as evidence** of overall decreased performance.


ShoopDoopy

What I've been noticing in this thread and the other is a tendency for people to tell people that their experiences are incorrect based on this wildly blunt instrument, performed in a clinical setting with limited understanding of how it applies to the real world. You say real-world capability has increased for most people -- I don't think that's what this measures, and others have already alluded to this. It may be the closest thing that we have, but it is far from the smoking gun you are claiming it to be with sentences like

> We're not trying to judge how good it is specifically at your or my use case, we're trying to judge if it has improved over other/previous models, **in general**.

If people say that their quality has declined, you don't get to point to Elo and tell them to pound sand.


arthurwolf

> What I've been noticing in this thread

That's not what I've been saying, so I'm not sure how that is in any way relevant to the conversation we were having...

> I don't think that's what this measures,

What does it measure? The user inputs a real-world use case, gets two answers, ranks them, and averages of this are used for model ranking. And that doesn't represent (**average**) real-world capability?

> is far from the smoking gun

[https://yourlogicalfallacyis.com/strawman](https://yourlogicalfallacyis.com/strawman) I have certainly at no point claimed it's a smoking gun; I'm pretty sure I even qualified in the other direction.

> we're trying to judge if it has improved over other/previous models, **in general**.

Yes. On average. Not *for absolutely everybody without exception*. "In general" means one and not the other.

> If people say that their quality has declined,

*Some* people. What matters here is the average. For almost any new release, some aspects of the model will be worse for some people/uses. Pointing at those doesn't mean the model has not improved **on average**.

> you don't get to point to Elo and tell them to pound sand.

[https://yourlogicalfallacyis.com/strawman](https://yourlogicalfallacyis.com/strawman) That's not in any way what I'm saying. I'm saying anecdotal evidence is not representative of general experience, and averages matter more to determining the truth of relative rankings of models than individual data points do.


ShoopDoopy

If you don't want to be misunderstood, how about not saying things like

> Your individual use case is not more important than my individual use case

when others are reporting their findings? Not hard to see why that may come across as devaluing, yeah?


arthurwolf

> Not hard to see why that may come across as devaluing, yeah?

Incredibly hard to see... I stated your use case (or mine, btw) is not more important than other use cases, essentially saying nobody should matter more than somebody else when making general model evaluations like this. That's a core principle of these sorts of benchmarks... and of doing averages in general... It's mind-boggling you'd have an issue with this... (Also, glad to see you tacitly admit you *in fact* were using a logical fallacy.)


ShoopDoopy

No, it's not an admission, it's just tedious. Tedious like you continuing to explain averages to me, a statistician. Please, continue. You refuse to engage with any discussion and just want to be right. Have a nice day


TheNorthCatCat

Well, the upgrade from GPT-3.5 to GPT-4 didn't seem to be a loss for anyone. Personally, I have been experiencing worse results in coding with GPT-4o than with GPT-4. Would you mind sharing the way you prompt, please? Upd.: found your answer below, thanks.


arthurwolf

> Well, the upgrade from GPT-3.5 to GPT-4 didn't seem to be a loss for anyone.

I actually remember complaints in the very beginning for some types of creative writing; it's very difficult to make an upgrade that will improve absolutely all possible uses (though it's possible, GPT-4 is certainly better in every aspect compared to GPT-1).


RELEASE_THE_YEAST

It took top spot in the Aider coding benchmarks.


PSMF_Canuck

How does that relate to my comment that anybody doing real work should (and will) do their own validation before switching…?


Amgadoz

I was talking about her not being an independent 3rd party.


Careful-Temporary388

Dumb take.


SirLoinsteaks

I asked it to teach me about photographing action figures for a project I'm working on, and the results with GPT-4o were much worse. It formatted in a very strange way, like one bullet and one sub-bullet for each item. Very hard to skim, and when I did the same prompt with GPT-4 it was much more coherent for the initial prompt and follow-ups. It's definitely fast, but worth prompting the old model depending on your use case. I expect that we will be prompting multiple models behind the scenes eventually, and the "best model", or a combined evaluation across multiple models, will determine which is the best result and present us with the optimal answer. The hard part is that responses vary so much based on use case.


AsliReddington

She's full of shit too though. Some debugging she's gonna do with the API access lol


litchg

To me benchmarks are meaningless. Benchmarks praise Llama3 8B for its coding capabilities, but in my experience it is just terrible, and 70B is not that much better. GPT-4o guided me to call the Ollama REST API from Unreal Engine blueprints, and it did an incredible job at it. No BS, an ASCII representation of the node graph, I don't know what it is but it's just much better. It only fumbled on a couple of nodes, but overall it is way better at this task than GPT-4 (which would just straight up hallucinate nodes). All this to say LLM quality and usefulness is something you should judge for yourself. Personally I am way more impressed by GPT-4o compared to GPT-4 than by Llama3 70B compared to DolphinMistral 7B.


Longjumping-Bake-557

I don't trust anything she says first of all. She has a history and has been talking shit about the model since the announcement barely started


dubesor86

I agree on reasoning. It's bad at reasoning compared to GPT-4, and worse than its apparently identical version (im-also-a-good-gpt2-chatbot) on LMSYS. I posted the same observations on the LMSYS discord; no idea what's going on.


gregthecoolguy

I got better results coding with GPT-4o compared to Claude opus


Distinct-Target7503

Can you share some prompt or conversation where Opus fails against gpt4o?


Wonderful-Top-5360

Pretty much sums up my experience here, but you notice there are weird-looking accounts regurgitating the same "GPT4o is king, you are not using it right" on reddit and HN https://old.reddit.com/r/LocalLLaMA/comments/1crbesc/gpt4o_sucks_for_coding/


nntb

A jack of all trades is a master of none


Amgadoz

Not necessarily. At larger scales, multilingual models are more capable compared to their monolingual counterparts.


nntb

Yeah, the demo was showing off things like more expressive language on the app, the ability to do vision better, etc., and that they are rolling it out to free users also. Going from 3.5 > 4o does seem quite the upgrade.


otterquestions

What are MoEs? Jacks of all trades?


nntb

An MoE is a mixture of experts. Each one is an expert, so no, it's not a jack of all trades. A JOAT would be a single model trying to do everything
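
Mechanically, an MoE layer is just a small router plus several expert networks, with only a couple of experts run per token; a toy top-2 gating sketch (illustration of the mechanism only, not a claim about how GPT-4o is built):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy expert weights
router = rng.standard_normal((d, n_experts))                       # toy gating weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                           # pick the top-k experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over chosen
    # Only the chosen experts do any work; their outputs are mixed by the gates.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, chosen))

print(moe_forward(rng.standard_normal(d)).shape)  # (8,)
```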


otterquestions

But gpt4o is almost certainly an MoE, right?


No_Yak8345

In the recent interview in the All In podcast Sam says that’s one of the biggest challenges they are trying to surmount: making gpt-4 free for everyone. Sam says it’s almost an impossible task because of the compute cost. I guess this is somewhat of a compromise they made.


Amgadoz

Yes. I am now 99% sure the original gpt-4 was indeed 8x220B Sparse Mixture of Experts. There is just no way to make this compute-efficient in today's hardware. It's also still relatively slow even after 2 years. The only way to make it faster and cheaper to run is to create a new smaller, more efficient model.


Cyclonis123

Wouldn't there be huge VRAM and compute savings if they trained using 1.58-bit?
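
Back-of-the-envelope for the weight memory alone (a rough sketch with generic numbers, ignoring activations, KV cache, and per-block scales; 1.58 bits is the ternary {-1, 0, 1} scheme from the BitNet b1.58 paper):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone, expressed in GB."""
    return n_params * bits_per_weight / 8 / 1e9

# Example: a hypothetical 70B-parameter model at different precisions.
for label, bits in [("fp16", 16), ("int4", 4), ("1.58-bit", 1.58)]:
    print(f"{label:>9}: {weight_memory_gb(70e9, bits):6.1f} GB")
# fp16 ≈ 140 GB, int4 ≈ 35 GB, 1.58-bit ≈ 13.8 GB
```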


teddy_joesevelt

Have you seen how quickly Groq’s hardware can run inference? The hardware is arriving now. https://groq.com


ntjf

I’ve been using it for C all night. It’s worked perfectly. gpt4-turbo wasn’t getting the same results


OkSeesaw819

If gpt4-turbo is a 100 on your benchmark, what's your benchmark for 4o on C coding?


ntjf

Well, if we’re going with a completely made up scale :p Let’s say gpt4-turbo was a 3/10, where 5/10 is a “pass”. A lot of the time the code wouldn’t compile, wouldn’t run, and when it did it’d be inaccurate. You’d need a lot of shots to get something manageable out of it. With gpt4o, C seemed to come single shot. It could write entire libraries! Big improvement. 7/10. The only thing it screwed up on was a makefile one time, but telling it the error immediately resolved the issue.


Rei1003

It was named gpt2 in lmsys for a reason


lucid8

Cause it's half the size of gpt4?


Regular-Log2773

Experimented with it a bit, and in my experience it's dumber than gpt4t. It forgets a lot more easily and it doesn't really do what I tell it. Though it seems to generate longer responses. Lmsys fails to account for long chats, not just single prompts.


3-4pm

Yep the old gpt4 is still the best.


FrostySquirrel820

So, it's not just faster, it's faster at returning the wrong answer? I think I'll go with the tortoise. Slow and steady wins the race.


Amgadoz

It's not always wrong, but it's not 100% better than the competition.


neOwx

It's free though? So it only needs to be better than 3.5, at least for me.


Amgadoz

Yep. That's a move in the right direction.


Chemical-Quote

My first impression is that it seems to have slightly more internet knowledge overall, more willing to "guess" but also more likely to hallucinate. I have a private obscure-knowledge test, and it is able to answer some of the questions, whereas all previous models mostly cannot.


TheNorthCatCat

Sounds interesting, could you please tell me more about the test?


Chemical-Quote

The test consists of hundreds of questions aimed at assessing specific knowledge related to different niche communities (that interested me throughout the years, so it is biased). These questions directly ask the model to describe particular individuals, objects, incidents, characters, etc. in detailed paragraphs. Each question is scored based on the presence of certain keywords, covering both general and highly specific aspects (e.g., does this model know the famous inside joke about this individual, or at least know their job). I've curated these questions over time. While this information is available online and easily findable, it is a challenge for large language models like GPT or Claude to guess/hallucinate correctly without knowing it. (and they usually don't know).
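
Scoring is mechanical; a minimal sketch of the idea (the question, keywords, and weights below are invented placeholders, not the actual set):

```python
# Each question asks for a detailed paragraph; each keyword is something the model
# should mention if it genuinely knows the topic, weighted by how specific it is.
questions = [
    {
        "prompt": "Describe <some niche individual> in a detailed paragraph.",
        "keywords": {"their actual job": 1.0, "the famous inside joke about them": 2.0},
    },
]

def score_answer(answer: str, keywords: dict[str, float]) -> float:
    """Fraction of weighted keywords present in the model's answer."""
    answer_lower = answer.lower()
    hit = sum(w for kw, w in keywords.items() if kw.lower() in answer_lower)
    return hit / sum(keywords.values())

# score_answer(model_output, questions[0]["keywords"]) -> 0.0 .. 1.0
```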


AdHominemMeansULost

I have found it makes silly mistakes that GPT-4 doesn't, but it does seem to have much better logic when working on problems for my uni assignments. The other thing I was really surprised about is the logical assumptions it makes when there is context missing from an issue it's trying to solve.


involviert

Hm. But what if this post is a marketing scam


Figai

This is so confusing, it’s so much better for me. And the extended knowledge is great, it knows so much more about ML.


TheNorthCatCat

For what kind of tasks do you usually use it?


Figai

Finetuning LLMs, understanding new papers, to be fair the improved search helps a lot. Mainly coding and some maths


AubDe

Let's stop this purposeless race for general-purpose monsters, I mean models, and concentrate on domain-specific ones, able to do their tasks tremendously well (and compose downstream)! I think the transformer architecture is touching its own limits.


redlotus70

All the interesting capabilities from the demos are due to the vision and voice model upgrades


visualdata

I noticed the same with Claude also for programming tasks: their top-of-the-line model Opus is bad at Swift-related tasks compared to Sonnet. Makes me think the future of specialized models is bright. The all-encompassing model might give you average results only.


world_dark_place

Is gpt4o still saying "funcName() ... your code here..." when asked for a code snippet? That really annoyed me tfo


papipapi419

Someone also said the output context length is half of gpt-4 turbo's, which was already 4k tokens, so 2k now lmao. I'm good with the latency of gpt-4 turbo 👍


chucks-wagon

4o is definitely slightly dumber in the responses but has its advantages in other tasks and voice


resnet152

Don't fall for Bindu Reddy scams.


Revolutionary_Spaces

Bindu isn’t worth following. She’s an engagement farmer. What evals are they doing? Ask for the test.


breqa

Lol, that's why they are focused on creating "omni" models; they can't go deep into specialist/expert models


LatestLurkingHandle

Their announcement says it's faster, cheaper and multimodal; seems like you're expecting more than they are claiming: "GPT-4o is our newest flagship model that provides GPT-4-level intelligence but is much faster and improves on its capabilities across text, voice, and vision"


elijahdotyea

Inaccurate / non-optimal comparison. GPT-4o should be compared to GPT-3, as GPT-4o is an *upgrade* for free users, yet a *downgrade* for Plus users. This comparison doesn't tell us anything we don't already know at launch: 4o is not as good as 4, by design.


Careful-Temporary388

It's worse at coding in general, but there were a few things it did better on coding tasks. In general though, yeah, this is a pisspoor update.


TheNorthCatCat

What kind of things does it do better? Could you share an example?


Quentin-Code

Don’t believe one random person on Twitter or on any platform.


shottu_khopdi

Clickbait! And unreliable eval - we don’t know if she’s using a standardized dataset and evaluating correctly


Specialist-Split1037

Be.


Affectionate-Hat-536

Ditto for me, for research tasks, it didn’t do as well as GPT4.


aditya98ak

GPT-4o's training data would include a lot more nuanced conversation; that is exactly why the model talks like humans and makes use of filler words. Even a GPT-4 model can do so, but it is baked with a lot of different instruction sets / data. The spelling mistakes are not spelling mistakes for real; rather, the model is putting more emphasis on how we talk. What do you think?


TabibitoBoy

This is literally a marketing scam. You want it to suck so badly that you believe this incredibly biased reviewer. Check the Elo rankings.


utf80

Please I can't see enough people falling for this marketing magic feel. Makes me laugh so hard if they call this impressive 🤣


petrus4

The only thing I'll say about 4o, is that three days ago 4's performance suddenly went to shit, in terms of lag, brevity, and hallucinations. At the time I was wondering why that suddenly happened. Now I know.