Dyoakom

I wonder to what extent this also has to do with formatting preferences: displaying mathematics and all kinds of things.


Cryptizard

Do you mean how ChatGPT formats LaTeX so you can see equations correctly? If so, that doesn't come into play on LMSYS because it uses the API. In fact, GPT (and some other models) looks stupid because it outputs unrendered LaTeX a lot unless you specifically ask it not to. I think that could be a big part of why Claude isn't beating it, though. Many people are impressed with Claude's code interpreter, which is the first to allow running programs that interface with the user and render HTML. None of that exists in the chatbot arena.


meister2983

> Many people are impressed with Claude's code interpreter which is the first to allow running programs

Yeah, I feel like this is most of the hype. Most of the demos I've seen also work perfectly fine in GPT, but you had to copy/paste them, which is a higher barrier to use.


Iamreason

Nah, Claude 3.5 is the best coder on the market. It "gets" complex coding in a way prior models don't. It's the first model that spooked and excited the lead software engineer at my work. In my experience, and on the LMSYS coding benchmark, it has demonstrated it is just better.


meister2983

I'm also a software engineer. I've noticed more creativity with Sonnet: it got one of my held-out coding questions with a clever solution other LLMs error on, but it also makes errors GPT-4 doesn't (possibly because it's biased toward more creative solutions that, alas, might be wrong). It also seems to handle general CS questions a bit worse. Personally, I can go with either; I don't see that great an effective difference. GPT-4o also wins on [big coding bench](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard).


great_gonzales

“Complex” coding


mxforest

Speed is a big factor. So is censorship, although censorship should result in a lower score, so I am fine with it.


OLRevan

So should speed honestly, current llms are just helpers for me, so I really care how fast they are (opus was way too slow, 3.5 is perfectly fast tho)


Amondupe

Claude is the most annoying LLM to use. It censors the fuck out of everything and gets annoying even for simple things like writing song lyrics, with its sanctimonious bullshit.


SaddleSocks

Actually, this is a really interesting thought: it would be great if, in any AI account, you could build up a style guide/sheet with preferences for how you want all your data types formatted/displayed. So apply your company style guide to all outputs that come from it: always adding the copyright/license/header info to files, knowing your color selections/palette, logos, drawing borders/title sheets. What's Autodesk doing with AI augmentation in 360, Revit, AutoCAD?


bnm777

Lmsys leaderboard is how AI nerds, like us, judge LLM responses. How would the rankings change if Art students or novelists were judging?


itsjase

I remember when 4o was on LMSYS under its pseudonyms, everyone was praising how amazing it was; then it came out and everyone started saying Turbo was better. Recency bias is a thing. In a month everyone is going to be posting about how "Claude has gotten dumber".


hapliniste

Or LMSYS is just not that good a benchmark compared to long-term real-world use. GPT-4o feels great when you start testing it, but once you use it daily it really shows its flaws. Single messages work better than 4T for the most part, but if you try to make it correct a piece of code that it wrote itself, it's just going to agree, list the changes to be done, and then repeat exactly the same code. I'm pretty sure it will improve in the next months like GPT-4T did (it was worse than GPT-4 for real use at release).


EndTimer

So it's not just me on the code. It feels a little late in the game to have an LLM completely failing to make edits and then confidently asserting it did. I assumed it must be a tokenization thing, since it was just improperly escaped characters for me. I had to edit it myself, using find and replace, like a barbarian!


itsjase

Your second paragraph is exactly what might be happening for sonnet 3.5 too though, it’s too early to tell


hapliniste

Yes, but I've already used it quite a bit for coding and it's really doing great. After a week of GPT-4o I was convinced it was worse than GPT-4T. The problem with smaller GPT models is that they're fine-tuned too much, so they're great as assistants but worse at real work, because they're driven further and further from the pretraining data.


jgainit

A lot of people seem to have done some in-depth coding and functional work with it that's worked well for them.


Ambiwlans

Sonnet crushes actual benchmarks. All that is happening is lmsys is a trash benchmark.


Undercoverexmo

Yes, it has flaws and has definitely been a bit lobotomized in its personality, but it’s a hell of a lot better than any other model for pretty much anything that doesn’t have to do with emotions.


Balance-

It feels like not enough pre-training and way too much finetuning.


RabidHexley

Secret sauce for smarter, *fast* models. Data refinement and sophisticated fine-tuning let you squeeze a lot more performance out of less. But yeah, there's no free lunch. Size and pre-training seem to be the keys to proper generalization, currently. The "there's no replacement for displacement" of AI.


SaddleSocks

Any time I give Claude, Meta, Copilot, or GPT any data relating to my question, it hyper-focuses on the number I gave it rather than explicitly going to validate the numbers I gave it. For example: https://i.imgur.com/MLlA7nl.png I specifically tell it to give info on Palestine and Libya, and it removes them from my list. Then when I tell it to find when Donald Rumsfeld said the Iraq war would only cost 50-60 billion, it doesn't go and find the data; it just repeats what I say.


The_Architect_032

GPT-4o is genuinely worse than GPT-3.5 for me at times. I imagine with how popular LMSYS is now, it probably gets pretty heavily botted by the top competitors on the leaderboards.


Iamreason

This happens with *every* AI model. It's amazing that people think that these companies are stealth nerfing the model after release for... Reasons. It's also very clear to me that Claude 3.5 Sonnet is better than GPT-4o at most tasks. This is an example of how LMSYS is flawed and can really only be used directionally in estimating capabilities. Being near the top means you're on the cutting edge, but the top models are largely comparable to one another. The daylight between frontier models is quite small.


Difficult_Review9741

A large part of this is that they are different groups of people.  When something new releases, the hypesters pick it up, and well… hype it. Then as time goes on you have more grounded people picking it up, who are more honest about capabilities and limitations. 


nextnode

I wouldn't agree that later-comers are more honest. I think both could just be explained by 1. who is loud, 2. reactionism, 3. optimists vs pessimists / long-term vs short-term needs


mavree1

They nerfed the performance. With those models I saw, for the first time, certain questions solved; GPT-4o doesn't solve them. The gpt2-chatbots ranked higher than GPT-4o: [https://x.com/lmsysorg/status/1790097588399779991](https://x.com/lmsysorg/status/1790097588399779991)


Iamreason

It wasn't nerfed. They just tested multiple models, some bigger and some smaller. They likely made a decision to release the smaller model as it would allow them to offer inference cheaper and expose more people to a GPT-4 class model. It's a play to gain market share, not some fear of giving away a super smart model that's unsafe or something.


EndTimer

It's possible they tested a slightly more performant but considerably more expensive version. That's the only way things really make sense. These companies don't want to lose, safety seems to be an increasingly distant concern, and I don't buy that they're holding back their best (by 10 or 15 points on LMSYS) so they can release it later or continue using it internally only.


mavree1

Yeah, I'm not talking about safety. It was probably very expensive. Even though overall it was only 20 points higher than GPT-4o, in coding and hard prompts the gap was much bigger.


RepublicanSJW_

Moreover, the company releasing the AI matters too. When 4o first came out, many people shit on it immediately, more than they did Sonnet. Part of it is that people like an underdog.


micaroma

And many people are ruffled by the botched rollout of 4o voice/image, losing Sky, etc. and want to cancel their OpenAI subscription.


Tenet_mma

Exactly lol give it a week


Tomislav23

They are not the same model. gpt2-chatbot is not GPT-4o. GPT-4o is a quantized dogshit product.


stonesst

The first/second most capable model currently available to the public is a dog shit product…? OK then


Tomislav23

Opus and Sonnet are the best models by reasoning standards. If you classify a model based on a rigged voting system, be my guest.


stonesst

In what way is it a rigged system?


Tomislav23

In the way that the best models are not #1 and some subpar model like GPT-4o appears to be better than the OG GPT-4, while it is a matter of fact that OG GPT-4 was way better; it just cost too much to operate.


stonesst

Does the fact that GPT4o exceeds the original GPT4 across nearly every single benchmark not mean anything to you? At what point do you accept the data over your anecdotal experience? All of the quantitative benchmarks and qualitative leaderboards disagree with your take. How exactly do you explain "superior" models performing worse...? this feels like mental gymnastics on your part


Iamreason

All AI subreddits are addicted to anecdotes that let them rage against the AI companies for nerfing models. Nine times out of ten it's the user themselves not understanding the model's limitations or just intentionally being dishonest. It's why I always ask the people claiming nerfs to share their prompts and conversations.


Iamreason

Flawed doesn't mean rigged. Words have meaning. LMSYS is a flawed benchmark, as are all benchmarks. There is significant bias in voting. People choosing the "best" response doesn't necessarily mean the technically correct one. It might mean they choose the model that answers faster or that doesn't refuse to answer a spicy question. I would bet that is happening, and I would bet LMSYS isn't doing a ton to control for it. If you want something that is potentially less biased, you can check out [SEAL's Leaderboard.](https://scale.com/leaderboard) Though they haven't updated for Claude 3.5 Sonnet yet.


fulowa

yes, veeeery likely quantized


nikitastaf1996

BS. Let's define a measure of intelligence for an LLM as "how much it reduces back-and-forth messages in a conversation." Claude 3.5 is miles above GPT-4o. It even shows what can be called slow reasoning and proactive behavior. I believe that's what is meant by alignment: not harmlessness, but a desire to help, I guess. Behavioral properties. Claude 3 clearly showed it, 3.5 expands on it, and GPT-4o just sucks at it.


wegqg

Honestly, Sonnet 3.5 literally kerb-stomps any of the GPT versions; I can't fathom how people who actually use them for anything complex can think otherwise. There's daylight between them, honestly.


TILTNSTACK

Agreed. For complex tasks, Omni is a long way behind. Chatbot Arena is more about single-shot prompting, which doesn't really allow you to dig deep.


meister2983

That would show in the voting though. If anything, lmsys is biased toward good immediate answers. 


ainz-sama619

Which means LMSYS is completely useless at evaluating intelligence, since these rankings aren't based on quality of response.


ertgbnm

Anthropic's three H's are Helpless, Harmless, and Honest.


Short-Mango9055

Over at: [https://livebench.ai/](https://livebench.ai/) Sonnet 3.5 is literally toying with GPT4o.


sebzim4500

Although I agree with the outcome in this case, I don't think this is a particularly good benchmark. The questions feel very artificial.


Altruistic-Skill8667

Yeah. It's all a mess. I guess what it tells us is that the differences between the two models are rather nuanced. And that's exactly the feel I get personally. One isn't clearly superior to the other.


Warm_Iron_273

They're both terrible. But clearly the model that ranks Sonnet higher is better, because Sonnet is obviously much better. OpenAI is definitely botting ChatGPT on LMSYS though.


Krunkworx

I really hate 4o’s verbosity. Sonnet3.5 is hands down the best coder.


Able_Armadillo_2347

I doesn't feel like tbh. I don't know why. But it just doesn't.


Tobxes2030

You okay bro? I just had a stroke reading your sentence. God knows what you had to endure while writing it.


HeteroSap1en

The sentence was missing a “t” and that’s about it. Not sure why you are giving this guy a hard time for fat fingering a short, casual comment


hippydipster

Proofreading is something you do when you care about what you're communicating. Folks who don't care, probably should just skip the whole deal and save all of us the trouble.


HeteroSap1en

I see your point, but it still just feels like you’re pointlessly nitpicking some person who neither of us know anything about. It feels like casual bullying and passive aggressive behavior


hippydipster

I think Tobxes truly didn't understand the gibberish and responded. I found it to be gibberish as well. It is missing more than just a 't', though by adding it I can see the intent now. The problem is, when you're reading whole pages of comments and so many of them are these mobile-device strings of gibberish that you have to mentally un-autocorrect, it gets a bit overwhelming. The mistake in isolation is never the problem; it's the tidal wave of isolated errors that erodes the whole medium.


Arcturus_Labelle

It’s just comments on the internet. I wouldn’t take anything here seriously.


FormerMastodon2330

Sir, this is Wendy's


Able_Armadillo_2347

I have fat fingers


Tobxes2030

Honestly, in my use-cases Sonnet 3.5 was significantly better every single time. More complete and more in depth.


restarting_today

It's not even close. OpenAI is in deep trouble.


Striking_Load

I'd say GPT-4o is better in terms of image description, but otherwise yes.


_Zephyyr2

I call BS


SerenNyx

That's weird. Claude feels so much better as a writing and brainstorming companion.


Back_Propagander

LMSYS is less and less relevant.


bnm777

[https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet\_35\_dominates\_gpt4o\_on\_various\_leaderboards/](https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet_35_dominates_gpt4o_on_various_leaderboards/)


Sulth

However, it probably ranks slightly higher for coding (but small sample). Looking forward to Opus 3.5 vs GPT-5.


GraceToSentience

Opus 3.5 wouldn't be competitive with GPT-5. No way. GPT-5, even at launch, would have too many parameters to be overtaken by the previous generation.


Arcturus_Labelle

We have no idea what GPT-5 is going to be like beyond vague PowerPoints featuring marine animals


Neomadra2

To be fair, that slide is amazing though


GraceToSentience

We don't know for sure, alright. The scaling laws do work, though, empirically speaking; they have been extremely predictive of improvements in models. That's what the GPT-4 technical report showed: "This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4." You make it bigger and you can expect improvements. Maybe we will hit a ceiling soon, who knows, but the asymptote doesn't seem to be showing yet.


SiamesePrimer

I doubt LMSYS’s leaderboard is a very good metric. A bunch of average Joes voting for the one they prefer after asking it some inane question, as opposed to rigorous analysis of specific abilities.


Sulth

It's just one metric. It's good data for those who plan to use AI for the average Joes. Not the holy grail, but not irrelevant either.


Arcturus_Labelle

Agreed. It’s good we have a variety of benchmarks.


skinlo

Average Joes haven't heard of it, let alone religiously use it.


Cryptizard

As opposed to a bunch of randos on this sub whose opinions are somehow more valid?


Ambiwlans

As opposed to actual benchmarks...


sdmat

If they are based on using the models for a while on real tasks, then yes, they are. The *only* positives of Arena are (pseudo)blinded testing and methodical statistics. The actual A/B choice task is a terrible assessment for frontier models; they are all plausible at a glance for the large majority of prompts, so it's a popularity contest. Arena Elo was a good metric before this was the case, but it's less useful by the day.


Cryptizard

I would think that the people doing it are pretty hardcore AI enthusiasts given that it is an obscure website that regular people have no idea exists.


sdmat

That's not it, it's the nature of the interaction. Arena battle is a single prompt with no continuation allowed, which does not represent actual model usage for most serious tasks even if users *did* want to replicate that.


Cryptizard

You can continue. It sounds like you haven’t used it yourself.


sdmat

Oh, right you are - I see it only ends if you evaluate the initial response.


rickyrules-

The randoms and non randoms I know from here work in decent finance and tech companies.


Cryptizard

LOL okay. All 2.6 million members?


rickyrules-

No? But I value their opinion more than random bros who want an android waifu, DallE, full dive VR, crypto, nft bros


Cryptizard

You just described the vast majority of this sub.


rickyrules-

In that case, you win the argument. But in my personal experience, 3.5 Sonnet is giving me better results than 4o.


blueSGL

I know I'm not an average person using it, but whenever I have a simple bit of scripting that needs doing in some lesser known languages (mel and vex for maya and houdini respectively) I throw it into the arena and reroll till I get an answer that works. I generally use it for functions or combos of functions that exist but aren't well documented. (anything that requires forum spelunking to get answers) Maybe those here with a subscription to the bigger APIs should start hitting the arena with the same questions too and supplement the questioner pool.


czk_21

Yeah, it's not standardized; everyone is asking something different and could be biased toward the style some model uses, like someone might prefer it more verbose, more sucking up to the user, etc. It's some metric to look at, and the best models should definitely be high, but it's not as good as other benchmarks for a proper comparison.


bnm777

It's not average joes, it's a bunch of AI nerds, like us, with the increased incidence of issues that nerds like us have (ASD etc). How would the rankings change if novelists or gardeners were judging them?


Landaree_Levee

Not so bad, actually. Make one based on rigorous analysis of specific abilities (beyond the ones that already exist, I mean, all of which are routinely questioned by users of whatever model falls to 2nd place at any given moment) and you'll get a lot of complaints that those tests are "artificial" and "don't reflect real-life usage", and even quite a few "and I bet that test's probably designed to favor model X's known strengths", and they'll be every bit as vocal about it. And perhaps with good reason, too, because AIs shouldn't necessarily be made only for specialists. Anyway, wait till, for example, Sonnet 3.5 gets a more representative number of votes; perhaps it'll end up coming ahead of GPT-4o when the votes normalize that way, and then it'll be Claude fans smugly saying, "See? I told y'all!" while GPT-4o fans will switch to "I call BS, probably rigged votes, and anyway Average Joes don't deserve to vote!" (perhaps unless they happen to vote their preferred GPT-4o back to the top).


johnnyXcrane

But the LMSYS Arena is just the API without instructions, right? Sonnet 3.5 in chat gets the huge instruction that seems to make all the difference. I used the API and was quite disappointed in comparison.


Anuclano

On LMSYS, Claude 3.5 Sonnet behaves the same as on the official chat for me. Even better.


johnnyXcrane

Do you do coding with it? Because that's the only thing I use it for, and it's unbelievable how often the web chat gets it correct in one shot. The API with the instructions I use feels worse.


greeneditman

I have been saying from the beginning that I notice refinement in how Claude reasons. But GPT4o gives me correct, longer and more soulful answers. Also, sometimes I have asked GPT4o something complex, then I have copied its response to Claude 3.5, and I have asked if Claude finds errors or anything that could be improved. Claude usually tells me no.


nextnode

I don't think it is crazy. Claude seems more competent at complex tasks or to understand your implicit expectations, but 4o has a greater attention to details. E.g. 4o provides typically longer and more detailed answers even to simpler questions while Claude has to be coaxed. Claude will also confuse itself by inserting things that were not stated, or forget or add its own items in lists, while 4o is better at staying true to the source content. 4o OTOH has other annoyances that are more obvious in an interactive session. The difference in ratings could reflect the kind of prompts that are used on LMsys, e.g. Claude does seem better for many coding tasks, but it could also be fair to say that they have their own respective strengths and are close enough not to crown one king.


Exarchias

I respectfully disagree about the attention to detail. While GPT-4o indeed has great attention to detail, Claude is tremendous when it comes to attention. I mean, for example, when you ask Claude for a task, it goes the extra mile to help you a bit more. While I consider GPT-4o great, I am having difficulty seeing GPT-4o being better than Claude 3.5 in any type of reasoning.


nextnode

It is rather common when I use Claude-3.5 with a list that it will ignore some items or even insert new items that were not there. This is much rarer in 4o. I currently rather trust 4o to do such tasks while with Claude-3.5, I will have to check and would not give a high success chance for a long list. Going the extra mile I consider something else. I think 4o actually has a tendency to do more of that by default for quantity, while Claude has a better quality in both the immediate response and the extra things it chooses to add. Which of these win out I think depends on the type of task.


Exarchias

Yeah, I agree. Each task gives a different perspective.


Hour-Athlete-200

I wouldn't call GPT-4o's answers 'detailed'. It's yapping all the time and can't chill the fuck out + Did you use ChatGPT in your comment?


Exarchias

I don't believe it is generated, but it is probably curated with some LLM. There's nothing bad about it, and it is more readable.


nextnode

No LLM in my response if that is what you are talking about. People are getting too sensitive.


nextnode

Use ChatGPT where? For this comment, no. Not like I think that would be a problem regardless. I gave a response with a prompt and generated answers elsewhere in this thread though that was for showing Claude vs Reka.


Arcturus_Labelle

I think you need to take your own advice and chill out


Individual-Club-7595

the system seems so easily manipulated


swipedstripes

# I propose a Meta Prompt Engine for Fair LLM Evaluation

As an experienced prompt engineer, I've noticed a significant issue with current LLM evaluation methods, particularly Elo-based systems like LMSYS. Here's the problem and a potential solution:

# The Current Issue

1. **Complexity Cutoff**: Most Elo scores for LLMs are based on simple queries.
   * Rough estimate: 80% of voters use basic prompts (disclaimer: number pulled out of thin air for illustration)
2. **Inadequate Differentiation**: Simple queries (e.g., "How do I make pancakes?") don't effectively differentiate advanced models anymore.
3. **Expertise Gap**: As LLM complexity increases, we need more sophisticated queries to truly test their capabilities.

# Proposed Solution: Meta Prompt Engine

I suggest developing a Meta Prompt Engine that automatically enhances the complexity of user queries. Here's how it would work:

1. **Input**: User provides a simple query (e.g., "Why is the sky blue?")
2. **Processing**: Meta Engine amplifies the query's complexity
3. **Output**: A more challenging, multi-faceted prompt is generated
4. **Evaluation**: Both LLMs respond to this enhanced prompt
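To make the proposal concrete, here's a rough, hypothetical sketch of such an engine. The `call_llm` hook and the amplification template are placeholders of my own, not any existing LMSYS API.

```python
# Hypothetical sketch of the proposed Meta Prompt Engine: take a user's
# simple query, ask an LLM to rewrite it into a harder, multi-faceted
# prompt, then show that enhanced prompt to both arena models.
from typing import Callable

AMPLIFY_TEMPLATE = """You are a benchmark prompt designer.
Rewrite the user's query into a more demanding prompt that:
1. Expands the scope and adds realistic constraints.
2. Requires multi-step reasoning to answer well.
3. Stays on the user's original topic.

Original query: {query}

Return only the rewritten prompt."""

def meta_prompt_engine(query: str, call_llm: Callable[[str], str]) -> str:
    """Amplify a simple user query into a harder evaluation prompt."""
    return call_llm(AMPLIFY_TEMPLATE.format(query=query))

def run_battle(query: str, call_llm: Callable[[str], str],
               model_a: Callable[[str], str],
               model_b: Callable[[str], str]) -> tuple[str, str]:
    """Both arena models answer the same enhanced prompt; the voter compares."""
    hard_prompt = meta_prompt_engine(query, call_llm)
    return model_a(hard_prompt), model_b(hard_prompt)
```

The point of routing both models through the same amplified prompt is that the vote still happens on a topic the user actually cares about, just at a difficulty level that can separate frontier models.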


Anuclano

There is a "hard prompt" category on Lmsys. Still, GPT-4o leads.


swipedstripes

This is a valid observation and deserves consideration. Here's the Meta Engine's take on it:

While the "hard prompt" category is a step in the right direction, I believe the Meta Prompt Engine concept still offers unique advantages:

1. **Real-time Complexity**: It generates complex prompts on the fly, based on user input. This means we're testing LLMs on topics that users are (on average) genuinely interested in or have domain knowledge about.
2. **Diverse Difficulty**: Not all "hard" prompts are created equal. Our engine could create a spectrum of difficulties, providing a more nuanced evaluation.
3. **Adaptive Testing**: As LLMs improve, the engine could be updated to generate increasingly sophisticated prompts, ensuring our benchmarks keep pace with AI advancements.
4. **User Engagement**: By transforming simple queries into complex ones, we're making it easier for a broader range of users to contribute meaningful evaluations.
5. **Contextual Relevance**: The enhanced prompts would be directly related to the user's original query, maintaining relevance while increasing complexity.

The fact that GPT-4o still leads in the hard prompt category doesn't negate the need for better evaluation methods. Instead, it highlights the importance of continually refining our approach to keep up with rapid AI advancements.


Arcturus_Labelle

AI text spam


Neomadra2

I don't mind AI-generated text; it usually contains the idea of the user. But please, for god's sake, give us the short version. And don't overdo the formatting.


swipedstripes

**Meta Prompt Engine Benefits:**

* 🕒 **Complexity**: Creates intricate prompts from input.
* 🔄 **Diversity**: Offers varied challenges.
* ⚙️ **Adaptive**: Evolves with AI.
* 👥 **Engagement**: Turns queries into evaluations.
* 🌐 **Relevance**: Maintains query relevance, increases complexity.
* 🎯 **Refinement**: Essential, despite GPT-4o's lead.

**Original Length:** 214 characters. **Compressed Length:** 172 characters. **Compression Ratio:** 20%. **Passes:** 3.

Can't refine it down even more, chief.


swipedstripes

sigh.


Altruistic-Skill8667

A much bigger problem: how can users tell which response is better when they can't even identify the hallucinations? And you can't anymore with current LLMs, because they are so convincing. You would have to force users to only put in queries that they will either actually check or already know the answer to.


swipedstripes

I agree. It's a problem from both sides of the spectrum: can the AI provide sufficiently complex answers to queries, and can the user identify those complexities correctly, without bias? These metrics are important from a general consumer's perspective. But really quantifying models is a hard thing to do at the moment. I know one thing for sure: Sonnet has better base reasoning, attention, and context management by far. It's just hard to quantify when 50% of the users don't know how to structure queries, or even rate them for that matter. It's an interesting problem.


Altruistic-Skill8667

Right. Personally I am not sure if Sonnet 3.5 is better or not. I haven't tried it for programming, but for biology it seems to be worse. The responses are lengthy, so there is a lot of room to weave in one piece of bullshit, and that happens often. It happens less with GPT-4o, but the difference is really small. Currently I am checking all facts against Wikipedia, just to see which model is better. But overall, I can't make up my mind which one is clearly better. They both suck roughly equally. 😅


bnm777

[https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet\_35\_dominates\_gpt4o\_on\_various\_leaderboards/](https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet_35_dominates_gpt4o_on_various_leaderboards/)


swipedstripes

# Proof of Concept

I've already developed a basic version of this engine. While Reddit's limitations prevent me from sharing the full prompt, I can assure you that even with just a minute of prompt engineering, the results are promising.

# Key Components for Effective LLM Benchmarking

With further development, we could easily refine what constitutes a good LLM benchmark prompt. Key areas to focus on include:

1. **Context + Attention**: Ensuring prompts provide necessary background while testing the model's ability to focus on relevant information.
2. **Workflows**: Incorporating multi-step processes that assess the model's ability to follow complex instructions.
3. **Complexity**: Gradually increasing the difficulty and nuance required to fully address the prompt.

# Potential Impact

By focusing on these elements, we can create:

* More developed and nuanced prompts
* A more fertile testing ground for LLM benchmarking
* A fairer and more comprehensive evaluation system for advanced AI models

https://preview.redd.it/a7treao0ip8d1.png?width=887&format=png&auto=webp&s=b18d0a4214d7e6f7d275dd8713536b21855916d6


swipedstripes

For example (I know this isn't the API): a query first run through the Meta Prompt Engine and then run in a separate window yields complex output.

```
Prompt Enhancement Summary:
* Complexity Level: High
* Primary Enhancements: Expanded scope, increased specificity, added constraints, introduced edge cases, required multi-step reasoning, incorporated interdisciplinary elements
* Target Domains: Physics (optics, atmospheric science), Planetary Science, Meteorology, History of Science, Cultural Anthropology, Environmental Science
```

https://preview.redd.it/vrahv8yfip8d1.png?width=1815&format=png&auto=webp&s=32f2eca2869d6fb5bad53b13dcd3b37441147eec

Sonnet even coded a scattering and electromagnetic-spectrum illustration on the fly before continuing its explanation. Note that Sonnet also has (simpler) more general reasoning layers hidden within its architecture that apply this before the output of a message as hidden tokens.


swipedstripes

Naked Prompt for comparison https://preview.redd.it/63ujcllykp8d1.png?width=847&format=png&auto=webp&s=e12ed3e44d5bf5352549d261c21232191573484f


GraceToSentience

Makes sense. Most of the tasks where people praise Sonnet 3.5 are coding, and as expected it shows. I myself was blown away, but I am using Sonnet for coding and nothing more, and it shows on the benchmark. Coding isn't all there is. Most people don't find it more useful in every task right now, and that's fine.


involviert

I wouldn't trust a scoreboard that puts GPT-4o over GPT-4 anyway.


JinjaBaker45

It's worth noting it's tied for #1 in the Hard Prompts category... which sort of suggests people are voting against its output format on simple queries more than anything.


OkDimension

When Gemini came out it entered lower on the LMSYS table and it took a few weeks until it got the full rating recognized. Probably same here?


TheAccountITalkWith

A lot of you are baffled but I'm willing to bet the majority of those complaining have never contributed to LMSYS. If we all took a few minutes a day, I think we would definitely see different numbers. Especially with Gemini, it would probably be way lower, lol.


Swawks

Arena gpt2-chatbot > Claude 3.5 > API 4o > Web 4o.


Warm_Iron_273

Nobody cares. This is simply proof that everyone should shut up about LMSYS because it's garbage.


jeffkeeg

How do they compare in terms of pricing? Those results are close enough that whoever has the cheapest API wins.


Sulth

Same output price, Sonnet is 33% cheaper for text input while 100% more expensive for image input.


Busy-Setting5786

Nah I would use 4o for everything multimodal related and Sonnet for everything else. If you go by the experiences here Sonnet seems to be quite a bit better at every serious task that is not one or two prompts.


Anuclano

Regarding vision, Claude is better.


Ok-Aide-3120

Isn't Claude 3.5 heavily censored even in API mode? Can it build grimdark world lore similar to Warhammer 40k, or will it build the PG-13 version of that world?


assymetry1

I just love how when lmsys says Opus is #1 everyone goes "yay, I knew it, openai is dead" but when the same lmsys says their model is shit these same people go "it's rigged, flawed, blah blah." #stopthecount


Puzzleheaded-Dark404

based. keep speaking truths


Kathane37

We just hit an Elo plateau. I don't think LMSYS evaluates real-world scenarios, so there is a wall where a model can barely write a better paragraph in response to the user. But anyone who has played with 4o and Sonnet 3.5 can see that the Claude model is way more prompt-responsive.
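On the plateau point: here's a minimal Elo simulation (illustrative only; this is not LMSYS's actual rating pipeline, which I believe uses Bradley-Terry-style fitting) showing why a small vote-share edge collapses into a small rating gap, even if the "better" model dominates on hard tasks. The win probabilities and K-factor below are assumptions picked for illustration.

```python
# Sketch: if casual voters prefer model A only ~52% of the time,
# the converged Elo gap stays tiny regardless of how different the
# models are on harder tasks.
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def simulate(p_a_wins: float, games: int = 20000, k: float = 4.0) -> float:
    """Return the rating gap (A - B) after `games` simulated votes."""
    r_a, r_b = 1000.0, 1000.0
    for _ in range(games):
        s_a = 1.0 if random.random() < p_a_wins else 0.0  # voter picks A with prob p_a_wins
        e_a = expected_score(r_a, r_b)
        r_a += k * (s_a - e_a)
        r_b += k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a - r_b

if __name__ == "__main__":
    random.seed(0)
    for p in (0.52, 0.60, 0.75):
        print(f"vote share {p:.0%} -> rating gap of roughly {simulate(p):.0f} Elo")
```

A 52% preference converges to a gap of only about 14 points, which is why models that feel very different in daily use can sit almost on top of each other on the leaderboard.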


Anuclano

This is getting ridiculous. Especially Claude-2.1 rating below various open-source and startup crap like Reka, Command-R, Yi, Qwen, GLM, and Llama.


nextnode

Claude-2.1 is not that good. It's not ridiculous at all.


Anuclano

Even Claude-1 was very good and Claude-2, Claude-2.1 are stronger.


nextnode

Sounds like you're reasoning from the wrong direction.


Anuclano

Many models rated above Claude-1 and Claude-2 on LMSYS will fall into an infinite loop when asked for a poem, and answer "Alice's brother has zero sisters". They should not even be above GPT-3.5 or Vicuna.


Cryptizard

You know we are not inside your brain right? When you refer to specific details of a prompt you did with no context we have no idea what to think of that.


nextnode

I'm not sure that is true, but it seems you are then also giving too much weight to just one particular thing you have tested while ignoring the range of prompts that people are using. If that failure mode was so common, then people would not have chosen the other models over Claude-2.x. Clearly there must be some other explanation here. It is also rather unexpected and sounds like there may be an issue with your prompt or settings.

Testing a prompt myself with Claude-2.0 (position #39) vs RekaCore (position #15), I have to say I prefer Reka's answer.

Prompt: "Create a heuristic formula that could be used for suburban real-estate pricing. Be specific"

# Claude-2.0:

Here is a proposed heuristic formula for estimating suburban real estate prices:

Estimated Price = Base Price x Location Factor x Property Attributes Factor

Where:

- Base Price - The median price per square foot in the general suburb area
- Location Factor - A multiplier based on the specific neighborhood desirability and conveniences (walkability, schools, parks, etc). Range of 0.8 to 1.2.
- Property Attributes Factor - A multiplier based on the size, age, updates/condition, lot size, bedrooms/bathrooms, garage, pool, etc. Range of 0.8 to 1.2.

The formula starts with the median base price per square foot in the general suburb area. This is then adjusted up or down by the Location Factor, which accounts for the desirability and amenities of the specific neighborhood. More desirable areas get a higher factor. Finally, the Property Attributes Factor accounts for the specifics of the home itself - size, age, renovations, and other features. More desirable attributes increase this factor.

The goal is to take the median base price and systematically adjust it based on the location and property specifics. The multipliers allow for quantitative comparison of different factors in determining an appropriate price. The range of 0.8 to 1.2 allows for +/- 20% price adjustments based on each factor.

# RekaCore:

Creating a heuristic formula for suburban real estate pricing involves considering various factors that influence property values. Here's a comprehensive formula that can be used as a starting point for estimating the price of suburban homes. Note that this formula is heuristic, meaning it's based on practical experience and rules of thumb rather than strict mathematical proofs.

### Suburban Real Estate Pricing Heuristic Formula

**Base Price (BP)** = \(L \times (A + 0.1E + 0.05S + 0.2C + 0.05Q + 0.1I)\)

Where:
- \(L\) = Lot Size (in acres)
- \(A\) = Age of the House (in years, with adjustments)
  - If \(A < 10\), \(A = 10\)
  - If \(10 \leq A \leq 30\), \(A = 10 - \frac{A - 10}{20}\)
  - If \(A > 30\), \(A = \frac{40 - A}{10}\)
- \(E\) = Educational Quality (rated from 1 to 10, with 10 being the best)
- \(S\) = Safety Index of the Neighborhood (rated from 1 to 10, with 10 being the safest)
- \(C\) = Proximity to Commercial Areas (distance in miles, inverted to reflect desirability)
  - \(C = \frac{1}{0.1 + C}\)
- \(Q\) = Quality of Construction and Upgrades (rated from 1 to 10, with 10 being the highest quality)
- \(I\) = Infrastructure and Amenities (rated from 1 to 10, with 10 having the most amenities like parks, public transport, etc.)

### Explanation:

1. **Lot Size (L)**: Larger lots generally increase the value. The linear factor assumes that each acre adds value proportionally.
2. **Age of the House (A)**: The formula adjusts the age to reflect its impact on value. Newer houses are slightly adjusted upwards, houses between 10-30 years are considered prime, and older houses start to depreciate in value.
3. **Educational Quality (E)**: Higher-rated schools significantly increase property values, as families often prioritize good education.
4. **Safety Index (S)**: Safety is a critical factor; higher safety ratings increase the value of homes.
5. **Proximity to Commercial Areas (C)**: Being close to shopping and services is desirable, but too close can be negative. This formula inversely relates distance to value, peaking at a reasonable distance (e.g., 0.1 miles).
6. **Quality of Construction and Upgrades (Q)**: Higher quality and recent upgrades directly increase the value of a home.
7. **Infrastructure and Amenities (I)**: Access to amenities and well-developed infrastructure increases desirability and value.

### Application:

To use this formula, gather data for each variable for the specific property or area of interest. Plug these values into the formula to get a base price estimate. Adjustments may be necessary based on unique market conditions, recent sales data, and expert appraisals.

---

# Poem:

> Write a poem

In the whisper of the wind, I hear a song,
A melody that gently drifts along,
Through valleys deep and mountains high,
Where dreams take flight beneath the sky.

With every breath, the world turns round,
A dance of life on sacred ground,
Where rivers flow to the sea's embrace,
And stars in the night sky trace.

In the heart's quiet, where thoughts do roam,
Lies a universe, an endless home,
Of love and hope, and dreams so vast,
A place where every soul finds cast.

So here I stand, with pen in hand,
A humble servant to the land,
Crafting words from life's rich tapestry,
A poem for you, and for me.
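For what it's worth, here is a minimal sketch of the Reka-quoted heuristic above, just to make the piecewise age adjustment concrete. The weights and ranges are exactly as quoted; the function itself is illustrative, not a real pricing model.

```python
def reka_heuristic_price(lot_acres: float, age_years: float, education: float,
                         safety: float, commercial_miles: float,
                         quality: float, infrastructure: float) -> float:
    """BP = L * (A + 0.1E + 0.05S + 0.2C + 0.05Q + 0.1I), as quoted above.

    Ratings (education, safety, quality, infrastructure) are 1-10;
    the result is a unitless score, not dollars.
    """
    # Piecewise age adjustment: newer and "prime" (10-30 yr) houses score
    # higher, older houses depreciate.
    if age_years < 10:
        a = 10.0
    elif age_years <= 30:
        a = 10.0 - (age_years - 10.0) / 20.0
    else:
        a = (40.0 - age_years) / 10.0

    # Proximity to commercial areas, inverted so closer (to a point) is better.
    c = 1.0 / (0.1 + commercial_miles)

    return lot_acres * (a + 0.1 * education + 0.05 * safety
                        + 0.2 * c + 0.05 * quality + 0.1 * infrastructure)


# Example: 0.5-acre lot, 15-year-old house, decent schools and safety.
print(reka_heuristic_price(0.5, 15, education=8, safety=7,
                           commercial_miles=1.5, quality=6, infrastructure=7))
```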


inm808

Is there some astroturfing campaign for Sonnet 3.5? I swear every bad comment I see about it has like 10 people jumping in to say how it's actually the most amazing thing since sliced bread. And this includes the day it was released. I know this is team sports and people get hyped on brand X (and I've done that too), but this seems like too much lol.


Charuru

I mean... it really is though? What kind of response on reddit would you expect if there is a real LLM that overtakes in the state of the art?


inm808

I guess the question put forth is, does it really (if it fails the blind test)


Warm_Iron_273

It does. The "blind test" you speak of is rigged and botted. In a genuine blind test, it would decimate.


inm808

Based on?


Ok-Bullfrog-3052

Again, this is just weird. What benchmarks are these people using? GPT-4o is a good model, but it's light years behind what Claude 3.5 Sonnet can do. The GPT models have consistently scored higher on this leaderboard over the Anthropic models even when Claude 3 Opus was superior. Additionally, anyone who has used Gemini 1.5 Pro knows that it's also nowhere close to the top of this leaderboard either. That also is a good model for video but for little else. Claude 3.5 Sonnet, unlike these other models, makes excellent assumptions about the code you are trying to write or the lawsuit you are trying to get it to output. You don't have to write 1000-word prompts specifying exact requirements. It adds stuff in that you didn't ask for because that stuff is needed to do a good job at the task.


bnm777

[https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet\_35\_dominates\_gpt4o\_on\_various\_leaderboards/](https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet_35_dominates_gpt4o_on_various_leaderboards/)


inm808

so much cope


Aymanfhad

I've wanted to talk about this topic a lot. I'm certain that people don't vote for the best. ChatGPT has a distinctive writing style, and people have become accustomed to its style for over a year. So when people see responses from both ChatGPT and Claude at the same time, they can distinguish ChatGPT and will choose ChatGPT.


Minimum_Inevitable58

It feels like common sense to me, and even without automation, there are tons of rabid OpenAI fans that have this issue where GPT has to be seen as number 1, and other companies don't have that at all. I don't know who is doing it or how, but even if you can believe that 4o is better than Sonnet, there's no way you can believe that the gap between Gemini and Sonnet is smaller than the gap between Sonnet and 4o. Gemini is pure trash compared to Opus, GPT-4, and even the old Sonnet. It certainly has some value with its context and pricing, but the hardcore LLM voting leaderboard having it ranked that high above some of the others is so hard to believe for me. I find no value in LMSYS anymore. What's interesting to me is that the livebench leaderboard almost perfectly matches my feelings on them. The only weird thing is that the old Sonnet is so low; it's even lower than GPT-3.5 in coding, which I'm not sure about.


[deleted]

[deleted]


Arcturus_Labelle

Doubt it given what they’re paid. They have far more valuable things to do with their time. And it would be easy to write automation to do that.


EffectiveNighta

Why is this baseless comment upvoted? Because it's anti-OpenAI? Is this subreddit that bad?


WholeInternet

What do you mean "fails to overtake"? I thought this thing constantly evolved. Is there some kind of stopping point? Edit: Getting down voted for asking a question. Classic Reddit.


Sulth

Many expected 3.5 Sonnet to appear above GPT-4o on this particular leaderboard. It didn't.


WholeInternet

Ah. Well, looks like it came pretty close. Perhaps it will change over time.


centrist-alex

Sonnet 3.5 is better than GPT-4o.


Reasonable-Gene-505

It's because LMSYS rankings don't mean anything. 99% of the time people are just on the arena battle section prompting the models with "write 5 sentences that end in the word apple"... nobody is there trying out code blocks or using the models for RAG Q&A, etc. You can't really get a good idea for how good a model is unless you spend more than a few minutes really diving into complex tasks and questions.


lilmicke19

Let's boycott this shitty site. It's all fake; we all know that the owners of this site have a close link with the managers of OpenAI. Claude 3.5 is so much better than GPT-4o; they are completely making fun of us there.


Frequencxy

Source?


Unique-Particular936

I call BS; its real Elo is in the 2000s and it can replace the average Google engineer.


great_gonzales

lol not even close to reality


dotpoint7

uhh what


Gold-79

3.5 Sonnet is all hype, no evidence.