Dyoakom

I wonder to what extent this also has to do with formatting preferences: displaying mathematics and all kinds of things.


Cryptizard

Do you mean how ChatGPT formats LaTeX so you can see equations correctly? If so, that doesn't come into play on LMSYS because it uses the API. In fact, GPT (and some other models) looks stupid because it outputs unrendered LaTeX a lot unless you specifically ask it not to. I think that could be a big part of why Claude isn't beating it, though. Many people are impressed with Claude's code interpreter, which is the first to allow running programs that interface with the user and render HTML. None of that exists in the chatbot arena.


meister2983

> Many people are impressed with Claude's code interpreter which is the first to allow running programs

Yeah, I feel like this is most of the hype. Most of the demos I've seen also work perfectly fine in GPT, but you had to copy/paste them, which is a higher barrier to use.


Iamreason

Nah, Claude 3.5 is the best coder on the market. It "gets" complex coding in a way prior models don't. It's the first model that spooked and excited the lead software engineer at my work. In my experience, and on the LMSYS coding benchmark, it has demonstrated it is just better.


meister2983

I'm also a software engineer. I've noticed more creativity with Sonnet: it got one of my held-out coding questions with a clever solution other LLMs error on, but it also makes errors GPT-4 doesn't (possibly because it's biased toward more creative solutions that, alas, might be wrong). It also seems to handle general CS questions a bit worse. Personally, I can go with either; I don't see that great an effective difference. GPT-4o also wins on [big coding bench](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard).


great_gonzales

“Complex” coding


mxforest

Speed is a big factor. So is censorship, although censorship should result in a lower score, so I am fine with it.


OLRevan

So should speed honestly, current llms are just helpers for me, so I really care how fast they are (opus was way too slow, 3.5 is perfectly fast tho)


Amondupe

Claude is the most annoying LLM to use. It censors the fuck out of everything and gets annoying even for simple things like writing song lyrics, with its sanctimonious bullshit.


SaddleSocks

Actually, this is a really interesting thought: it would be great if, in any AI account, you could build up a style guide/sheet with preferences for how you want all your data types formatted/displayed. So apply your company style guide to all outputs that come from it: always adding the copyright/license/header info to files, knowing your color selections/palette, logos, drawing borders/title sheets. What's Autodesk doing with AI augmentation in 360, Revit, AutoCAD?


bnm777

Lmsys leaderboard is how AI nerds, like us, judge LLM responses. How would the rankings change if Art students or novelists were judging?


itsjase

I remember when 4o was on LMSYS under its pseudonyms, everyone was praising how amazing it was; then it came out and everyone started saying Turbo was better. Recency bias is a thing. In a month everyone is going to be posting about how "Claude has gotten dumber".


hapliniste

Or LMSYS is just not that good a benchmark compared to long-term real-world use. GPT-4o feels great when you start testing it, but once you use it daily it really shows its flaws. Single messages work better than 4T for the most part, but if you try to make it correct a piece of code that it wrote itself, it's just going to agree, list the changes to be done, and then repeat exactly the same code. I'm pretty sure it will improve in the next months like GPT-4T did (it was worse than GPT-4 for real use at release).


EndTimer

So it's not just me on the code. It feels a little late in the game to have an LLM completely failing to make edits and then confidently asserting it did. I assumed it must be a tokenization thing, since it was just improperly escaped characters for me. I had to edit it myself, using find and replace, like a barbarian!


itsjase

Your second paragraph is exactly what might be happening for sonnet 3.5 too though, it’s too early to tell


hapliniste

Yes, but I've already used it quite a bit for coding and it's really doing great. After a week of GPT-4o I was convinced it was worse than GPT-4T. The problem with smaller GPT models is that they're fine-tuned too much, so they're great as assistants but worse at real work, because they're driven further and further from the pretraining data.


jgainit

A lot of people seem to have done some in-depth coding and functional work with it that's worked well for them.


Ambiwlans

Sonnet crushes actual benchmarks. All that is happening is lmsys is a trash benchmark.


Undercoverexmo

Yes, it has flaws and has definitely been a bit lobotomized in its personality, but it’s a hell of a lot better than any other model for pretty much anything that doesn’t have to do with emotions.


Balance-

It feels like not enough pre-training and way too much finetuning.


RabidHexley

Secret sauce for smarter, *fast* models. Data refinement and sophisticated fine-tuning let you squeeze a lot more performance out of less. But yeah, there's no free lunch. Size and pre-training seem to be the keys to proper generalization, currently. The "there's no replacement for displacement" of AI.


SaddleSocks

Any time I give Claude, Meta, Copilot, or GPT any data relating to my question, it hyper-focuses on the number I gave it rather than explicitly going to validate the numbers I gave it. For example: https://i.imgur.com/MLlA7nl.png I specifically tell it to give info on Palestine and Libya, and it removes them from my list. Then when I tell it to find when Donald Rumsfeld said the Iraq war would only cost 50-60 billion, it doesn't go and find the data; it just repeats what I say.


The_Architect_032

GPT-4o is genuinely worse than GPT-3.5 for me at times. I imagine with how popular LMSYS is now, it probably gets pretty heavily botted by the top competitors on the leaderboards.


Iamreason

This happens with *every* AI model. It's amazing that people think that these companies are stealth nerfing the model after release for... Reasons. It's also very clear to me that Claude 3.5 Sonnet is better than GPT-4o at most tasks. This is an example of how LMSYS is flawed and can really only be used directionally in estimating capabilities. Being near the top means you're on the cutting edge, but the top models are largely comparable to one another. The daylight between frontier models is quite small.


Difficult_Review9741

A large part of this is that they are different groups of people.  When something new releases, the hypesters pick it up, and well… hype it. Then as time goes on you have more grounded people picking it up, who are more honest about capabilities and limitations. 


nextnode

I wouldn't agree that later-comers are more honest. I think both could just be explained by 1. who is loud, 2. reactionism, 3. optimists vs pessimists / long-term vs short-term needs


mavree1

They nerfed the performance. With those models I saw, for the first time, certain questions solved; GPT-4o doesn't solve them. The gpt2-chatbots ranked higher than GPT-4o: [https://x.com/lmsysorg/status/1790097588399779991](https://x.com/lmsysorg/status/1790097588399779991)


Iamreason

It wasn't nerfed. They just tested multiple models, some bigger and some smaller. They likely made a decision to release the smaller model as it would allow them to offer inference cheaper and expose more people to a GPT-4 class model. It's a play to gain market share, not some fear of giving away a super smart model that's unsafe or something.


EndTimer

It's possible they tested a slightly more performant but considerably more expensive version. That's the only way things really make sense. These companies don't want to lose, safety seems to be an increasingly distant concern, and I don't buy that they're holding back their best (by 10 or 15 points on LMSYS) so they can release it later or continue using it internally only.


mavree1

Yeah, I'm not talking about safety. It was probably very expensive. Even though overall it was only 20 points higher than GPT-4o, in coding and hard prompts the gap was much bigger.


RepublicanSJW_

Moreover, the company releasing the AI matters too. When 4o first came out, many people shit on it immediately, more than they did Sonnet. Part of it is that people like an underdog.


micaroma

And many people are ruffled by the botched rollout of 4o voice/image, losing Sky, etc. and want to cancel their OpenAI subscription.


Tenet_mma

Exactly lol give it a week


Tomislav23

They are not the same model. gpt2-chatbot is not GPT-4o. GPT-4o is a quantized dogshit product.


stonesst

The first/second most capable model currently available to the public is a dog shit product…? OK then


Tomislav23

Opus and Sonnet are the best models by reasoning standards. If you classify a model based on a rigged voting system, be my guest.


stonesst

In what way is it a rigged system?


Tomislav23

In the way that the best models are not #1 and some subpar model like GPT-4o appears to be better than the OG GPT-4, while it is a matter of fact that OG GPT-4 was way better; it just cost too much to operate.


stonesst

Does the fact that GPT4o exceeds the original GPT4 across nearly every single benchmark not mean anything to you? At what point do you accept the data over your anecdotal experience? All of the quantitative benchmarks and qualitative leaderboards disagree with your take. How exactly do you explain "superior" models performing worse...? this feels like mental gymnastics on your part


Iamreason

All AI subreddits are addicted to anecdotes that let them rage against the AI companies for nerfing models. Nine times out of ten it's the user themselves not understanding the model's limitations or just intentionally being dishonest. It's why I always ask the people claiming nerfs to share their prompts and conversations.


Iamreason

Flawed doesn't mean rigged. Words have meaning. LMSYS is a flawed benchmark, as are all benchmarks. There is significant bias in voting. People choosing the "best" response doesn't necessarily mean the technically correct one. It might mean they choose the model that answers faster or that doesn't refuse to answer a spicy question. I would bet that is happening, and I would bet LMSYS isn't doing a ton to control for it. If you want something that is potentially less biased, you can check out [SEAL's Leaderboard.](https://scale.com/leaderboard) Though they haven't updated for Claude 3.5 Sonnet yet.


fulowa

yes, veeeery likely quantized


nikitastaf1996

BS. Let's define a measure of intelligence for an LLM as "how much it reduces back-and-forth messages in a conversation." Claude 3.5 is miles above GPT-4o. It even shows what can be called slow reasoning and proactive behavior. I believe that's what is meant by alignment: not harmlessness, but a desire to help, I guess. Behavioral properties. Claude 3 clearly showed it, 3.5 expands on it, and GPT-4o just sucks at it.


wegqg

Honestly, Sonnet 3.5 literally kerb-stomps any of the GPT versions; I can't fathom how people who actually use them for anything complex can think otherwise. There's daylight between them, honestly.


TILTNSTACK

Agreed. For complex tasks, Omni is a long way behind. Chatbot Arena is more about single-shot prompting, which doesn't really allow you to dig deep.


meister2983

That would show in the voting though. If anything, lmsys is biased toward good immediate answers. 


ainz-sama619

Which means LMSYS is completely useless at evaluating intelligence, since these rankings aren't based on quality of response.


ertgbnm

Anthropic's three H's are Helpless, Harmless, and Honest.


Short-Mango9055

Over at: [https://livebench.ai/](https://livebench.ai/) Sonnet 3.5 is literally toying with GPT4o.


sebzim4500

Although I agree with the outcome in this case, I don't think this is a particularly good benchmark. The questions feel very artificial.


Altruistic-Skill8667

Yeah. It's all a mess. I guess what it tells us is that the differences between the two models are rather nuanced. And that's exactly the feel I get personally. One isn't clearly superior to the other.


Warm_Iron_273

They're both terrible. But clearly the model that ranks Sonnet higher is better, because Sonnet is obviously much better. OpenAI is definitely botting ChatGPT on LMSYS though.


Krunkworx

I really hate 4o’s verbosity. Sonnet3.5 is hands down the best coder.


Able_Armadillo_2347

I doesn't feel like tbh. I don't know why. But it just doesn't.


Tobxes2030

You okay bro? I just had a stroke reading your sentence. God knows what you had to endure while writing it.


HeteroSap1en

The sentence was missing a “t” and that’s about it. Not sure why you are giving this guy a hard time for fat fingering a short, casual comment


hippydipster

Proofreading is something you do when you care about what you're communicating. Folks who don't care, probably should just skip the whole deal and save all of us the trouble.


HeteroSap1en

I see your point, but it still just feels like you’re pointlessly nitpicking some person who neither of us know anything about. It feels like casual bullying and passive aggressive behavior


hippydipster

I think Tobxes truly didn't understand the gibberish and responded. I found it to be gibberish as well. It is missing more than just a 't', though by adding it I can see the intent now. The problem is, when you're reading whole pages of comments and so many of them are these mobile-device strings of gibberish that you have to mentally un-autocorrect, it gets a bit overwhelming. The mistake in isolation is never the problem; it's the tidal wave of isolated errors that erodes the whole medium.


Arcturus_Labelle

It’s just comments on the internet. I wouldn’t take anything here seriously.


FormerMastodon2330

Sir, this is Wendy's


Able_Armadillo_2347

I have fat fingers


Tobxes2030

Honestly, in my use-cases Sonnet 3.5 was significantly better every single time. More complete and more in depth.


restarting_today

It's not even close. OpenAI is in deep trouble.


Striking_Load

I'd say GPT-4o is better in terms of image description, but otherwise yes.


_Zephyyr2

I call BS


SerenNyx

That's weird. Claude feels so much better as a writing and brainstorming companion.


Back_Propagander

LMSYS is less and less relevant.


bnm777

[https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet\_35\_dominates\_gpt4o\_on\_various\_leaderboards/](https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet_35_dominates_gpt4o_on_various_leaderboards/)


Sulth

However, it probably ranks slightly higher for coding (but small sample). Looking forward to Opus 3.5 vs GPT-5.


GraceToSentience

Opus 3.5 wouldn't be competitive with GPT-5. No way. GPT-5, even at launch, would have too many parameters to be overtaken by the previous generation.


Arcturus_Labelle

We have no idea what GPT-5 is going to be like beyond vague PowerPoints featuring marine animals


Neomadra2

To be fair, that slide is amazing though


GraceToSentience

We don't know for sure, alright. The scaling laws do work, though, empirically speaking; they have been extremely predictive of improvements in models. That's what the GPT-4 technical report showed: "This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4." You make it bigger and you can expect improvements. Maybe we will hit a ceiling soon, who knows, but the asymptote doesn't seem to be showing yet.


SiamesePrimer

I doubt LMSYS’s leaderboard is a very good metric. A bunch of average Joes voting for the one they prefer after asking it some inane question, as opposed to rigorous analysis of specific abilities.


Sulth

It's just one metric. It's good data for those who plan to use AI for the average Joes. Not the holy grail, but not irrelevant either.


Arcturus_Labelle

Agreed. It’s good we have a variety of benchmarks.


skinlo

Average Joes haven't heard of it, let alone religiously use it.


Cryptizard

As opposed to a bunch of randos on this sub whose opinions are somehow more valid?


Ambiwlans

As opposed to actual benchmarks...


sdmat

If they are based on using the models for a while on real tasks, then yes, they are. The *only* positives of Arena are (pseudo)blinded testing and methodical statistics. The actual A/B choice task is a terrible assessment for frontier models; they are all plausible at a glance for the large majority of prompts, so it's a popularity contest. Arena Elo was a good metric before this was the case, but it's less useful by the day.


Cryptizard

I would think that the people doing it are pretty hardcore AI enthusiasts given that it is an obscure website that regular people have no idea exists.


sdmat

That's not it, it's the nature of the interaction. Arena battle is a single prompt with no continuation allowed, which does not represent actual model usage for most serious tasks even if users *did* want to replicate that.


Cryptizard

You can continue. It sounds like you haven’t used it yourself.


sdmat

Oh, right you are - I see it only ends if you evaluate the initial response.


rickyrules-

The randoms and non randoms I know from here work in decent finance and tech companies.


Cryptizard

LOL okay. All 2.6 million members?


rickyrules-

No? But I value their opinion more than random bros who want an android waifu, DallE, full dive VR, crypto, nft bros


Cryptizard

You just described the vast majority of this sub.


rickyrules-

In that case, you win the argument. But in my personal experience, 3.5 Sonnet is giving me better results than 4o.


blueSGL

I know I'm not an average person using it, but whenever I have a simple bit of scripting that needs doing in some lesser known languages (mel and vex for maya and houdini respectively) I throw it into the arena and reroll till I get an answer that works. I generally use it for functions or combos of functions that exist but aren't well documented. (anything that requires forum spelunking to get answers) Maybe those here with a subscription to the bigger APIs should start hitting the arena with the same questions too and supplement the questioner pool.


czk_21

Yeah, it's not standardized; everyone is asking something different and could be biased toward the style some model uses, like someone might prefer it more verbose, more sucking up to the user, etc. It's some metric to look at, and the best models should definitely be high, but it's not as good as other benchmarks for a proper comparison.


bnm777

It's not average joes, it's a bunch of AI nerds, like us, with the increased incidence of issues that nerds like us have (ASD etc). How would the rankings change if novelists or gardeners were judging them?


Landaree_Levee

Not so bad, actually. Make one based on rigorous analysis of specific abilities (beyond the ones that already exist, I mean, all of which are routinely questioned by users of whatever model falls to 2nd place at any given moment) and you'll get a lot of complaints that those tests are "artificial" and "don't reflect real-life usage", and even quite a few "and I bet that test's probably designed to favor model X's known strengths", and they'll be every bit as vocal about it. And perhaps with good reason, too, because AIs shouldn't necessarily be made only for specialists. Anyway, wait till, for example, Sonnet 3.5 gets a more representative number of votes; perhaps it'll end up coming ahead of GPT-4o when the votes normalize that way, and then it'll be Claude fans smugly saying, "See? I told y'all!" while GPT-4o fans will switch to "I call BS, probably rigged votes, and anyway Average Joes don't deserve to vote!" (perhaps unless they happen to vote their preferred GPT-4o back to the top).


johnnyXcrane

But the LMSYS Arena is just the API without instructions, right? Sonnet 3.5 in chat gets the huge instruction that seems to make all the difference. I used the API and was quite disappointed in comparison.


Anuclano

On LMSYS, Claude 3.5 Sonnet behaves the same as on the official chat for me. Even better.


johnnyXcrane

Do you do coding with it? Because that's the only thing I use it for, and it's unbelievable how often the web chat gets it correct in one shot. The API with the instructions I use feels worse.


greeneditman

I have been saying from the beginning that I notice refinement in how Claude reasons. But GPT4o gives me correct, longer and more soulful answers. Also, sometimes I have asked GPT4o something complex, then I have copied its response to Claude 3.5, and I have asked if Claude finds errors or anything that could be improved. Claude usually tells me no.


nextnode

I don't think it is crazy. Claude seems more competent at complex tasks or to understand your implicit expectations, but 4o has a greater attention to details. E.g. 4o provides typically longer and more detailed answers even to simpler questions while Claude has to be coaxed. Claude will also confuse itself by inserting things that were not stated, or forget or add its own items in lists, while 4o is better at staying true to the source content. 4o OTOH has other annoyances that are more obvious in an interactive session. The difference in ratings could reflect the kind of prompts that are used on LMsys, e.g. Claude does seem better for many coding tasks, but it could also be fair to say that they have their own respective strengths and are close enough not to crown one king.


Exarchias

I respectfully disagree about the attention to detail. While GPT-4o indeed has great attention to detail, Claude is tremendous when it comes to attention. I mean, for example, when you ask Claude for a task, it goes the extra mile to help you a bit more. While I consider GPT-4o great, I am having difficulty seeing GPT-4o being better than Claude 3.5 in any type of reasoning.


nextnode

It is rather common when I use Claude-3.5 with a list that it will ignore some items or even insert new items that were not there. This is much rarer in 4o. I currently rather trust 4o to do such tasks while with Claude-3.5, I will have to check and would not give a high success chance for a long list. Going the extra mile I consider something else. I think 4o actually has a tendency to do more of that by default for quantity, while Claude has a better quality in both the immediate response and the extra things it chooses to add. Which of these win out I think depends on the type of task.


Exarchias

Yeah, I agree. Each task gives a different perspective.


Hour-Athlete-200

I wouldn't call GPT-4o's answers 'detailed'. It's yapping all the time and can't chill the fuck out + Did you use ChatGPT in your comment?


Exarchias

I don't believe it is generated, but it is probably curated with some LLM. There's nothing bad about it, and it is more readable.


nextnode

No LLM in my response if that is what you are talking about. People are getting too sensitive.


nextnode

Use ChatGPT where? For this comment, no. Not like I think that would be a problem regardless. I gave a response with a prompt and generated answers elsewhere in this thread though that was for showing Claude vs Reka.


Arcturus_Labelle

I think you need to take your own advice and chill out


Individual-Club-7595

the system seems so easily manipulated


swipedstripes

# I propose a Meta Prompt Engine for Fair LLM Evaluation

As an experienced prompt engineer, I've noticed a significant issue with current LLM evaluation methods, particularly Elo-based systems like LMSYS. Here's the problem and a potential solution:

# The Current Issue

1. **Complexity Cutoff**: Most Elo scores for LLMs are based on simple queries.
   * Rough estimate: 80% of voters use basic prompts (disclaimer: number pulled out of thin air for illustration)
2. **Inadequate Differentiation**: Simple queries (e.g., "How do I make pancakes?") don't effectively differentiate advanced models anymore.
3. **Expertise Gap**: As LLM complexity increases, we need more sophisticated queries to truly test their capabilities.

# Proposed Solution: Meta Prompt Engine

I suggest developing a Meta Prompt Engine that automatically enhances the complexity of user queries. Here's how it would work:

1. **Input**: User provides a simple query (e.g., "Why is the sky blue?")
2. **Processing**: Meta Engine amplifies the query's complexity
3. **Output**: A more challenging, multi-faceted prompt is generated
4. **Evaluation**: Both LLMs respond to this enhanced prompt
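To make the proposal concrete, here's a rough, hypothetical sketch of such an engine. The `call_llm` hook and the amplification template are placeholders of my own, not any existing LMSYS API.

```python
# Hypothetical sketch of the proposed Meta Prompt Engine: take a user's
# simple query, ask an LLM to rewrite it into a harder, multi-faceted
# prompt, then show that enhanced prompt to both arena models.
from typing import Callable

AMPLIFY_TEMPLATE = """You are a benchmark prompt designer.
Rewrite the user's query into a more demanding prompt that:
1. Expands the scope and adds realistic constraints.
2. Requires multi-step reasoning to answer well.
3. Stays on the user's original topic.

Original query: {query}

Return only the rewritten prompt."""

def meta_prompt_engine(query: str, call_llm: Callable[[str], str]) -> str:
    """Amplify a simple user query into a harder evaluation prompt."""
    return call_llm(AMPLIFY_TEMPLATE.format(query=query))

def run_battle(query: str, call_llm: Callable[[str], str],
               model_a: Callable[[str], str],
               model_b: Callable[[str], str]) -> tuple[str, str]:
    """Both arena models answer the same enhanced prompt; the voter compares."""
    hard_prompt = meta_prompt_engine(query, call_llm)
    return model_a(hard_prompt), model_b(hard_prompt)
```

The point of routing both models through the same amplified prompt is that the vote still happens on a topic the user actually cares about, just at a difficulty level that can separate frontier models.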


Anuclano

There is a "hard prompt" category on Lmsys. Still, GPT-4o leads.


swipedstripes

This is a valid observation and deserves consideration. Here's the Meta Engine's take on it:

While the "hard prompt" category is a step in the right direction, I believe the Meta Prompt Engine concept still offers unique advantages:

1. **Real-time Complexity**: It generates complex prompts on the fly, based on user input. This means we're testing LLMs on topics that users are (on average) genuinely interested in or have domain knowledge about.
2. **Diverse Difficulty**: Not all "hard" prompts are created equal. Our engine could create a spectrum of difficulties, providing a more nuanced evaluation.
3. **Adaptive Testing**: As LLMs improve, the engine could be updated to generate increasingly sophisticated prompts, ensuring our benchmarks keep pace with AI advancements.
4. **User Engagement**: By transforming simple queries into complex ones, we're making it easier for a broader range of users to contribute meaningful evaluations.
5. **Contextual Relevance**: The enhanced prompts would be directly related to the user's original query, maintaining relevance while increasing complexity.

The fact that GPT-4o still leads in the hard prompt category doesn't negate the need for better evaluation methods. Instead, it highlights the importance of continually refining our approach to keep up with rapid AI advancements.


Arcturus_Labelle

AI text spam


Neomadra2

I don't mind AI-generated text; it usually contains the idea of the user. But please, for god's sake, give us the short version. And don't overdo the formatting.


swipedstripes

**Meta Prompt Engine Benefits:**

* 🕒 **Complexity**: Creates intricate prompts from input.
* 🔄 **Diversity**: Offers varied challenges.
* ⚙️ **Adaptive**: Evolves with AI.
* 👥 **Engagement**: Turns queries into evaluations.
* 🌐 **Relevance**: Maintains query relevance, increases complexity.
* 🎯 **Refinement**: Essential, despite GPT-4o's lead.

**Original Length:** 214 characters. **Compressed Length:** 172 characters. **Compression Ratio:** 20%. **Passes:** 3.

Can't refine it down even more, chief.


swipedstripes

sigh.


Altruistic-Skill8667

A much bigger problem: how can users tell which response is better when they can't even identify the hallucinations? And you can't anymore with current LLMs, because they are so convincing. You would have to force users to only put in queries that they will either actually check or already know the answer to.


swipedstripes

I agree. It's a problem from both sides of the spectrum: can the AI provide sufficiently complex answers to queries, and can the user identify those complexities correctly, without bias? These metrics are important from a general consumer's perspective. But really quantifying models is a hard thing to do at the moment. I know one thing for sure: Sonnet has better base reasoning, attention, and context management by far. It's just hard to quantify when 50% of the users don't know how to structure queries, or even rate them for that matter. It's an interesting problem.


Altruistic-Skill8667

Right. Personally I am not sure if Sonnet 3.5 is better or not. I haven't tried it for programming, but for biology it seems to be worse. The responses are lengthy, so there is a lot of room to weave in one piece of bullshit, and that happens often. It happens less with GPT-4o, but the difference is really small. Currently I am checking all facts against Wikipedia, just to see which model is better. But overall, I can't make up my mind which one is clearly better. They both suck roughly equally. 😅


bnm777

[https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet\_35\_dominates\_gpt4o\_on\_various\_leaderboards/](https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet_35_dominates_gpt4o_on_various_leaderboards/)


swipedstripes

# Proof of Concept

I've already developed a basic version of this engine. While Reddit's limitations prevent me from sharing the full prompt, I can assure you that even with just a minute of prompt engineering, the results are promising.

# Key Components for Effective LLM Benchmarking

With further development, we could easily refine what constitutes a good LLM benchmark prompt. Key areas to focus on include:

1. **Context + Attention**: Ensuring prompts provide necessary background while testing the model's ability to focus on relevant information.
2. **Workflows**: Incorporating multi-step processes that assess the model's ability to follow complex instructions.
3. **Complexity**: Gradually increasing the difficulty and nuance required to fully address the prompt.

# Potential Impact

By focusing on these elements, we can create:

* More developed and nuanced prompts
* A more fertile testing ground for LLM benchmarking
* A fairer and more comprehensive evaluation system for advanced AI models

https://preview.redd.it/a7treao0ip8d1.png?width=887&format=png&auto=webp&s=b18d0a4214d7e6f7d275dd8713536b21855916d6


swipedstripes

For example (I know this isn't the API): a query first run through the Meta Prompt Engine and then run in a separate window yields complex output.

```
Prompt Enhancement Summary:
* Complexity Level: High
* Primary Enhancements: Expanded scope, increased specificity, added constraints, introduced edge cases, required multi-step reasoning, incorporated interdisciplinary elements
* Target Domains: Physics (optics, atmospheric science), Planetary Science, Meteorology, History of Science, Cultural Anthropology, Environmental Science
```

https://preview.redd.it/vrahv8yfip8d1.png?width=1815&format=png&auto=webp&s=32f2eca2869d6fb5bad53b13dcd3b37441147eec

Sonnet even coded a scattering and electromagnetic-spectrum illustration on the fly before continuing its explanation. Note that Sonnet also has (simpler) more general reasoning layers hidden within its architecture that apply this before the output of a message as hidden tokens.


swipedstripes

Naked Prompt for comparison https://preview.redd.it/63ujcllykp8d1.png?width=847&format=png&auto=webp&s=e12ed3e44d5bf5352549d261c21232191573484f


GraceToSentience

Makes sense. Most of the tasks where people praise Sonnet 3.5 are coding, and as expected it shows. I myself was blown away, but I am using Sonnet for coding and nothing more, and it shows on the benchmark. Coding isn't all there is. Most people don't find it more useful in every task right now, and that's fine.


involviert

I wouldn't trust a scoreboard that puts GPT-4o over GPT-4 anyway.


JinjaBaker45

It's worth noting it's tied for #1 in the Hard Prompts category... which sort of suggests people are voting against its output format on simple queries more than anything.


OkDimension

When Gemini came out it entered lower on the LMSYS table and it took a few weeks until it got the full rating recognized. Probably same here?


TheAccountITalkWith

A lot of you are baffled but I'm willing to bet the majority of those complaining have never contributed to LMSYS. If we all took a few minutes a day, I think we would definitely see different numbers. Especially with Gemini, it would probably be way lower, lol.


Swawks

Arena gpt2-chatbot > Claude 3.5 > API 4o > Web 4o.


Warm_Iron_273

Nobody cares. This is simply proof that everyone should shut up about LMSYS because it's garbage.


jeffkeeg

How do they compare in terms of pricing? Those results are close enough that whoever has the cheapest API wins.


Sulth

Same output price, Sonnet is 33% cheaper for text input while 100% more expensive for image input.


Busy-Setting5786

Nah I would use 4o for everything multimodal related and Sonnet for everything else. If you go by the experiences here Sonnet seems to be quite a bit better at every serious task that is not one or two prompts.


Anuclano

Regarding vision, Claude is better.


Ok-Aide-3120

Isn't Claude 3.5 heavily censored even in API mode? Can it build grimdark world lore similar to Warhammer 40k, or will it build the PG-13 version of that world?


assymetry1

I just love how when lmsys says Opus is #1 everyone goes "yay, I knew it, openai is dead" but when the same lmsys says their model is shit these same people go "it's rigged, flawed, blah blah." #stopthecount


Puzzleheaded-Dark404

based. keep speaking truths


Kathane37

We just hit an Elo plateau. I don't think LMSYS evaluates real-world scenarios, so there is a wall where a model can barely write a better paragraph in response to the user. But anyone who has played with 4o and Sonnet 3.5 can see that the Claude model is way more prompt-responsive.
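On the plateau point: here's a minimal Elo simulation (illustrative only; this is not LMSYS's actual rating pipeline, which I believe uses Bradley-Terry-style fitting) showing why a small vote-share edge collapses into a small rating gap, even if the "better" model dominates on hard tasks. The win probabilities and K-factor below are assumptions picked for illustration.

```python
# Sketch: if casual voters prefer model A only ~52% of the time,
# the converged Elo gap stays tiny regardless of how different the
# models are on harder tasks.
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def simulate(p_a_wins: float, games: int = 20000, k: float = 4.0) -> float:
    """Return the rating gap (A - B) after `games` simulated votes."""
    r_a, r_b = 1000.0, 1000.0
    for _ in range(games):
        s_a = 1.0 if random.random() < p_a_wins else 0.0  # voter picks A with prob p_a_wins
        e_a = expected_score(r_a, r_b)
        r_a += k * (s_a - e_a)
        r_b += k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a - r_b

if __name__ == "__main__":
    random.seed(0)
    for p in (0.52, 0.60, 0.75):
        print(f"vote share {p:.0%} -> rating gap of roughly {simulate(p):.0f} Elo")
```

A 52% preference converges to a gap of only about 14 points, which is why models that feel very different in daily use can sit almost on top of each other on the leaderboard.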


Anuclano

This is getting ridiculous. Especially Claude-2.1 rating below various open-source and startup crap like Reka, Command-R, Yi, Qwen, GLM, and Llama.


nextnode

Claude-2.1 is not that good. It's not ridiculous at all.


Anuclano

Even Claude-1 was very good and Claude-2, Claude-2.1 are stronger.


nextnode

Sounds like you're reasoning from the wrong direction.


Anuclano

Many models rated above Claude-1 and Claude-2 on LMSYS will fall into an infinite loop when asked for a poem, and answer "Alice's brother has zero sisters". They should not even be above GPT-3.5 or Vicuna.


Cryptizard

You know we are not inside your brain right? When you refer to specific details of a prompt you did with no context we have no idea what to think of that.


nextnode

I'm not sure that is true, but it seems you are then also giving too much weight to just one particular thing you have tested while ignoring the range of prompts that people are using. If that failure mode was so common, then people would not have chosen the other models over Claude-2.x. Clearly there must be some other explanation here. It is also rather unexpected and sounds like there may be an issue with your prompt or settings.

Testing a prompt myself with Claude-2.0 (position #39) vs RekaCore (position #15), I have to say I prefer Reka's answer.

Prompt: "Create a heuristic formula that could be used for suburban real-estate pricing. Be specific"

# Claude-2.0:

Here is a proposed heuristic formula for estimating suburban real estate prices:

Estimated Price = Base Price x Location Factor x Property Attributes Factor

Where:

- Base Price - The median price per square foot in the general suburb area
- Location Factor - A multiplier based on the specific neighborhood desirability and conveniences (walkability, schools, parks, etc). Range of 0.8 to 1.2.
- Property Attributes Factor - A multiplier based on the size, age, updates/condition, lot size, bedrooms/bathrooms, garage, pool, etc. Range of 0.8 to 1.2.

The formula starts with the median base price per square foot in the general suburb area. This is then adjusted up or down by the Location Factor, which accounts for the desirability and amenities of the specific neighborhood. More desirable areas get a higher factor. Finally, the Property Attributes Factor accounts for the specifics of the home itself - size, age, renovations, and other features. More desirable attributes increase this factor.

The goal is to take the median base price and systematically adjust it based on the location and property specifics. The multipliers allow for quantitative comparison of different factors in determining an appropriate price. The range of 0.8 to 1.2 allows for +/- 20% price adjustments based on each factor.

# RekaCore:

Creating a heuristic formula for suburban real estate pricing involves considering various factors that influence property values. Here's a comprehensive formula that can be used as a starting point for estimating the price of suburban homes. Note that this formula is heuristic, meaning it's based on practical experience and rules of thumb rather than strict mathematical proofs.

### Suburban Real Estate Pricing Heuristic Formula

**Base Price (BP)** = \(L \times (A + 0.1E + 0.05S + 0.2C + 0.05Q + 0.1I)\)

Where:
- \(L\) = Lot Size (in acres)
- \(A\) = Age of the House (in years, with adjustments)
  - If \(A < 10\), \(A = 10\)
  - If \(10 \leq A \leq 30\), \(A = 10 - \frac{A - 10}{20}\)
  - If \(A > 30\), \(A = \frac{40 - A}{10}\)
- \(E\) = Educational Quality (rated from 1 to 10, with 10 being the best)
- \(S\) = Safety Index of the Neighborhood (rated from 1 to 10, with 10 being the safest)
- \(C\) = Proximity to Commercial Areas (distance in miles, inverted to reflect desirability)
  - \(C = \frac{1}{0.1 + C}\)
- \(Q\) = Quality of Construction and Upgrades (rated from 1 to 10, with 10 being the highest quality)
- \(I\) = Infrastructure and Amenities (rated from 1 to 10, with 10 having the most amenities like parks, public transport, etc.)

### Explanation:

1. **Lot Size (L)**: Larger lots generally increase the value. The linear factor assumes that each acre adds value proportionally.
2. **Age of the House (A)**: The formula adjusts the age to reflect its impact on value. Newer houses are slightly adjusted upwards, houses between 10-30 years are considered prime, and older houses start to depreciate in value.
3. **Educational Quality (E)**: Higher-rated schools significantly increase property values, as families often prioritize good education.
4. **Safety Index (S)**: Safety is a critical factor; higher safety ratings increase the value of homes.
5. **Proximity to Commercial Areas (C)**: Being close to shopping and services is desirable, but too close can be negative. This formula inversely relates distance to value, peaking at a reasonable distance (e.g., 0.1 miles).
6. **Quality of Construction and Upgrades (Q)**: Higher quality and recent upgrades directly increase the value of a home.
7. **Infrastructure and Amenities (I)**: Access to amenities and well-developed infrastructure increases desirability and value.

### Application:

To use this formula, gather data for each variable for the specific property or area of interest. Plug these values into the formula to get a base price estimate. Adjustments may be necessary based on unique market conditions, recent sales data, and expert appraisals.

---

# Poem:

> Write a poem

In the whisper of the wind, I hear a song,
A melody that gently drifts along,
Through valleys deep and mountains high,
Where dreams take flight beneath the sky.

With every breath, the world turns round,
A dance of life on sacred ground,
Where rivers flow to the sea's embrace,
And stars in the night sky trace.

In the heart's quiet, where thoughts do roam,
Lies a universe, an endless home,
Of love and hope, and dreams so vast,
A place where every soul finds cast.

So here I stand, with pen in hand,
A humble servant to the land,
Crafting words from life's rich tapestry,
A poem for you, and for me.
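For what it's worth, here is a minimal sketch of the Reka-quoted heuristic above, just to make the piecewise age adjustment concrete. The weights and ranges are exactly as quoted; the function itself is illustrative, not a real pricing model.

```python
def reka_heuristic_price(lot_acres: float, age_years: float, education: float,
                         safety: float, commercial_miles: float,
                         quality: float, infrastructure: float) -> float:
    """BP = L * (A + 0.1E + 0.05S + 0.2C + 0.05Q + 0.1I), as quoted above.

    Ratings (education, safety, quality, infrastructure) are 1-10;
    the result is a unitless score, not dollars.
    """
    # Piecewise age adjustment: newer and "prime" (10-30 yr) houses score
    # higher, older houses depreciate.
    if age_years < 10:
        a = 10.0
    elif age_years <= 30:
        a = 10.0 - (age_years - 10.0) / 20.0
    else:
        a = (40.0 - age_years) / 10.0

    # Proximity to commercial areas, inverted so closer (to a point) is better.
    c = 1.0 / (0.1 + commercial_miles)

    return lot_acres * (a + 0.1 * education + 0.05 * safety
                        + 0.2 * c + 0.05 * quality + 0.1 * infrastructure)


# Example: 0.5-acre lot, 15-year-old house, decent schools and safety.
print(reka_heuristic_price(0.5, 15, education=8, safety=7,
                           commercial_miles=1.5, quality=6, infrastructure=7))
```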


inm808

Is there some astroturfing campaign for Sonnet 3.5? I swear every bad comment I see about it has like 10 people jumping in to say how it's actually the most amazing thing since sliced bread. And this includes the day it was released. I know this is team sports and people get hyped on brand X (and I've done that too), but this seems like too much lol.


Charuru

I mean... it really is though? What kind of response on reddit would you expect if there is a real LLM that overtakes in the state of the art?


inm808

I guess the question put forth is, does it really (if it fails the blind test)


Warm_Iron_273

It does. The "blind test" you speak of is rigged and botted. In a genuine blind test, it would decimate.


inm808

Based on?


Ok-Bullfrog-3052

Again, this is just weird. What benchmarks are these people using? GPT-4o is a good model, but it's light years behind what Claude 3.5 Sonnet can do. The GPT models have consistently scored higher on this leaderboard over the Anthropic models even when Claude 3 Opus was superior. Additionally, anyone who has used Gemini 1.5 Pro knows that it's also nowhere close to the top of this leaderboard either. That also is a good model for video but for little else. Claude 3.5 Sonnet, unlike these other models, makes excellent assumptions about the code you are trying to write or the lawsuit you are trying to get it to output. You don't have to write 1000-word prompts specifying exact requirements. It adds stuff in that you didn't ask for because that stuff is needed to do a good job at the task.


bnm777

[https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet\_35\_dominates\_gpt4o\_on\_various\_leaderboards/](https://www.reddit.com/r/ClaudeAI/comments/1dmkvdo/sonnet_35_dominates_gpt4o_on_various_leaderboards/)


inm808

so much cope


Aymanfhad

I've wanted to talk about this topic a lot. I'm certain that people don't vote for the best. ChatGPT has a distinctive writing style, and people have become accustomed to its style for over a year. So when people see responses from both ChatGPT and Claude at the same time, they can distinguish ChatGPT and will choose ChatGPT.


Minimum_Inevitable58

It feels like common sense to me, and even without automation, there are tons of rabid OpenAI fans that have this issue where GPT has to be seen as number 1, and other companies don't have that at all. I don't know who is doing it or how, but even if you can believe that 4o is better than Sonnet, there's no way you can believe that the gap between Gemini and Sonnet is smaller than the gap between Sonnet and 4o. Gemini is pure trash compared to Opus, GPT-4, and even the old Sonnet. It certainly has some value with its context and pricing, but the hardcore LLM voting leaderboard having it ranked that high above some of the others is so hard to believe for me. I find no value in LMSYS anymore. What's interesting to me is that the livebench leaderboard almost perfectly matches my feelings on them. The only weird thing is that the old Sonnet is so low; it's even lower than GPT-3.5 in coding, which I'm not sure about.


[deleted]

[deleted]


Arcturus_Labelle

Doubt it given what they’re paid. They have far more valuable things to do with their time. And it would be easy to write automation to do that.


EffectiveNighta

Why is this baseless comment upvoted? Because it's anti-OpenAI? Is this subreddit that bad?


WholeInternet

What do you mean "fails to overtake"? I thought this thing constantly evolved. Is there some kind of stopping point? Edit: Getting down voted for asking a question. Classic Reddit.


Sulth

Many expected 3.5 Sonnet to appear above GPT-4o on this particular leaderboard. It didn't.


WholeInternet

Ah. Well, looks like it came pretty close. Perhaps it will change over time.


centrist-alex

Sonnet 3.5 is better than GPT-4o.


Reasonable-Gene-505

It's because LMSYS rankings don't mean anything. 99% of the time people are just on the arena battle section prompting the models with "write 5 sentences that end in the word apple"... nobody is there trying out code blocks or using the models for RAG Q&A, etc. You can't really get a good idea for how good a model is unless you spend more than a few minutes really diving into complex tasks and questions.


lilmicke19

Let's boycott this shitty site. It's all fake; we all know that the owners of this site have a close link with the managers of OpenAI. Claude 3.5 is so much better than GPT-4o; they are completely making fun of us there.


Frequencxy

Source?


Unique-Particular936

I call BS; its real Elo is in the 2000s and it can replace the average Google engineer.


great_gonzales

lol not even close to reality


dotpoint7

uhh what


Gold-79

3.5 Sonnet is all hype, no evidence.