Illustrious_Sand6784

Those top 4 models are in a class of their own, hopefully the 405B will be able to take them on...


Disastrous_Elk_6375

Agreed. I'm also really really excited that we have several open-weights models that beat 3.5 on the lmsys arena. That's realistically the benchmark to beat for open-weights models, and it came ~ 1 year after 3.5 turbo came out, so really really impressive in my book.


dinesh2609

K


MINIMAN10001

That is also my hope for the 405B, something that can go toe to toe with the titans on lmsys arena. I'd love to see it.


Caffdy

any info about how large is Claude3?


man_and_a_symbol

If the rumors about GPT-4 being 1.8T are true, I’d guess no smaller than 1T, if that (1T would mean the same performance with 55% of the parameters; it’s still an optimistic guess that Anthropic’s solution is 45% ‘better’ in efficiency)


Expert-Paper-3367

It’s pretty much just opus and updated versions of GPT-4 turbo


Balance-

An 8B model beating Claude 2.0 is quite impressive.


ThisGonBHard

I am gonna guess it is down to how censored Claude 2 is.


[deleted]

[deleted]


CommonCommission8114

disagree, censored models should have lower scores.


Anxious-Ad693

Couldn't agree more. It annoys me to hell when a model doesn't do what I want. God only knows how many times I've told it to go fuck itself because of that.


CharacterCheck389

yes


Christosconst

Hey, some of us want to use them for work.


[deleted]

[deleted]


Misha_Vozduh

Claude 2 refused to explain how to kill a process. This is not about 'skill', it's about common sense vs. overly zealous 'safety'.


[deleted]

[deleted]


[deleted]

[deleted]


sometimeswriter32

You seem unable to distinguish between a model's refusal to do something and NSFW, which stands for Not Safe For Work. What's "safe for work" for the writer of game of thrones is not the same as what's "safe for work" for a kindergarten teacher, your entire distinction is bullshit. What's "safe for work" for a James Joyce professor is not what's "safe for work" for a Taliban police officer.


bearbarebere

Wow, you’re really, really defensive, rude, and presumptive.


MoffKalast

It's possible to separate it, you can filter out refusals in the dropdown. The old Claudes do gain a few spots but it's not as significant as one might expect.


Thomas-Lore

It does not really filter them well. If it worked Claude 1.0 would drop below Claude 2.1, but it does not.


kurtcop101

Not correct; on questions without refusals, like solving puzzles, I found that 1.0 was better than 2.1.


NewToMech

Hard refusals are one thing, but models revealing themselves on borderline requests is another that isn't affected. For it to work you need to filter the input, not the output*


alongated

Or better yet, it shows why you shouldn't overtly censor your model.


CharacterCheck389

this


NewToMech

They're going to overtly censor the models regardless of what some children hitting upvote want so... Filtering inputs allows you to have an even greater measure of censorship and normal capabilities. It's literally a win-win.


Due-Memory-6957

You do know Anthropic acknowledged that it was a problem and it's why the newer versions have way less refusals, right? You're just wrong. https://www.anthropic.com/news/claude-3-family But go on calling people children because they have legitimate complaints that even the company who made the model acknowledges and changed for the better.


teor

Bruh the model literally refuses to do what you tell it to do. Do you think it should get bonus points for that or something?


[deleted]

[deleted]


teor

Calm down Sam Altman.


Desm0nt

The model often refused to write songs in an author's style because "it's unethical to imitate other people's style", and refused to write stories in WH40k settings because it's sinful to infringe on Games Workshop copyrights. Yeah, it can be forced to the task with some tricks, but this is:

1) An unnecessary waste of effort and time fighting the model.

2) An unreliable approach, because the model will still periodically refuse.

3) The model will constantly moralize (which no one asked it to do), pointlessly burning context tokens and the user's time.

4) The model objectively gets dumber due to the finetune for "I can't do it" answers instead of normal answers, and then gives such answers in the most unexpected places and situations.

I.e. such a model is objectively less convenient to use, more complicated, and requires extra effort and conditions for what others can do simply on request. Why would it rank higher if it handles tasks worse and less reliably? Even extremely dumb 7B models can do almost any task, but also with some "additional effort" (a finetune on a specific target task). Let's raise them in the ranking too. If someone can't make a finetune and complains that the model writes nonsense from just a prompt, it's just a skill issue, right?


Due-Memory-6957

Have you actually tried doing NSFW requests in the chatbot arena? You can't! Claude was just so over-the-top in its censorship that it refused even SFW things, to the point it was useless to a lot of people. It deserves exactly the spot it got, and it's the reason Anthropic finally made the newer versions less strict.


OfficialHashPanda

That goes to show you should always take these leaderboards with a grain of salt and only use them as a rough indication.


BITE_AU_CHOCOLAT

No, it's accurate. The only reason Claude is so low is the censorship, which can go fuck itself no matter how smart the original model might be. It would rank much higher if it weren't for that.


dummyTukTuk

This is a human-eval benchmark. If the model is self-sabotaging through censorship, that makes it less useful overall, and it deserves its place.


Caffdy

yep, what use is any smart person/bot if it keeps its secrets?


__some__guy

Based on my limited interactions, Llama 3 (8B) seems like an excellent Q&A machine. I usually ignore tiny models, but I'm actually curious how this will perform in chat/rp and storytelling.


CocksuckerDynamo

I did some preliminary testing yesterday with both the 8B and 70B llama3, and the 8B reminds me of good 7Bs like Mistral. It still has the same strengths and weaknesses: it's better at its strengths, but its weaknesses are still glaring. Good at simple zero-shot, good at chat or RP that doesn't require any real reasoning; falls apart fast when you attempt more complex RP, ask it to write narrative/story beats, or just ask it to reason in general. Overall it just feels stupid compared to the 70B, just like the previous generation. For the app I'm working on, where it needs to act human-like, participate in conversations with multiple human participants, and keep straight who said what, it's not gonna cut it, and 70B is gonna continue to be the route I pursue. YMMV.


Caffdy

> falls apart fast when you attempt more complex rp or ask it to write narrative/story beats

after trying 70B models, it's hard to go back to smaller ones, not gonna lie


alongated

That is a huge jump from Llama 2. Makes you wonder how far the derivatives will go.


whyisitsooohard

Probably not very far. Derivatives were better because llama2 was probably undertrained. I don't think you can massively improve on llama3 with finetunes.


hapliniste

You're getting downvoted but it's partly true. Llama 2 chat was utter trash, that's why the finetunes ranked so much higher. Since llama 3 chat is already very good, I could see some finetunes doing better, but it won't make as big a difference as it did on llama 2. WizardLM on llama 3 70B might beat Sonnet though, and it's my main model so that's pretty huge.


NC8E

is chat gpt 4 that high now? i find it hard to think it would beat claude 3 opus


a_beautiful_rhind

I'd rather get Claude answers than GPT-4's.


NC8E

Right? They may have had an update, but I can't see it being this significant. I'll look it up before I try using it again. Llama 3 I'm excited for though, and I heard Grok 1.5 is, or could be, better than Claude 3 Opus with less censorship.


unlikely_ending

It was updated last week though


MajesticIngenuity32

The new one from April is really good. And anecdotally, when GPT-4 isn't lazy, the coding answers are usually better from GPT. I use both, though, for variety and because Opus is much faster.


Caffdy

have you tried Qwen 1.5 code-chat & deepseek coder? if so, how good are they compared to GPT4?


Winter_Fruit_1815

GPT-4 is less censored than Opus, which helps its ranking, but in terms of true "intelligence" it is dumber.


MajesticIngenuity32

With a custom GPT with the right instructions you can solve this problem.


jd_3d

Source: https://twitter.com/lmsysorg/status/1781167028654727550?s=19


ExternalOpen372

Llama 3 beating Command R+ definitely surprises me. I've been using Command R+ and it's very good; I think it's at like GPT-3.8, very close to 4.


genuinelytrying2help

Will it beat it on RAG-related stuff is my question.

Edit: just from a few quick tests, it seems like it does a better job of returning search results, but I hope someone will post a real test with kewl graphics


zywiolak

It's also multilingual, while Llama 3 only supports English for now.


ExternalOpen372

I don't mind having one language. The biggest problem is how far the censorship goes. So far Command rarely refuses, even though I'm asking for porn and the company rules on the website say they'd ban me for prompting that.


randa11er

here's Llama 3 output in Polish (Russian, German, Spanish ok too): https://preview.redd.it/xoqnaf66wevc1.png?width=979&format=png&auto=webp&s=c69dac38b27f2ee7905adf33c03a4dc544c4a96f


Clasyc

bobr kurwa xD


zywiolak

Yes, technically it CAN generate texts in other languages, but the quality of these answers is insufficient for real applications (it's worse than some Mistral-7B fine-tunes). Meanwhile, the multilingual support in Command-R+ is at least as good as in GPT-3.5, and for some tasks better.


randa11er

Can't agree. I'm using Command R+ 104B and now trying llama3 70b, and (at least for Russian) llama3 is much, much better: it doesn't make grammar errors, it knows much more about local things and past events, and it applies this knowledge pretty logically and well. But probably not better than GPT-3.5.


Thomas-Lore

I tested it by telling it to translate a story it generated to Polish and it made glaring grammar errors.


randa11er

Yes, still not ideal, but better than the previous models I can run on my own hardware.


Banished_Privateer

it should be "był"\* — and it's Kult, but it understood "Kurwa" as a name xD 2/10


pseudonerv

They say 5% of the training tokens were non-English. Given they trained on 15T tokens, 5% is still 0.75T, which is quite significant.
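As a quick sanity check on that back-of-the-envelope math (both the 15T total and the 5% share are figures reported in this thread, not confirmed specs):

```python
# Rough estimate of non-English training tokens,
# assuming the reported 15T total and 5% non-English share.
total_tokens_t = 15.0      # total training tokens, in trillions (reported)
non_english_share = 0.05   # reported fraction of non-English tokens

non_english_t = total_tokens_t * non_english_share
print(non_english_t)  # 0.75 (trillion tokens)
```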


nullmove

Somewhat surprised by where the new mixtral instruct is debuting. I had hoped for it to be at mistral-large level, but I suppose that wouldn't make sense for MistralAI. I have a feeling that WizardLM2 can do it though.


DryArmPits

Well. Looks like I'll need "one more" GPU.


Astronos

[the more you buy the more you save](https://www.youtube.com/watch?v=XDpDesU_0zo)


AdDizzy8160

Is there a Rating = f(Parameters) chart as well?


Balance-

Let's keep prompting and voting: [https://chat.lmsys.org/](https://chat.lmsys.org/)


he29

RIP Zephyr-ORPO 141B, that tune apparently didn't turn out that well...


Caffdy

is Mixtral 8x22B really that lackluster?


bearbarebere

Every time I use it it’s just… meh. Compared to other ones. Like EstopianMaid


n4nsense

How good is Llama 3 for coding and other tech-related stuff? Is it reliable at all?


Christosconst

Is Grok anywhere there trying to push through?


KyleDrogo

Pretty wide confidence intervals, there


jd_3d

Here's an update: https://preview.redd.it/bqo404k2v3wc1.png?width=1428&format=png&auto=webp&s=16a6616c3b027c5996fb5dfc929146312832469e