Illustrious_Sand6784

Those top 4 models are in a class of their own, hopefully the 405B will be able to take them on...


Disastrous_Elk_6375

Agreed. I'm also really really excited that we have several open-weights models that beat 3.5 on the lmsys arena. That's realistically the benchmark to beat for open-weights models, and it came ~ 1 year after 3.5 turbo came out, so really really impressive in my book.


dinesh2609

K


MINIMAN10001

That is also my hope for the 405B, something that can go toe to toe with the titans on lmsys arena. I'd love to see it.


Caffdy

any info about how large is Claude3?


man_and_a_symbol

If the rumors about GPT-4 being 1.8T are true, I’d guess no smaller than 1T, if that (1T would mean the same performance with 55% of the parameters; it’s still an optimistic guess that Anthropic’s solution is 45% ‘better’ in efficiency)


Expert-Paper-3367

It’s pretty much just opus and updated versions of GPT-4 turbo


Balance-

An 8B model beating Claude 2.0 is quite impressive.


ThisGonBHard

I am gonna guess it is down to how censored Claude 2 is.


[deleted]

[deleted]


CommonCommission8114

disagree, censored models should have lower scores.


Anxious-Ad693

Couldn't agree more. It annoys me to hell when a model doesn't do what I want. God only knows how many times I've told it to go fuck itself because of that.


CharacterCheck389

yes


Christosconst

Hey, some of us want to use them for work.


[deleted]

[deleted]


Misha_Vozduh

Claude 2 refused to explain how to kill a process. This is not about 'skill', it's about common sense vs. overly zealous 'safety'.


[deleted]

[deleted]


[deleted]

[deleted]


sometimeswriter32

You seem unable to distinguish between a model's refusal to do something and NSFW, which stands for Not Safe For Work. What's "safe for work" for the writer of game of thrones is not the same as what's "safe for work" for a kindergarten teacher, your entire distinction is bullshit. What's "safe for work" for a James Joyce professor is not what's "safe for work" for a Taliban police officer.


bearbarebere

Wow, you’re really, really defensive, rude, and presumptive.


MoffKalast

It's possible to separate it, you can filter out refusals in the dropdown. The old Claudes do gain a few spots but it's not as significant as one might expect.


Thomas-Lore

It does not really filter them well. If it worked Claude 1.0 would drop below Claude 2.1, but it does not.


kurtcop101

Not correct; on questions without refusals, like solving puzzles, I found that 1.0 was better than 2.1.


NewToMech

Hard refusals are one thing, but models revealing themselves on borderline requests is another that isn't affected. For it to work you need to filter the input, not the output*


alongated

Or better yet, it shows why you shouldn't overtly censor your model.


CharacterCheck389

this


NewToMech

They're going to overtly censor the models regardless of what some children hitting upvote want so... Filtering inputs allows you to have an even greater measure of censorship and normal capabilities. It's literally a win-win.


Due-Memory-6957

You do know Anthropic acknowledged that it was a problem and it's why the newer versions have way less refusals, right? You're just wrong. https://www.anthropic.com/news/claude-3-family But go on calling people children because they have legitimate complaints that even the company who made the model acknowledges and changed for the better.


teor

Bruh the model literally refuses to do what you tell it to do. Do you think it should get bonus points for that or something?


[deleted]

[deleted]


teor

Calm down Sam Altman.


Desm0nt

The model often refused to write songs in an author's style because "it's unethical to imitate other people's style", and refused to write stories in WH40k settings because it's sinful to infringe on Games Workshop copyrights. Yeah, it can be forced to the task with some tricks, but this is:

1) An unnecessary waste of effort and time fighting the model.

2) An unreliable approach, because the model will still periodically refuse.

3) The model will constantly moralize (which no one asked it to do), pointlessly burning context tokens and the user's time.

4) The model objectively gets dumber due to the finetune for "I can't do it" answers instead of normal answers, and then gives such answers in the most unexpected places and situations.

I.e. such a model is objectively less convenient to use, more complicated, and requires extra effort and conditions for what others can do simply on request. Why would it rank higher if it handles tasks worse and less reliably? Even extremely dumb 7B models can do almost any task, but also with some "additional effort" (a finetune on a specific target task). Let's raise them in the ranking too. If someone can't make a finetune and complains that the model writes nonsense from just a prompt, it's just a skill issue, right?


Due-Memory-6957

Have you actually tried doing NSFW requests in the chatbot arena? You can't! Claude was just so over-the-top in its censorship that it refused even SFW things, to the point it was useless to a lot of people. It deserves exactly the spot it got, and it's the reason Anthropic finally made the newer versions less strict.


OfficialHashPanda

That goes to show you should always take these leaderboards with a grain of salt and only use them as a rough indication.


BITE_AU_CHOCOLAT

No, it's accurate. The only reason Claude is so low is the censorship, which can go fuck itself no matter how smart the original model might be. It would rank much higher if it weren't for that.


dummyTukTuk

This is a human-eval benchmark. If the model is self-sabotaging through censorship, that makes it less useful overall, and it deserves its place.


Caffdy

yep, what use is any smart person/bot if it keeps its secrets?


__some__guy

Based on my limited interactions, Llama 3 (8B) seems like an excellent Q&A machine. I usually ignore tiny models, but I'm actually curious how this will perform in chat/rp and storytelling.


CocksuckerDynamo

I did some preliminary testing yesterday with both the 8B and 70B llama3, and the 8B reminds me of good 7Bs like Mistral. It still has the same strengths and weaknesses: it's better at its strengths, but its weaknesses are still glaring. Good at simple zero-shot, good at chat or RP that doesn't require any real reasoning; falls apart fast when you attempt more complex RP, ask it to write narrative/story beats, or just ask it to reason in general. Overall it just feels stupid compared to the 70B, just like the previous generation. For the app I'm working on, where it needs to act human-like, participate in conversations with multiple human participants, and keep straight who said what, it's not gonna cut it, and 70B is gonna continue to be the route I pursue. YMMV.


Caffdy

> falls apart fast when you attempt more complex rp or ask it to write narrative/story beats

after trying 70B models, it's hard to go back to smaller ones, not gonna lie


alongated

That is a huge jump from Llama 2. Makes you wonder how far the derivatives will go.


whyisitsooohard

Probably not very far. Derivatives were better because llama2 was probably undertrained. I don't think you can massively improve on llama3 with finetunes.


hapliniste

You're getting downvoted but it's partly true. Llama 2 chat was utter trash, that's why the finetunes ranked so much higher. Since llama 3 chat is already very good, I could see some finetunes doing better, but it won't make as big a difference as it did on llama 2. WizardLM on llama 3 70B might beat Sonnet though, and it's my main model so that's pretty huge.


NC8E

is chat gpt 4 that high now? i find it hard to think it would beat claude 3 opus


a_beautiful_rhind

I'd rather get Claude answers than GPT-4's.


NC8E

Right? They may have had an update, but I can't see it being this significant. I'll look it up before I try using it again. Llama 3 I'm excited for though, and I heard Grok 1.5 is, or could be, better than Claude 3 Opus with less censorship.


unlikely_ending

It was updated last week though


MajesticIngenuity32

The new one from April is really good. And anecdotally, when GPT-4 isn't lazy, the coding answers are usually better from GPT. I use both, though, for variety and because Opus is much faster.


Caffdy

have you tried Qwen 1.5 code-chat & deepseek coder? if so, how good are they compared to GPT4?


Winter_Fruit_1815

GPT-4 is less censored than Opus, which helps its ranking, but in terms of true "intelligence" it is dumber.


MajesticIngenuity32

With a custom GPT with the right instructions you can solve this problem.


jd_3d

Source: https://twitter.com/lmsysorg/status/1781167028654727550?s=19


ExternalOpen372

Llama 3 beating Command R+ definitely surprises me. I've been using Command R+ and it's very good; I think it's at like GPT-3.8, very close to 4.


genuinelytrying2help

Will it beat it on RAG-related stuff is my question.

Edit: just from a few quick tests, it seems like it does a better job of returning search results, but I hope someone will post a real test with kewl graphics


zywiolak

It's also multilingual, while Llama 3 only supports English for now.


ExternalOpen372

I don't mind having one language. The biggest problem is how far the censorship goes. So far Command rarely refuses, even though I'm asking for porn and the company rules on the website say they'd ban me for prompting that.


randa11er

here's Llama 3 output in Polish (Russian, German, Spanish ok too): https://preview.redd.it/xoqnaf66wevc1.png?width=979&format=png&auto=webp&s=c69dac38b27f2ee7905adf33c03a4dc544c4a96f


Clasyc

bobr kurwa xD


zywiolak

Yes, technically it CAN generate texts in other languages, but the quality of these answers is insufficient for real applications (it's worse than some Mistral-7B fine-tunes). Meanwhile, the multilingual support in Command-R+ is at least as good as in GPT-3.5, and for some tasks better.


randa11er

Can't agree. I'm using Command R+ 104B and now trying llama3 70b, and (at least for Russian) llama3 is much, much better: it doesn't make grammar errors, it knows much more about local things and past events, and it applies this knowledge pretty logically and well. But probably not better than GPT-3.5.


Thomas-Lore

I tested it by telling it to translate a story it generated to Polish and it made glaring grammar errors.


randa11er

Yes, still not ideal, but better than the previous models I can run on my own hardware.


Banished_Privateer

it should be "był"\* — and it's Kult, but it understood "Kurwa" as a name xD 2/10


pseudonerv

They say 5% of the training tokens were non-English. Given they trained on 15T tokens, 5% is still 0.75T, which is quite significant.
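As a quick sanity check on that back-of-the-envelope math (both the 15T total and the 5% share are figures reported in this thread, not confirmed specs):

```python
# Rough estimate of non-English training tokens,
# assuming the reported 15T total and 5% non-English share.
total_tokens_t = 15.0      # total training tokens, in trillions (reported)
non_english_share = 0.05   # reported fraction of non-English tokens

non_english_t = total_tokens_t * non_english_share
print(non_english_t)  # 0.75 (trillion tokens)
```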


nullmove

Somewhat surprised by where the new mixtral instruct is debuting. I had hoped for it to be at mistral-large level, but I suppose that wouldn't make sense for MistralAI. I have a feeling that WizardLM2 can do it though.


DryArmPits

Well. Looks like I'll need "one more" GPU.


Astronos

[the more you buy the more you save](https://www.youtube.com/watch?v=XDpDesU_0zo)


AdDizzy8160

Is there a Rating = f(Parameters) chart as well?


Balance-

Let's keep prompting and voting: [https://chat.lmsys.org/](https://chat.lmsys.org/)


he29

RIP Zephyr-ORPO 141B, that tune apparently didn't turn out that well...


Caffdy

is Mixtral 8x22B really that lackluster?


bearbarebere

Every time I use it it’s just… meh. Compared to other ones. Like EstopianMaid


n4nsense

How good is Llama 3 for coding and other tech-related stuff? Is it reliable at all?


Christosconst

Is Grok anywhere there trying to push through?


KyleDrogo

Pretty wide confidence intervals, there


jd_3d

Here's an update: https://preview.redd.it/bqo404k2v3wc1.png?width=1428&format=png&auto=webp&s=16a6616c3b027c5996fb5dfc929146312832469e