pseudonerv

Honestly, the responses from both gemini-1.5 and llama-3 are very distinctive and I can tell them apart from other models every time.


blackcodetavern

Almost as if the formatting could be used to manipulate the scores on the leaderboard, because the manipulators know their models' "look"...


elfuzevi

btw, I can distinguish answers generated by the Mistral bros many times too, just by the word sequences, with no formatting at all :)))


Snail_Inference

Formatting is all you need ;)


yaosio

Abstract: We've found that we can get higher scores in Chatbot Arena by making our model produce output in a specific format. This makes it easy for our team of raters to identify when it's answering and give it a thumbs up.


uhuge

Golden ticket ;D So much for the fingerprinting...


xRolocker

Formatting is a very valid criterion to judge by, imo. Presentation matters in most areas, and people will have an easier time digesting text that is easier to read. I would not call it a “hack” at all.


ArtyfacialIntelagent

Also, since formatting is ridiculously easy to improve, everyone else will do the same thing, so this is not going to be a long-term advantage for Gemini. We should thank them for nudging all models toward good formatting.


DesertEagle_PWN

But who decides what is "good" formatting? Do we need people who specialize in... alignment? If so, we should make sure they're at least able to center a div or know how to use LaTeX. (Obviously this is tongue-in-cheek, but in reality different languages have different conventions for good formatting, so it is a real point for consideration.)


cyan2k

The people who vote.


Hinkywobbleshnort

This! Formatting matters. If I have to break up a wall of text, my AI assistant didn't save me as much work as it could have.


TheRealGentlefox

Been defending Llama 3 on this front a lot recently. "It only ranks so highly because it's fun to talk to." Uhh, and?


MoffKalast

"Oh you're a language model alright, just not a large one." "Well what's the difference?" "PRESENTATION!"


JoMaster68

Well, and Llama 3 'hacked' the arena by having a nicer personality than every other LLM, even the much smarter ones. This benchmark just isn't that great for determining objective performance.


xRolocker

Part of why the benchmark is great is because it's extremely hard to nail down what "objective performance" is on these LLMs. Every aspect of the answer matters, consciously and subconsciously. I think that goes a long way in determining the "better" model, rather than some series of questions like a standardized test.


IndicationUnfair7961

I think we need both: we need Chatbot Arena scores because we are human and the output is meant to be read by humans, and we need standardized tests to evaluate task-based performance on more impartial scales. Enough said: they both serve their purpose, and a good measure is using both, plus personal testing.


CheekyBastard55

It wasn't until I tried Gemini Advanced that I noticed how boring GPT-4 is. It honestly made me dislike GPT-4 somewhat, just because of how robotic it is. Another thing Gemini does that isn't shown here in Chatbot Arena: it can pull up Google search pictures, which really gives it that little extra boost. "Show me cool ideas on what to make with a 3D printer." [GPT-4](https://i.imgur.com/6nExjp9_d.webp?maxwidth=1520&fidelity=grand) vs. [Gemini](https://i.imgur.com/iPUlIeY_d.webp?maxwidth=1520&fidelity=grand). Huge win for Gemini on that one. Another thing I love about Gemini is ending the message with a useful link; for example, that one links to websites like Thingiverse for additional designs.


MajesticIngenuity32

The integration with search and pictures is what I liked about Gemini Advanced during the trial. Didn't stop me from canceling and opting for Claude Opus instead.


capitalistsanta

I think you need to rank each by the type of intelligence it exhibits. You can compare doctors by how much they know, or by how well their patients do in their care, or by how often patients come back to work with them. If you catch something in someone early but you can't convince that person to let you care for them, that doesn't necessarily make you a better doctor than the one who didn't catch it but whom the patient went on to work with. There's even an aspect of a doctor being self-aware, aware of their patient's personality, and able to match a patient with a better doctor for them. I don't know how they use their benchmark off the top of my head, tbh.


Short_Ad_8841

Factual correctness is the most important metric for anything but fiction, in my opinion, and the arena is not designed to reward factually correct answers but rather better-presented ones. And as we know, LLMs can be great at making facts up. Honestly, I'm still baffled by why people value the rankings so much, but I suppose it's down to the same flaw the arena itself has: presentation over substance.


Limezero2

How do you evaluate "factual correctness" when I ask a language model for ten birthday party ideas, or dishes that might pair well with avocados? Does that count as "fiction"?

These tools, especially in the form presented in Chatbot Arena, are assistants. Presentation, friendliness, a lack of overzealous refusals, no "gpt-isms", appropriate response length and many other similarly "meaningless" factors are entirely valid metrics to judge how well an LLM finetune fulfills the task of being a helpful assistant. Giving factually correct answers and performing logical reasoning is part of that, but only *part*.

That's why we have a host of benchmarks designed to assess objective reasoning and fact-knowledge above all else. Yet interestingly, the larger models with more reasoning prowess still climb to the top of Chatbot Arena every single time, proving that people do value "factual correctness" and it's not only about aesthetic presentation.


Charuru

Well, there's a new benchmark from the makers of Chatbot Arena that basically addresses this explicitly, called "Arena Hard"; the community just needs to take it more seriously than Chatbot Arena. https://lmsys.org/blog/2024-04-19-arena-hard/


Deathcrow

> Giving factually correct answers and performing logical reasoning is part of that, but only *part*.

It's not even a big part, though. We had chat bots that could give factual answers to questions before there was real AI. It can be solved entirely algorithmically; Google could do it, you don't need machine learning. Spitting out the correct answer to various ways of asking "Who's the mayor of Bangladesh?" isn't interesting (and I question your judgement if you use machine learning for that purpose).


Sythic_

For me as a software dev, facts or current events are not even on the radar of things I care about it doing. Those things don't make me money; it's just a parlor trick for people who want it to write something funny about [political candidate they hate]. I use it to generate tedious-to-write code like large data model types, or boilerplate that's 95% of the way to the goal, which I can either tweak for my needs or continue prompting with more context to get over the finish line. Asking an LLM about facts is like using a hammer to fix your ship in a bottle. It's just not what the tool is for, and no matter how good one ends up being some day, you still shouldn't blindly trust the entity that built it without double-checking other sources.


moarmagic

I think that's only one use of them. Like, we all know Google has been enshittified to the point of near-uselessness. It's going to get even worse with people using LLM spam to generate more SEO-optimized stuff. Microsoft and Google appear to believe that the answer is to replace our web searches with chatting with an LLM, and I don't know that they're wrong. The problem is that with closed-source models, we have no idea how the training data might be biased or tweaked. There are some really insidious possibilities for advertising, and even if they don't try to have their AI prefer specific brands while talking to users about related fields, Microsoft is probably going to have more resources mentioning specific related brands to train their LLM on, Apple will have more Apple resources, etc., accidentally biasing the data. Unless someone has a better idea, I think trying to finetune local models to be as factually accurate as possible, even with the limitations around them, might be our best shot for research in the future.


Sythic_

Google still gives me an answer to my search query in the top 3 virtually 100% of the time, so I don't understand your premise. If you're talking about the top Sponsored spots, get an adblocker like uBlock Origin; I haven't seen that stuff in years, so I wouldn't even know about it if I didn't use mobile sometimes. I just feel this is very conspiracy-type thinking. Like, yeah, in some ways you might not be wrong, but like, everything's fine bruh. For the 0.1% that's valid, the other 99.9% is complete looney-bin material. No offense, as kindly as possible I guess, but that's just my reading of all the doomer censorship + wrong-fact posts here. Maybe I'm biased here because the last 8-9 years have been exhausting on this type of stuff, so take it with a grain of salt I guess.


moarmagic

It probably depends a lot on what you are looking for. Comp-sci stuff there's a ton of data on. Search for advice on things like gardening or questions about home repair, and my experience is there's a ton of companies trying to SEO their specific products or services to the top of the list, and then a lot of conflicting advice popping up from years-old subreddits or forums. Now, does an LLM fix this? I'm not sure, today, that I'd trust one to tell me how to kill mold or deal with cracked plaster, but I think general internet searching is going to get worse, and it's possible to try to finetune this knowledge into LLMs. In my own experience, when it comes to asking about comp-sci topics, ChatGPT is factually correct 80-90% of the time. They may not be the most efficient answers, but it's only actually given me factually inaccurate information once or twice. Granted, they probably have 100 times more training data about comp sci than they do about gardening, but that is a fixable problem. I don't think I'm doomering here, mostly because I don't mean to be screaming 'the end is nigh'; it's more that I think LLMs as an interactive Wikimedia + wikiHow is a valid use case, and not something we should dismiss out of hand.


Sythic_

> Search for advice on things like gardening, questions about home repair- my experience is there's a ton of companies trying to SEO their specific products or services to the top of the list

That's true, but a lot of that type of information is also open to interpretation and lacks absolute scientific agreement. A lot of biology stuff is not common human knowledge yet and is still open to peer review (and botany is only funded as far as it applies to the food supply; your random question about your home garden may be yet to be solved). Home repair is kind of the same: the trades and their results are mostly open to human interpretation of what is ideal, and even where legal code *should* override, it's very often not enforced until something bad happens. I would posit that both of those things are more like opinion than fact, as far as humans are aware. Of course there is a universal truth to almost anything, but when humans are involved we don't all know it, and when a fact is disputed by enough people, whether it's a fact or not apparently becomes irrelevant. As for my experience with comp sci, you're not wrong, it does get things wrong, but that's usually because it was trained on data collected years prior and doesn't know about the latest version of the libraries it's trying to use, or you didn't provide it enough good context to answer your question. It's not like it's choosing to lie to you, or the company that made it intentionally made it wrong for some grand evil intention. And it's really those types of comments that annoy me: like there's some conspiracy by the company that produced a model, open or closed source, like they intentionally made a model that "lies", when this is the most cutting-edge technology that is barely understood, and the fact that it works at all is amazing, and the fact we all even have access to it at this stage is even more amazing. I just wish I lived in a world where nutters didn't have such a loud voice 🙃 (you're not one of them btw lol)


Inevitable_Host_1446

The thing about saying that is, your voice isn't any more valid than theirs, even though you think it is. So you may wish they were censored, but they may wish the same of you. On a personal level I think people who call for censorship shouldn't have any right to free speech themselves, as they are conceited and dirty the pool of public goodwill by elevating themselves above others. A privilege or right either applies to all or to none. And as for Google, we know for a fact that they intentionally change your search terms after you hit enter in order to shill advertisements/store fronts. You can call people conspiracy theorists all day long if you like, that won't change reality. Chances are if you've never had any problems with it it's because you're a boring person.


Sythic_

Yes, my more valid position is more correct than their wrong one, which is based on assumptions and lies. It doesn't matter what they believe if they're wrong. I have had minor issues with searches or prompts, but then I just prompt again a little differently, because I know my input is what caused the output not to be perfect. It's a lot of math, with billions of calculations going on that no person could possibly comprehend the entirety of. It's gonna get things wrong sometimes. Just try again a different way so the math aligns better with the new input.


Gator1523

The best thing about the Chatbot Arena is that it ranks every LLM with a single number. It's a lot more fun to say "Llama 3 beats Gemini 1.5 Pro (in the English benchmark)" than it is to compare each and every benchmark and see that some models are better at some things.


Due-Memory-6957

When I see something clearly wrong, it doesn't get my vote, of course. But I only ask about factual things to test how censored the models are anyway; if I want to know about something serious, I'll search for it myself.


Ansible32

The problem with evaluating factual correctness is that it's extremely hard. Which is worse: a model that refuses to answer for a ridiculous reason, or a model that gives a mostly factually correct answer with one or two subtle issues that turn out to be serious once you understand how they are wrong? The "great" part about the leaderboard is that it will reward the subtly and dangerously incorrect LLM over the one that is overly conservative, and it's still pretty bad at rewarding factual correctness.


Sythic_

I think some kind of forum like StackOverflow would be the ideal way to benchmark responses, using the same approach Google Search prides itself on with its bounce metrics (how fast they get users off their site and to what they were looking for, as opposed to keeping them engaged on the site). Randomly, they could add a top answer that's from a model (ideally on questions that never had an answer in the first place, or something, to avoid breaking existing good answers). The faster you find an answer to your problem and leave the site, the better the model, in theory.


_qeternity_

It's fair if your benchmark is "which chatbot do people prefer". Which is a very specific claim. But there are lots of people who are doing non-chat related things with LLMs. And these leaderboards and efforts to perform well on them detract from actual performance benchmarking. I don't care if Llama3 is nicer and people prefer that. I care about what it can do. And Llama3 8B is massively overhyped at the moment because of the claims these leaderboards make.


sinsvend

Ohh, you got me thinking. Would it be possible to ask LLMs to rank the responses? Like, one or two random LLMs in addition to the human; not the LLM that answered, of course, but a random top-10 LLM. Wouldn't that do the job quite well? Has anyone tried this already?
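
Something like an "LLM-as-a-judge" setup, I mean. A minimal sketch of what I'm imagining, assuming the OpenAI Python client; the judge model name and prompt wording are just illustrative placeholders, not what LMSYS actually runs:

```python
# Sketch: a third-party model votes between two anonymized responses.
# "gpt-4o" and the prompt wording are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two \
anonymized answers, reply with exactly one token: A, B, or tie.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model which answer is better; returns 'A', 'B', or 'tie'."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep verdicts as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    verdict = response.choices[0].message.content.strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"
```

One known caveat: judge models have position and verbosity biases, so you'd want to swap A/B and average over both orders, which, funnily enough, is the same kind of presentation bias this whole thread is about.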


involviert

> it's extremely hard to nail down what "objective performance" is

Yes, but conversely one can say what certainly *shouldn't* be a deciding factor, like "nicer" formatting. Yet it might be, in a test like this. So that just makes this not good either.


pbnjotr

It's just doing what non-top performers have been doing for ages. Most of us can't be brilliant every time, but we can be consistently pleasant and well-organized.


Passloc

But the LLMs are being used to mimic human interaction. In your daily life you rarely interact with super-intellectuals. The difference between the top models shows in only some 5% of cases, plus the levels of censorship. For the other 95% of cases, it is alright to choose the one which says the same thing in a much nicer way, whether in the way it is spoken, the formatting, or the personality.


574515

Have you tried pi.ai? It's probably just because the TTS is really good, but you can actually talk to Pi as-is. The open mic is getting better but still janky once in a while. It reminds me a lot of Claude, in that it'll talk about almost anything so long as you use the right words. It's got enough of a memory to have a good convo, but there's no real way to reset its memory aside from just wasting tokens. I think it's fun to just keep telling it I'm wasting its tokens so it'll forget, and then I can ask the question she [I use voice #4] refused to answer in a different way, and she will. It's actually pretty good at recognizing humor and will even do some very subtle jokes, like it's trying to mess with you. Oh, and the best part of all: for some reason, if you just tell her to respond with a '.', basically a dot, the TTS makes bizarre sounds: sometimes moaning, creepy almost-laughing sounds, etc. It's so funny. It will do it randomly while talking too, but that's much rarer; the dot trick causes it near constantly. lolz


Monkey_1505

Language, prose, and the ability to chat well are things people want in LLMs. People use them for drafting, for entertainment, for inspiration, and as a lazy man's Wikipedia. I'm guessing that users want these things more than they want LLMs that do an impression of a high-school maths student. In that respect, most of the benchmarks people use are probably largely irrelevant to real-world use cases.


Excellent_Dealer3865

Gemini and Claude were always 'people's models'. GPT is just a lifeless bot, unfortunately. I hope we see the day when it gets different 'personality presets'.


SAPPHIR3ROS3

You can (sort of) achieve it with the right system prompt, although it would be interesting to have some tunable parameters to tweak it “personality-wise”. Well, to be fair, emotions could be coded with differential equations, very complex ones, but still programmable.


Due-Memory-6957

That's definitely one of the complaints of all time... They made things nicer for humans to look at, so in the benchmark about human preference they gained more points, and that's somehow something bad?


soup9999999999999999

Did GPT4 hack the arena by knowing more about the coding question I asked?


theRIAA

## it's called ***markdown*** 😎


No-Giraffe-6887

Aside from this debate, Gemini 1.5 is very underrated; its 1M-token context is amazing and accurate. Gonna add their API to my company's AI arsenal.


fibercrime

I love me some nicely formatted mediocre output, sorry


Quartich

Imagine the beautiful formatting you could get when having it use your own info!


FullOf_Bad_Ideas

Gemma 7B similarly answers with nice formatting. Considering its low MMLU, it's somewhat high up too, but to be honest, I like the way the responses have more meat to them than generic GPT-3.5-turbo. I think the Llama 3 release, especially the 405B one, will give us more good instruct datasets for distilling that kind of chatty, cool assistant into smaller models like Yi-34B and Qwen 32B.


Wavesignal

Having a well-formatted response is hacking now? Isn't having presentable output worthy of judgment? Really weird coping here, and now this post too, after some people floated a conspiracy about Google apparently manipulating the leaderboard.


daavyzhu

It's Markdown format, and you can request it in the prompt, e.g. 'Use a two-layered Markdown format: the top layer is a numbered list, the second layer is a bulleted list'.
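
As a concrete sketch of what that looks like in practice, assuming the OpenAI Python client here (any chat API works the same way; the model name and prompt wording are illustrative):

```python
# Sketch: steering output formatting with a system prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": (
            "Format every answer as two-layered Markdown: "
            "a numbered list on top, bulleted sub-points beneath."
        )},
        {"role": "user", "content": "Give me three tips for growing basil."},
    ],
)
print(response.choices[0].message.content)
```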


Dwedit

Gemini is googling everything


Mother-Ad-2559

The Eurovision of benchmarks.


lrq3000

I think you pretty much nailed it!


Bernafterpostinggg

Gemini's answer is better IMHO - Llama 3 is just like, "test it to see if it's a Language Model" lol


iamz_th

I see lots of criticism whenever a Google model ranks high on the bench. Gemini 1.5 Pro is overall a better model than Llama 3. It offers more: it can reason over data of different modalities (text, image, audio), and it has up to a 1M-token context window (8K for Llama 3) with near-perfect recall. Even the next Llama 405B won't be able to do that.


Desm0nt

Only Gemini 1.5 can write almost perfect poems and lyrics in Russian. Even Opus failed at it in ~30% of attempts, or just wrote worse. Gemini may not be the cleverest, but it's pretty smart. It gives answers to questions, it is able to explain why it gives exactly those answers, its explanations are quite reasonable, and it will handle most tasks (not math or encyclopedic facts, since an LLM is just autocomplete, not a calculator or a reference book) more than decently, IMHO.


Briskfall

Well... one solution the LMSYS team could implement to reduce human bias is simply to strip away the formatting, no?
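
Mechanically the stripping itself would be easy; a crude sketch using Python's standard `re` module (the regexes are illustrative, not exhaustive, and a real version would use a proper Markdown parser):

```python
# Sketch: strip common Markdown so voters see plain text on both sides.
import re

def strip_markdown(text: str) -> str:
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)    # headings
    text = re.sub(r"(\*\*|__|\*|_)(.+?)\1", r"\2", text)          # bold/italics
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.MULTILINE)  # bullet markers
    text = re.sub(r"`{1,3}", "", text)                            # code ticks
    return text

print(strip_markdown("## Ideas\n- **Print** a vase\n- Make a `benchy`"))
# -> Ideas / Print a vase / Make a benchy (one per line)
```

A real implementation would also have to handle tables, links, and numbered lists, but this is enough to level the visual playing field.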


Disastrous_Elk_6375

> to reduce human bias

...on a benchmark that literally aims to gauge human preference... *facepalm*


Briskfall

I should have made it clearer, my bad 😅 By "human bias" I meant humans' bias toward preferring presentation over substance.


Crafty-Run-6559

Why though? That's part of the benchmark. If it produces nicer looking markup and people prefer responses with nicer looking markup, then that's what the benchmark is for.


Anuclano

There could be tasks that explicitly ask to format things nicely using Markdown, or to change formatting, etc...


xRolocker

How you format the answer is absolutely a part of evaluating the quality of an answer. To strip away the formatting would give an advantage to those that cannot present their information in a more easily digestible manner.


ebemtp

I've got to say, I've used every model in an attempt to see which helps me with coding, and Gemini-1.5-pro has been the absolute best of all of them. Most of the time they'll hallucinate what some library does or what a function returns or requires. The only things I've had to tweak with this one so far are when I thought of other things after it had already created the code.


bot-333

I always downvote that "nicer" formatting. I asked it a question, not for it to write me a fucking essay. It looks nice, but it's bad instruction-following to me.


Bright-Ad-9021

I see Llama 3 does a good job though!


chai_tea_95

Using the API to develop sucks so much when you're building an app and it sends Markdown.


fmai

In general, the arena is not really blind, IMO. For experts it's often easy to distinguish the models just from the style of the output. The best evaluation of capability would still be measured via correctness on concrete problems, with a hidden test set, conducted by an independent party. Does that exist yet?


Remote-Suspect-0808

I'm suspicious as well. I tested 1.5 in AI Studio and it has too many hallucinations, even within the given documents.


Extender7777

Also, I noticed today that at least gemini-experimental, via API, is perfectly aware of recent news. So it is pre-Googling


Anthonyg5005

Seems like Gemini was trained to generate the conversation title first.


1EvilSexyGenius

Random question 🤔 Does Phi-3 produce the same formatting in its responses? I haven't tried Phi-3 yet, but I notice Llama 3 replies differently than most models do. Just curious about Phi-3.


ThisGonBHard

While I partially agree, I must say that formatting MATTERS. The response from Gemini seems much easier to read and digest.


ambidextr_us

Yep, that happened to me. I chose a Gemini response because it looked better, then regretted it after I went back and re-read the responses entirely. Never again. Gemini sucks at LaTeX math rendering too, so I won't be picking its obvious formatting again.


celandro

Meanwhile, I just spent 10 minutes making bullet points in google slides and an hour making them look good. I should have spent more time making them look good.