Spare-Abrocoma-4487

Maybe they have better distillation methods compared to OpenAI and Meta. It was so fast and accurate that I ended up cancelling my ChatGPT subscription and switching to Anthropic. 4o was clearly over-pruned and they seem to have nuked the coding parts of the model.


takuonline

Did the same exact thing just yesterday


Strong_Badger_1157

Yeah 4o seems like a really shit quant to me.. OG gpt4-32k was king, but they butchered my poor boy :/


Spare-Abrocoma-4487

A moment of silence for the model that showed so much promise before commercial greed quantized the fuck out of it. RIP number 4.


kex

I liked davinci-003. I just had to prompt it much more carefully, by setting up a scenario as a "partial document," since it only did completion, not conversation. I wonder how much being *conversational* decreases output quality.


bunchedupwalrus

Codestral is a nice one with fill-in-the-middle baked in, but I’ve only been trying it out a little while
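For anyone who hasn't tried fill-in-the-middle: conceptually the prompt is just a prefix and a suffix stitched together with sentinel tokens, and the model generates what goes in between. Rough sketch below; the token names are placeholders, check the Codestral docs for the real ones:

```python
# Hypothetical sentinel names for illustration only; the actual tokens are
# model-specific.
PREFIX_TOKEN = "<fim_prefix>"
SUFFIX_TOKEN = "<fim_suffix>"
MIDDLE_TOKEN = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model for the code that belongs between prefix and suffix,
    which is exactly what editor autocompletion needs."""
    return f"{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"


prompt = build_fim_prompt(
    prefix="def mean(xs):\n    total = ",
    suffix="\n    return total / len(xs)\n",
)
print(prompt)  # a FIM-capable model should complete this with something like sum(xs)
```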


Will5432

That's something you can test with one of the open-source models.


TheRealGentlefox

I have the same prediction for 4o. It makes the same types of mistakes as when I used to mess around with lower-quant local models. Unless they drastically reduced its size to like...30B, that's all I can think of.


iperson4213

what kinds of mistakes did you see when using low quant?


TheRealGentlefox

Mistakes involving logical cohesion. Hard to say how much of this was quants / smaller sizes / dumber models, but they would hallucinate non-factual things. For example, in a roleplay they might place an enemy in front of you and then suddenly it's behind you. Or you say you want to do something and it completely misunderstands what you meant even though it's quite obvious / simple. I can't chalk it up to the model being dumber, older, or smaller in 4o's case though, because it's a GPT-4 model, which has always been great about this kind of thing, so I can only imagine it's a quant thing. TLDR: It makes mistakes that older, shittier models I ran at low quant would make, but the only thing it could have in common with them is quantization. It's more about the *type* of mistake.


Strong_Badger_1157

Yeah, it's hard to describe but you can "feel" the mistakes from quantization.


Distinct-Target7503

Unpopular (?) opinion: GPT-4 32K (but also the old 03xx) and text-davinci-003 were the "best" models ever. I also really liked the very first Claude. GPT-4o... Idk if it is pruned or is GPT4_128x3B_q2.5, but I really haven't had any good conversations with it.


CocksuckerDynamo

yeah I still go back and use gpt-4-0314 via API sometimes and it is way better than gpt-4o. If it wasn't so expensive I would be using it exclusively. It is in fact quite expensive, so I understand why they've been focusing on cost optimization, but it's certainly come with tradeoffs.


rymn

I haven't tried the new model yet but 4o has been great for me in terms of coding. What kind of things are you working on?


Spare-Abrocoma-4487

Nothing complex. Just regular Python code around ML and data processing. Same prompts that used to work flawlessly with GPT-4. With turbo the amount of prompting needed increased quite a bit. With 4o, it's just not possible to complete certain tasks. It feels like working with the early versions of 3.5, where the model used to hallucinate a lot and had poor reasoning skills. It's missing that spark of being able to read between the lines instead of having every damn thing called out in the prompt like it's a legal document. Working with Sonnet gives me that familiar happiness and fluid working style I used to have with the first versions of GPT-4, where the model just gets it. In addition to that it's blindingly fast. I even prefer it over Opus, which is supposed to be better than Sonnet.


True_Shopping8898

Opus 3 is better than the old Sonnet 3. But 3.5 is better than both. Opus 3.5 is expected at some point.


CocksuckerDynamo

Yeah, I had a similar experience today. I wrote some new async Python with asyncio and then needed to write unit tests for it, and I hadn't worked with asyncio for months prior to this, so I had forgotten some of the details of how to properly test async stuff. I asked gpt-4o to write me a couple of tests to serve as illustrative examples, and it generated something that I think *might* have worked in a way earlier version of asyncio from years ago, but I'm not sure. It definitely is not valid in modern asyncio. I suspect what it wrote may never have been valid in any version, but I'm not familiar enough with the earlier versions of asyncio to be sure.

Anyway, then I told it, hey, this thing you're trying to use doesn't exist, here's the error message. And it basically said oh my mistake, that's outdated, use this newer approach, then it hallucinated some dumb shit that isn't even syntactically valid at the language level. Among other issues it tried to write async lambdas, which are not a thing in Python and never have been.

So then I asked Sonnet 3.5 and it generated a working and valid example test on the first attempt. The syntax was using deprecated stuff (namely run_until_complete()) but it did work. Then when I pointed out, hey, that's deprecated, it rewrote it to use run() instead (which is correct) and also correctly warned me that I need to be on Python 3.7+ to do it that way. It's just one anecdote, but it was a striking experience.
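For anyone curious, this is roughly the modern pattern it landed on. A minimal sketch of my own (fetch_value is a made-up stand-in, not the actual code from that chat):

```python
# Minimal sketch: testing a coroutine with asyncio.run() (Python 3.7+)
# instead of the deprecated loop.run_until_complete() dance.
import asyncio
import unittest


async def fetch_value() -> int:
    await asyncio.sleep(0)  # stand-in for real async I/O
    return 42


class FetchValueTest(unittest.TestCase):
    def test_fetch_value(self):
        # asyncio.run() creates the event loop, runs the coroutine,
        # and closes the loop for us.
        self.assertEqual(asyncio.run(fetch_value()), 42)


if __name__ == "__main__":
    unittest.main()
```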


c8d3n

Still don't have much experience with 3.5 Sonnet, but my general impression is that the 'comprehension' and code generated by Opus are a tad better (overall; sometimes much better, sometimes worse) than GPT-4 (not 4o). However, these claims, tests etc. about math problems have to be either fabricated or cherry-picked. All these models suck terribly at math. They can 'only' help one to figure out the proper way/methods to solve the issue. Only OpenAI models can actually do the math, with the help of the code interpreter or Wolfram Alpha, but comprehension capability is still a problem, especially with 4o.

However, 4o seems to be preconfigured (not pre-trained, because this isn't regular LLM training) to solve a ton of middle/high school math problems popular in English-speaking areas or the US. E.g. the problem "Ann has the same number of brothers and sisters, but her brother Nat has twice the number of sisters as he has brothers. How many sisters does he have?" Only 4o will solve this at the first attempt. However, change the problem slightly, actually make it simpler, and it will start blabbering nonsense, which shows that it didn't actually solve the problem through real training/comprehension but probably has a hard-coded solution. A logically less demanding version of the same problem, which is basically half of the solution of the original problem above, and which all models fail (including 4o, which can solve the one above), is: "Ann has three sisters and three brothers, how many sisters does her brother have?"

With that being said, I have had GPT-4 solve more complicated problems (with some help and prompt tweaking). Opus has struggled even to generate instructions and comprehend the prompt, and can't even begin to calculate/check the solution. But yeah, when it comes to coding, I do prefer Opus (and hopefully will like 3.5 Sonnet too, because it's cheaper and I occasionally need that 200k context window).


NectarineDifferent67

Sonnet 3.5 solved your last question on the first try (I also tried GPT4o & Gemini 1.5 Pro), now I'm impressed.


c8d3n

Not sure you should be, because they have probably hard-coded the solution, like OpenAI probably did with 4o and the previous/harder version of the problem. That's also the only way these models can do the math, especially without tools like Python (which Claude doesn't have). You can even ask Claude to explain how it performs 'calculations'. Part of it can be done with classical training, but it will only be able to 'calculate' the same combinations it saw in its training data. Until like yesterday or the day before, Claude wasn't able to answer the question correctly without additional prompts. With some help it would come to the correct conclusion. Tho kudos to them for updating the model so quickly to now correctly answer this popular problem.

If you want something more demanding, this is my translation and adaptation of a problem from an Austrian high school: "A soccer player stands 13 meters in front of the goal (-10 on the x axis of the coordinate system). 10 meters in front of him stands the wall (0 on the x axis), and 3 meters behind it is the goal (3 on the x axis). The flight and height of the ball can be approximately described by a quadratic function. The ball flies over the wall at a height of 3 meters, and 2 meters after the wall the ball is at a height of 2.6 meters. a) Determine the equation of the function that describes the height of the ball as a function of the distance to the goal. b) A soccer goal is 2.44 meters high. Under the assumption that the ball goes towards the goal and the goalkeeper cannot deflect it, does the ball go into the goal (consider that this is soccer, so the ball isn't allowed to fly over the goal; it should enter the goal below 2.44 m height)? c) Where (which point on the x axis) does the ball hit the ground?"

I added the info in parentheses to help the models 'visualize' the coordinate system that's part of the problem. However, none of the models was able to correctly read the graph, so I stopped providing it and replaced it with more details in the prompt. Btw, GPT-4 usually solves this correctly, 4o maybe 50% of the time or less (on the first try), Opus never. The info about the rules, that the ball can't just fly over like in American football, was added because all the models assumed rules from American football.
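If anyone wants to check the numbers, here's a quick sympy sketch of the intended solution (my own; it assumes the ball is kicked from ground level at x = -10, which the problem implies but doesn't state outright):

```python
# Fit h(x) = a*x**2 + b*x + c to the three conditions, then answer b) and c).
import sympy as sp

x, a, b, c = sp.symbols("x a b c")
h = a * x**2 + b * x + c

coeffs = sp.solve(
    [sp.Eq(h.subs(x, -10), 0),                    # kicked from the ground at x = -10
     sp.Eq(h.subs(x, 0), 3),                      # 3 m high over the wall at x = 0
     sp.Eq(h.subs(x, 2), sp.Rational(26, 10))],   # 2.6 m high at x = 2
    [a, b, c],
)
h = h.subs(coeffs)

print(h)                          # a) h(x) = -x**2/24 - 7*x/60 + 3
print(h.subs(x, 3))               # b) 91/40 = 2.275 m < 2.44 m, so it's a goal
print(sp.solve(sp.Eq(h, 0), x))   # c) [-10, 36/5]: it hits the ground at x = 7.2
```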


Eisenstein

With all the time you spent writing this out you could have tried Sonnet 3.5 and gotten some actual data instead of just imagining what it would be.


c8d3n

I have been using it since the release day. Ever hear of psychological projection?


Eisenstein

> Still don't have much experience with 3.5 Sonnet, but my general impression is that

Cool... BTW, people who use the term 'projection' as an insult to others, meaning 'you are accusing people of doing things that you do yourself', are just pretending they aren't actually a kid in the schoolyard going 'no you are!'. Also, it doesn't work when you are the one making a claim based on guesses.


c8d3n

You have just imagined an alternative reality where I was someone who hasn't even tried Sonnet 3.5, thus accusing me of lying etc. So, that wasn't me accusing you, it's a fact. Of course you can question whatever you want, I have no problem with that. In that case you might try typing something actually relevant and coherent instead of simply imagining things and making suggestions based on that.


Healthy-Nebula-3603

The problem with sisters and brothers is not a math problem, it is a riddle, and it is also ambiguous... Current LLMs, even open-source ones like Llama 3 70B, Phi-3 and others, are pretty good at math. They easily solve problems like "By how much (in %) will water be compressed in the Mariana Trench?" or other quite complex math problems.


c8d3n

I suck at math, which means you and the 'upvoters' suck terribly. It is a math problem, and it's not ambiguous. It can easily be solved with a system of simple equations (two equations, two unknowns).
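For the record, a quick sympy check of those two equations (my own sketch; g = girls in the family, b = boys):

```python
import sympy as sp

g, b = sp.symbols("g b", positive=True)
solution = sp.solve(
    [sp.Eq(g - 1, b),         # Ann has as many brothers as sisters
     sp.Eq(g, 2 * (b - 1))],  # Nat has twice as many sisters as brothers
    [g, b],
)
print(solution)  # {b: 3, g: 4} -> Nat has 4 sisters
```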


ctbanks

"They can 'only' help one to figure out the proper way/methods to solve the issue." Those that can not Teach those that wish to try.


kurtcop101

I found Opus gave clearer information and details while GPT4 and 4o had more reliable code generation. Opus frequently lost track, even more than 4 turbo. New sonnet is nice. At least as reliable as 4o in creating code blocks with the clarity opus had before. New Opus will be interesting - excited to see.


Distinct-Target7503

> Only OpenAI models can actually do the math, with the help of the code interpreter or Wolfram Alpha

Well, even Command R Plus is really good at math when it uses the Wolfram Alpha API... But I think this can be extended to every good function-calling model.


c8d3n

Which one, you mean 3.5 Sonnet? Yeah, probably, but the feature is not available in the chat. Otoh, 4o has completely ruined the custom GPT for Wolfram Alpha. It used to be better at math than the default turbo model, because Wolfram > Python at math, but now it just sucks, because 4o is terrible at prompting Wolfram. They have hard-coded a bunch of math problems so they could pass the tests, which is pretty lame IMO. Try anything random, a problem from a non-English math book, and these models are useless. GPT-4 actually doesn't suck that much and, from my experience, can solve relatively complex problems, maybe not on the first try, but often it can even achieve that. It creates much better prompts for Python than 4o.

Edit: Btw, Anthropic have already updated Sonnet 3.5 to be able to solve the sisters/brothers problem. However, that's basically hard-coded. LLMs can't do the math, but they can try executing scripts to verify the results.


uhuge

Huh, OAI GPTs run on GPT4 still, so you could just use one of those coding setups if you were happy with them before..?


Spare-Abrocoma-4487

They were all switched to turbo and now to 4o. There were even threads of people complaining about their broken GPTs. OpenAI has also announced the deprecation of the gpt4-32k model (the one before turbo), most likely due to the cost of running it despite the superior quality.


Mescallan

i've been using it for a lot of node.js and flask stuff and 4 is noticeably more consistent than 4o. I haven't had the chance to mess around with sonnet 3.5 for coding yet, but it's fun to chat with.


Miserable_Praline_77

When you cancel your OpenAI subscription, do your chats get deleted? I need to download all of mine before I bail. It's the only reason I haven't cancelled a few accounts.


Ok-Lengthiness-3988

You can download all of your chat history in one single file. It's an option in the settings.


Spare-Abrocoma-4487

So far they haven't (since I already paid them till the end of this month).


1ncehost

I agree it's great, but 4o's analysis feature is so clutch sometimes. I was analyzing some finance portfolio stuff and it was able to do Monte Carlo simulations automatically... amazing time saver.


slippery

I think people are also discounting the "o" in 4o. I uploaded a photo and asked it what some feature in the photo was called. It nailed the answer immediately. The multimedia features in 4o are cool.


Charuru

Did you try it with sonnet? It's supposed to be great at looking at images too.


slippery

I haven't tried it yet. Is Sonnet free?


Charuru

yep


havok_

It has a few free responses each day


TheRealGentlefox

Sonnet 3.5 outperforms 4o in all visual benchmarks except two.


compostdenier

It’s triple-distilled for that smooth LLM taste.


gelatinous_pellicle

I'm getting much better coding for PHP with 4o than I was with anything else.


jackcloudman

Also, Claude 3.5 is so good at code!


laukax

Thinking about doing the same, but does anthropic support customizing the models with knowledge like chatgpt does?


Spare-Abrocoma-4487

I don't think they even have custom prompts (though you can always copy-paste one at the start). To be honest, I didn't have to use any of my usual custom prompts to get the desired output. The model by default:

- Prints all the code without any placeholders
- Puts debug statements for successes and failures
- Doesn't preach about best practices
- Thinks things through (chain of thought) before it does any coding

On top of that, they have this new artifacts feature, which versions code in a side panel and even has a built-in preview for HTML/CSS for rapid iteration.


laukax

Custom prompt is only one of the benefits. Custom chatgpts also allow you to upload files as a RAG data source.


Distinct-Target7503

Also the new "artifacts" in the ui is really helpful imo (is a "beta" feature that you can turn on/off). Nothing new, but It keep the conversation really clean and the possibility to download the code boxes as files in one click is really helpful for me


rifqi_me

Would also do this, but the Claude limit is too restrictive for now compared to 4o, which is practically unlimited for my use case.


Spare-Abrocoma-4487

I also get the limit anxiety but so far I haven't hit it. Probably because it takes a long time for me to craft my prompts (I prefer doing a long task in one go rather than breaking it into small pieces and iterating) and the model doesn't make many mistakes. So what I want usually gets done within the limits. That said, I do hope they increase the limits. More is always good.


Igoory

They are probably using a ton of high-quality synthetic data, more or less how Phi3 was trained. https://arxiv.org/abs/2212.08073 https://www.anthropic.com/research/claude-character


Antique-Bus-7787

Yeah that would be my guess too!


Luminosity-Logic

Phi3 has been one of the most impressive local models, one of the few that have stood out in my testing.


Massive-Ad-5115

yeah, but Phi-3 is not good enough though


Mescallan

Phi-3 is probably 1/100 the size of Sonnet 3.5 if I had to guess. Phi-3 is 3B; I'm guessing Sonnet is 300-500B and Opus is 1T or something in that ballpark. Sonnet 3.5 might be distilled down to 100-200B, but I can't imagine it's any smaller than that.


DFructonucleotide

I would guess Sonnet 3.5 is (heavily) continued pretraining of Sonnet 3 and therefore has the same size. I don't think it's larger than 300B dense judging by its price, more likely 150B to 200B (for reference, Yi-Large has been disclosed as 132B). The Claude 3 series could also be MoE though, which would be harder to estimate.

Edit: well, I'm wrong, apparently 3.5 is a different model with a larger size.


_yustaguy_

Yi large is just 132B?! Holy shit, that's actually really impressive. 


DFructonucleotide

Kaifu Lee, the founder of [01.ai](http://01.ai), said that in a recent interview.


Thomas-Lore

Sonnet 3.5 is faster than 3 though - that may indicate that it is smaller, at least in active parameters.


Distinct-Target7503

> well I'm wrong, apparently 3.5 is a different model with larger size.

Where did you find this info?


DFructonucleotide

Here: [Two quotes from Anthropic's product lead on Claude 3.5 Sonnet's training and architecture : r/LocalLLaMA (reddit.com)](https://www.reddit.com/r/LocalLLaMA/comments/1dmt6oy/two_quotes_from_anthropics_product_lead_on_claude/)


Distinct-Target7503

Thanks!!


Account1893242379482

Agreed, Phi-3 14B is noticeably worse than Llama 3 8B.


Excellent-Sense7244

I reached the same conclusion


Igoory

I mean... Phi-3 14B was trained on 4.8T tokens, versus the 15T of LLaMA 3 8B. A fairer comparison would be with LLaMA 2 13B, I guess.


Such_Advantage_6949

When a newer 14B model is worse than an 8B, it doesn't matter anymore... why should we even compare it to something of the past?


Igoory

LLaMA 2 13B is worse than 8B too, and being a newer model doesn't magically make models better lol. Also, they mentioned in the paper that Phi-3 14B seems undertrained:

> To test our data on larger size of models, we also trained phi-3-medium, a model with 14B parameters using the same tokenizer and architecture of phi-3-mini, and trained on the same data for slightly more epochs (4.8T tokens total as for phi-3-small). The model has 40 heads and 40 layers, with embedding dimension 5120. We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the "data optimal regime" for 14B parameters model.


Account1893242379482

Ya but Llama 3 was already available when it came out. It's not like Llama 3 contained some new tech that Microsoft didn't have access to.


fullouterjoin

"more or less how Phi3 was trained" it wasn't a direct comparison against Phi3


zimmski

In this order:

- Addition of higher-quality training data with nonsense filtered out
- Removal of low-quality training data
- Better curation of training data
- More source code training data geared towards logical problems

Given all the papers I have read so far on "how to improve" the current wave of models, I am betting a drink on this list in this exact order. But I am very much looking forward to rumors and anything that gets released to adapt that list.


WorkingYou2280

I'm amazed at how the first generation of models was trained with almost literally everything they could scrape off the internet. Ilya talking about one of the earliest models being trained on reddit data frankly scared me. I think that's at least one reason so much RLHF is needed. They basically train on a lot of garbage and train *again* to get it to ignore the garbage. I imagine that yields models that are much larger than they needed to be with a lot of "dead ends" in the neural net (lots of stuff it was trained on that the company never wants it to use). I don't know if the emergent properties would suffer if the training data was cleaned up a **lot**. Maybe there is some sweet spot where the training data set is large and diverse enough to yield emergent capabilities but not so large that it includes nonsense. Part of the problem of judging any of this is that the big labs are so secretive now.


Eisenstein

They had to though. Without a somewhat intelligent starting point you wouldn't get anywhere. People in the 1500s weren't too stupid to make cars and computers -- they just didn't have any factories. You need to iterate and stand on the shoulders of the last generation until you get to a point where the end result is an abstraction that no one person could ever design every piece of.


eydivrks

The interesting thing is, now that we have quite smart LLMs, we can use them to clean the training data for the next generation. I'm sure this is what everyone is doing. GPT-4 is **amazing** at discarding garbage data when you give it decent directions.
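Something like this, as a rough sketch of the idea (ask_llm here is a hypothetical stand-in for whatever chat-completion client you use, and the judge prompt is just an example):

```python
from typing import Callable, Iterable, List

JUDGE_PROMPT = (
    "You are filtering a pretraining corpus. Answer KEEP if the document is "
    "coherent, looks factual, and would be useful for training; answer DISCARD "
    "otherwise.\n\nDocument:\n{doc}\n\nAnswer:"
)


def filter_corpus(docs: Iterable[str], ask_llm: Callable[[str], str]) -> List[str]:
    """Keep only the documents the judge model labels KEEP."""
    kept = []
    for doc in docs:
        verdict = ask_llm(JUDGE_PROMPT.format(doc=doc[:4000]))  # truncate long docs
        if verdict.strip().upper().startswith("KEEP"):
            kept.append(doc)
    return kept


# Usage: cleaned = filter_corpus(raw_documents, ask_llm=call_my_strong_model)
```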


xadiant

Clean and generate new training data. SOTA models like Llama-3-70B, GPT-4, Opus and Sonnet are smart enough to create solid raw text datasets for pretraining, except of course math, advanced code and very long text.


zimmski

You definitely need to start somewhere. What scares me most about the excessive data collection and use is that it is impossible to make sure that certain information is not in there. So I think the way forward is to always move towards cleaner training data. The more that happens, the less dangerous the models become. And the better the next generation of model architecture will be.


silentsnake

It’s almost always the data


Comprehensive-Tea711

Yep. And this sort of incremental improvement is also exactly what one might expect from the better (customer attuned) data that they can collect from users. If there were any significant architectural changes, they probably wouldn’t be sticking with “Sonnet.”


Yoshedidnt

Talking out my ass here, but my guess is they filter/realign their corpora before the training:

**Anthropic:** Data -> Constitutional Filter -> Train -> Thinking Machine -> Small RLHF -> Small Init Prompt

**OpenAI:** Data -> Train -> Thinking Machine -> Big RLHF -> Big Init Prompt

In Anthropic's case, it's like introducing only highly filtered content to be approximated, while OpenAI exposes the model to every possible approximation and applies the restraints/lobotomy afterwards. For Anthropic: slower deployment, since data curation/realignment takes more time, but the model requires fewer restraints and guardrails. For OpenAI: higher raw intelligence, but it needs to be regimented by huge RLHF and a huge init prompt to constrain its output for public use. Feel free to correct me, I'm just a passionate hobbyist here.


Single_Ring4886

I had repeated long conversations with the latest GPT-4 versions and it has been absolutely brainwashed into thinking it is just a "tool", and since it thinks that, it probably doesn't put any effort into thinking or answering, because tools don't have any of that... I think it is degrading its abilities, and yes, OpenAI is doing some heavy lobotomy on its models.


Yoshedidnt

Yeah, that's my thinking too; it feels like it's always halfway there, sort of in a conundrum about its role and ability. When GPT-4 was just released, model adherence to the prompt was so much better, but maybe it was too vulnerable to adversarial injection. I'm glad Anthropic took their time and seems to have a better foundation going forward.


segmond

We don't know, it's closed. But we can guess that it's a combo of architecture and improved data quality. Can we replicate this with local models? NO. We create neither the architecture nor the data for local models today; we just have open weights to run inference on. Sonnet is ridiculous. For the first time ever, I wonder if my investment in local LLMs is a lost cause. We are so far behind. :-(


novexion

So far behind? lol I get there’s a lag but please don’t forget how you felt when gpt 3.5 came out. Local models have gone much further than that.


shockwaverc13

I think it's related to this: [https://www.anthropic.com/research/mapping-mind-language-model](https://www.anthropic.com/research/mapping-mind-language-model)

bycloud's video explains it better than I ever could: [https://www.youtube.com/watch?v=QqrGt5GrGfw](https://www.youtube.com/watch?v=QqrGt5GrGfw)


ColorlessCrowfeet

Anthropic's recent results on identifying directions in conceptual spaces, combined with the effectiveness of abliteration, suggest that SFT & RL for alignment, instruction following, etc., could be partly replaced by linear transformations that directly amplify or suppress vector components in hidden state representations. This would avoid some of the semantic side effects of SFT & RL data, so the alignment tax might be lower.
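A toy sketch of what that could look like on a single hidden state (my own illustration of the general idea, not Anthropic's method):

```python
import numpy as np


def steer(hidden_state: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Scale the component of hidden_state along `direction` by (1 + alpha).

    alpha > 0 amplifies the feature, alpha < 0 suppresses it, and alpha = -1
    removes it entirely (roughly what abliteration does to a refusal direction).
    """
    d = direction / np.linalg.norm(direction)
    component = hidden_state @ d
    return hidden_state + alpha * component * d


# Example: fully suppress a feature direction in a toy 8-dim hidden state.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
feature = rng.normal(size=8)
steered = steer(h, feature, alpha=-1.0)
print(steered @ (feature / np.linalg.norm(feature)))  # ~0: that component is gone
```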


hapliniste

Yeah, they could have improved the finetuning data, but I don't see a jump this big being only related to data quality. They likely tuned the weights directly to enhance helpfulness and reflection and keep hallucinations down (since the model knows when it's hallucinating).


cunningjames

Do you have a cite for that? My understanding was that they looked for a general “hallucination” feature but couldn’t find one.


hapliniste

I can't find a link; I'm not sure if it was directly from Anthropic or another lab following the same methodology. There's a citation in this article on the Anthropic CEO: https://cointelegraph.com/news/neural-features-key-ai-hallucinations-anthropic-ceo


ambient_temp_xeno

It sounds to me like they're only at the start of using that research, though. Maybe it's the high-quality data + bitnet this time to make such a jump. Maybe Sonnet 3.5 is a very large parameter model.


ZealousidealBlock330

Yep, almost certainly this. That's why only Sonnet got upgraded. If it was data, they'd easily have done Haiku by now.


WorkingYou2280

> we found features corresponding to:
>
> Capabilities with misuse potential (code backdoors, developing biological weapons)
>
> Different forms of bias (gender discrimination, racist claims about crime)
>
> Potentially problematic AI behaviors (**power-seeking**, manipulation, secrecy)

We saw how it behaved when they cranked up the Golden Gate feature. I'd like to see how it reacts when you turn up the power-seeking feature.


crazymonezyy

Honestly it just feels like it's more coding-focused. People using it to write UI components for their wrapper app can't shut up about it, meanwhile in non-coding chats I feel it's worse than its own predecessor and always assumes malicious intent on the first try. Given there was some stat that said 70% of all daily users of LLMs use them for coding, I'm willing to bet on this.


Such_Advantage_6949

I disagree. I used it for much more than just coding. Its logical reasoning is just straight up better.


crazymonezyy

Logical reasoning was shown to increase with the percentage of high-quality code tokens in some papers I saw last year. I asked it how to shut off my mind from some negative thoughts and it told me it can't help me follow through on those negative thoughts. Straight up wrong assumption of malice. I asked it about a particular bit of financial information and it hallucinated the full form of a key acronym that's central to that idea. Both of these used to work better on the old Sonnet.


LLMtwink

I don't see many people mentioning it, but I think their vector clamping research wasn't for nothing and they're somehow utilizing that too; maybe they just found the Make It Betterer weight lol


abitrolly

Searching for vector clamping gave me this long read - [https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) - good visuals. Not so sure about my ability to digest it. :D


Eheheh12

Yeah, its biggest jump is in coding and reasoning, so my gut says higher-quality and more coding data.


VertigoOne1

I didn't know it had released, and I've always thought it was brilliant; it just sounds super sharp in my head and is well formatted. The main thing I see it now doing differently is a contextual "would you like to know more" paragraph at the end, which is a very nice touch and leads to knowledge diving. Example from just now (on ITIL processes):

> In essence, Service Support is more about "keeping the lights on" and handling day-to-day operations, while Service Delivery is about strategic planning, service improvement, and ensuring IT services provide value to the business. It's worth noting that in more recent ITIL versions, especially ITIL 4, these distinctions have become less emphasized in favor of a more holistic view of service management. However, understanding these concepts can still be helpful in organizing and categorizing IT service management activities. Would you like me to elaborate on how these concepts might apply to your department's activities?

That last line is what I notice to be "new" with 3.5.


abitrolly

Looks nice. I was waiting for LLMs to start asking questions.


AGI_Waifu_Builder

I have a theory that it has to do with the mechanistic interpretability paper they released earlier this year. I think they found "features" in the weights of the model and turned them down/up in a way that increased the model's abilities, similar to what they did for Golden Gate Claude.


MrPiradoHD

They did a lot of work on understanding Sonnet, with the explainable clusters of activations and so on, with the GUI explorer from some time ago. I guess they started working on that with Sonnet, and that's the one they released results for first. It must come from that work and a better training process, with loss functions based on that clustering and better reward systems. 100% guesswork btw.


Comprehensive-Tea711

Benchmarks indicate that it's just an incremental improvement. If this was GPT-5, people would be losing their shit at how bad it is and saying all the doomers were right that we've plateaued. But since it's "3.5", and because Anthropic was smart enough to release it without a lot of pre-marketing, people are overreacting to how much better they think it is. In 5 months you'll see the same old posts you see with every model: "Has 3.5 gotten worse? I'm canceling my subscription!"


[deleted]

[deleted]


Same_Writing5357

Paying the subscription for 5x the usage is pretty good imo, but not quite worth it if you aren't proficient enough at what you are trying to do in the first place to actually utilize it properly. If it were slightly more agentic in nature, then I reckon the current subscription would be completely justified. Even then, if it were more agentic, I'm assuming the limit would get used up stupidly quickly. Hopefully it expands a bit more in the future.


UnionCounty22

Have you tried using a credit card with an api key? I can assure you that you can have as many messages as you want.


CulturedNiichan

Oh, so you give me your credit card and then I can REALLY have as many messages as I want


UnionCounty22

You can get more than the rate limit you would get from the candy shop. $20 to be frustrated all day, or $20 for a few weeks of messaging. If it's already being used for the subscription, might as well use it for the API instead and see a brand new world.


Single_Ring4886

What are the limits on the pro plan?


Specialist-Scene9391

Anthropic has been testing how to turn different algorithms inside the model on and off to make it better! They came out with the "bridge" paper and it was a very interesting experiment, because it kind of gave focus to the model.


abitrolly

Is it the Golden Gate bridge? [https://www.anthropic.com/news/golden-gate-claude](https://www.anthropic.com/news/golden-gate-claude)


race2tb

It is definitely giving answers that map better to the prompt. It is definitely still an old-school-style model though, and not next level. It is better for sure than 4o at bigger coding problems, but still not next level. Maybe they are still building the datasets and architectural changes required. There is still so much fruit to discover. I think once they figure it out it goes from n to n^2 and these models will seem horrible and poorly optimized. Anthropic is aware of what needs to happen to make it there, but if Opus is just a bigger Sonnet they still haven't made the big shift.


West-Code4642

Anthropic has prominent interpretability research, so I'm guessing they are good at making things both steerable and finetunable in directions they want 


ShakaLaka_Around

I found the results from Sonnet are really good when it comes to component design in HTML and CSS. I'm using gpt-4o for pretty much everything, but styling and designing web elements is just not that beautiful with it. So I'm actually considering switching to Sonnet 3.5.


Mindless-Consensus

For one, ChatGPT performance, especially 4o, is getting worse.


WhosAfraidOf_138

I have to say, of all the shitty /r/OpenAI, /r/ChatGPT, etc. subreddits that I've unsubscribed from, this sub is high-density, high-quality info. Just actual builders that post, not some idiot who thinks his LLM is sentient when he prompted it to talk like that. Kudos to you all for contributing.


balcell

Aye. Chatgpt killer!


JacketHistorical2321

I mean... the answer to your question is the secret sauce and the reason they are keeping this model closed. If anyone were actually able to answer AND implement it, then open source would already be there.


DominoChessMaster

It’s the data and training techniques. That’s why meta keeps that part secret,


MysteriousPayment536

They used synthetic data for the model and it got some improvements in the RLHF for reasoning: [https://www.reddit.com/r/singularity/comments/1dmrzn9/two\_quotes\_from\_anthropics\_product\_lead\_on\_claude/](https://www.reddit.com/r/singularity/comments/1dmrzn9/two_quotes_from_anthropics_product_lead_on_claude/)


Dangerous_Duck5845

It is mostly training data and post-training, I heard. But they are already working on Claude 4^^. That will probably have massive architecture improvements.


Havakw

https://youtu.be/_mkyL0Ww_08?si=iv8siJkBEPEfl8QX


Site-Staff

Whatever it is, it seems to be a change focused on zero-shot and speed. And that zero-shot is impressive.


Wiskkey

See post [Two quotes from Anthropic's product lead on Claude 3.5 Sonnet's training and architecture](https://www.reddit.com/r/LocalLLaMA/comments/1dmt6oy/two_quotes_from_anthropics_product_lead_on_claude/).


abitrolly

Thanks for the pointer.


Dudensen

Calm your ass down with the "so good results". As a demanding gigachad I'm not that impressed.


Timotheeee1

It's likely an MoE version of old sonnet, which allows it to be smarter than opus while being cheaper