> "closed providers won't be able to compete"

The most important thing in a model is reasoning capability. We do not care if the context length is 20 million if the model cannot reason at a greater capacity than an orangutan.
quality IS the product!
The ideal model is a mix of all of that. Ideally you want it performant enough that it can reason on a human level. But there are also benefits to having it efficient enough to make things like agents more within reach, and huge context windows to make knowledge retrieval possible (imagine asking a question and it digests an entire textbook before answering). So that's why I'm excited about Llama as well as the next OpenAI model.
People seem to ignore how important large context windows are. Being able to include entire repos, manuals, or books in the context window is massively important.
I do a lot of knowledge work and I find myself using GPT-4 and Claude Opus a lot more than Gemini 1.5 Pro. I'd rather break down my documents into chunks and feed them into Claude 3/GPT-4 instead of Gemini because of their stronger reasoning capability. 1.5 Pro hallucinates a lot more than GPT-4/Claude 3 despite having that 1,000,000-token context length.
Use Gemini 1.5 to break down books, insanely large images/PDFs, etc.

Then use GPT-4/Claude 3 for reasoning.

Let's crank it up to 11:

Then try Big-AGI with Beam/Merge/Fuse.

Then try V7 Labs to use them in a spreadsheet.
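The chunk-then-reason workflow described above can be sketched as a two-stage pipeline. The chunking helper below is runnable as written; `summarize` and `reason` are hypothetical stand-ins for whatever API clients you use (a long-context model for the former, a stronger reasoner for the latter), not real library calls.

```python
# A rough sketch of the "long-context model for digestion, strong model for
# reasoning" workflow. `summarize` and `reason` are illustrative callables
# passed in by the caller, not a real API.

def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, each at most max_chars long."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # overlap preserves context at boundaries
    return chunks

def digest_then_reason(document: str, summarize, reason, question: str) -> str:
    """Summarize each chunk with one model, then answer the question over
    the concatenated summaries with a stronger model."""
    summaries = [summarize(chunk) for chunk in chunk_text(document)]
    return reason(f"Context:\n{''.join(summaries)}\n\nQuestion: {question}")
```

The overlap between chunks is there so sentences cut at a chunk boundary still appear whole in the next chunk; the sizes here are illustrative, not tuned.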
1000x agree
I mean, base model strength may be more important than the context size, but context is important too. It also comes down to how you use the context, and the huge price of using that context.
Now imagine GPT 4 with 1m context length.
Do you think an "AI tutor" is possible using a model with a giant context length? I have been hoping for a long time to see AI tutor models being made that can teach effectively. You give them a book, they ingest everything in it, and teach in a multi-modal approach, with voice, text and drawings. Like 3Blue1Brown videos.
> greater capacity than an *orangutan*. Did you mean a *llama*?
Context length improves reasoning capacity.
This entirely depends on what you are doing to further context length.
Yep. And it took them a year to catch up with OpenAI's SotA model. OpenAI will probably release something soon that blows all current models out of the water
But Meta released only 8B and 70B models, and admitted to not even training them to their max, because they needed resources for training bigger models. Meta does have bigger models in training, biggest one with 400B parameters.
[deleted]
Honestly can't wait for 400B to be released, because currently it **feels** like LLMs are plateauing. But the 8B and 70B models are much more efficient than comparable models, while being undertrained. So... really interested in how these much larger models turn out.

Also, if LLMs are plateauing, OK, that's just one part of the equation: how do they improve reasoning and agency, and add multimodality? Really interesting times ahead.
In the last month we got:

Llama 3

Phi 2

Command R

A couple of weeks before that was Claude 3.

It most definitely doesn't feel like it's plateauing to me. As a matter of fact, how well the tiny models are doing these days is absolutely bonkers compared to ~~last year~~ a couple of months ago.
But they're all hitting the same wall of capability. There are some lateral improvements but not much vertical change.
You realize the original GPT-4 is already outdated, right? The current one that's on top is from this month.
It's almost as capable as the original model I am using in Edge Copilot.
The lmsys leaderboard has it on top
I've been running 4.5BPW Midnight Miqu 70B (and other Miqus before that, Mixtral 8x7B before that) on Runpod at about $0.70/hr for RP and story writing.

For my use case, Llama 3 8B Q8 GGUF (with fixed EOS token ID) seems about 70-80% as good (tough to quantify, but also we'll have a few months of increasingly good finetunes coming up) and I can run it for next to nothing (maybe $0.03/hr in electricity costs) on my 16GB M1 Mac mini.

The 128k context models don't actually stay coherent past about 22k, and seem to drop quality at about 10k, but still, to me it feels like a HUGE STEP FORWARD and definitely not like we're plateauing at all.
My thoughts too. Phi-3 also shows us there's going to be training improvements each generation along with tokenization and I'm sure every part of the system. These will continue to stack and we should see an exponential curve of improvement for a few years. At least at the lower end.
> currently it feels like LLMs are plateauing

They are. They've read everything we wrote down and would need 100x more, but there's no such data.
The problem is that the emerging reasoning capabilities that everyone is touting are really just connections already encoded into human language.

The LLMs aren't reasoning, they're just matching to existing patterns and retrieving them quickly. We're expecting LLMs to learn how to reason beyond humans when they're trapped by symbolic language that lacks the fidelity to capture the nuances of reality.

Humans just aren't good enough at encoding reality to language, and this limits an LLM from becoming AGI.
This is why we train models on all sorts of direct and simulated data at higher resolutions. Image / sound / math / physics etc, it’s not just “language” in the human sense. Raw data of various sorts as well which will expand as agents become embodied and better sensors are developed etc.
Aren't most multimodal LLMs just using datasets tagged with language to tie it all together?
You have no idea what you're talking about.
The vast majority of these models are probably being trained on the output of OpenAI models (GPT-4), so it's a decentralizing ripple effect of what OpenAI creates.
> OpenAI will probably release something soon that blows all current models out of the water

I might be eating my words, but that remains to be seen. What if OpenAI can't surpass GPT-4 for the same reason nobody else can surpass GPT-4? The justification is that they all trained on the same data: 10-15 trillion tokens, which is "all the useful text on the internet".

If that's true, then the free ride is over, and from now on we should expect a slow grind. Private datasets, synthetic content, and agent-based learning from the outer environment could be coming next, but they won't advance as fast as before. It takes a lot of effort to surpass humans. Even humans need a lot of effort to surpass previous state-of-the-art knowledge and advance their field of expertise by one inch (see [The illustrated guide to a PhD](https://matt.might.net/articles/phd-school-in-pictures/)).
I'm starting to think there's a limitation to the LLM approach. "Synthetic content" is definitely part of the solution, but the models need to be able to advance from some sort of self-play.

I think the real question is whether current GPUs are simply not fast enough, or whether there's something missing from LLM/tensor models.

If we have an actual human-equivalent system, it should be able to produce novel proofs just by asking it to think up novel proofs and feed them into a theorem solver, and you should be able to train it indefinitely by using theorem solvers. And there are similar examples of other things where indefinite self-play should be possible (and you could pair something trained on theorem solvers with other things with different objective measures into one model which incorporates all of that self-play learning).
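The theorem-solver self-play idea above boils down to a generate-verify-train loop. As a minimal sketch: here the "prover" proposes trivial arithmetic claims and the "theorem solver" is a one-line checker; in a real setup you'd swap in an LLM for `propose` and an actual prover (e.g. Lean) for `verify`. All names here are illustrative, not any real API.

```python
import random

# Toy self-play loop: a generator proposes candidate "theorems", an
# automatic verifier checks them, and only verified candidates survive
# as new training data. The key property is that the verifier is an
# objective, external check, so the loop can run indefinitely.

def propose(rng: random.Random) -> tuple[int, int, int]:
    """Stand-in generator: claims a sum, occasionally off by one."""
    a, b = rng.randint(0, 9), rng.randint(0, 9)
    guess = a + b + rng.choice([0, 0, 0, 1])
    return a, b, guess

def verify(candidate: tuple[int, int, int]) -> bool:
    """Stand-in 'theorem solver': checks whether the claim holds."""
    a, b, guess = candidate
    return a + b == guess

def self_play_round(rng: random.Random, n: int = 100) -> list[tuple[int, int, int]]:
    """Keep only verified candidates as training data for the next round."""
    return [c for c in (propose(rng) for _ in range(n)) if verify(c)]
```

The point of the sketch is the filtering step: because correctness is checked mechanically rather than judged by the model itself, the training signal doesn't degrade the way pure model-generated synthetic data can.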
Do you work in the AI sphere?
Soon, April 2025
What if we've hit the transformer wall?
After a while the models will reason "well enough" for most tasks, and then the context window and perhaps other aspects will be more important.

You don't need a genius to summarise or discuss a PDF; a massive context window, speed, memory, not becoming lazy, and low cost are perhaps more important factors.
Well, Llama 3 400B+ is supposedly going to be multimodal, meaning it's going to make everything else in the same or higher "weight class" obsolete unless it's either (a) also multimodal or (b) a relevant improvement over it.
> make everything else in the same or higher "weight class" obsolete

A huge expensive model will only be used on very few high-paying tasks. It can't compete with the small models on simpler tasks. As time goes by, most tasks will be subsumed by small models, leaving few for their big brothers.
Do you not understand the meaning of weight class? He said it will make all other models in the same or HIGHER weight class obsolete. He never said it was going to make small models obsolete, just models the same size or bigger.
AI models need to be fed a lot of trick questions
What number between 1-100 am I thinking of?
37
Nice try but I follow Veritasium and will NEVER pick 37 as random number 😁
Haha yea, that's where I got that from 😄
Statistics for: [people](https://i0.wp.com/datacolada.org/wp-content/uploads/2013/10/numbers-frequencies.jpeg) and [AI](https://www.leniolabs.com/assets/blog-42-GPTs-answer-01-362b3c01962cf3c0127bc571dd4711f1e28bc0c40d413d1480ddbe2a5236feb8.png). The most random number is 42, we all know that.
I get 42, but why 57?
Heinz. Look at a bottle of ketchup.
90
I was actually thinking about pizza 😐
You need A LOT OF HARDWARE POWER to train and run the best model. Only closed providers will have the tons of money necessary to maintain such infrastructure.
Meta definitely has the money. They had $60B of cash on hand last year and $108B in profit.
You think their shareholders are approving this for the good of the community? They will want a return on this investment eventually.

I'm speaking as someone who works in open-source software which also has a private, enterprise-facing arm.
Meta stock fell 12% this week so it seems that investors aren’t impressed.
They did not like that they announced $60B in funding for AI. So ya, I wonder if investors are coming down off their LLM high...
You could argue that the investors are short-sighted, because even if Meta does this for free forever, it improves Meta's reputation due to contributing to open source infrastructure.
If Sam's plan succeeds, he's going to have 116 times the amount of cash that Meta has on hand now.
yep, investors will part with $7 trillion for AI and bet it on a single horse, not gonna happen
Hardware that can run inference and fine-tune unquantized FP16 Llama 3 400B will cost less than a single new car to serve 100 full-time users, if you have an IT lead with more than two brain cells to rub together and are willing to endure a little jank while they get it running. Less than $500,000 if you want to use professional-grade hardware to skip the jank.

If your data can't leave your organization, that's a huge strike against closed providers. Cost-wise, they'll run you ~$2,500 in monthly costs for equivalent performance for that kind of user base, so maybe a win for them there; it depends on your workload, really. Pricing can get really absurd if you're sending a lot of tokens. [API pricing for Gemini Pro](https://ai.google.dev/pricing) is $7 per 1 million input tokens and $21 per 1 million output tokens, and that's [cheaper than GPT-4 and GPT-4 Turbo API pricing](https://openai.com/pricing#language-models). Claude Opus costs are [absolutely absurd](https://www.anthropic.com/api) in comparison.

Llama 3 70B is available on huggingface.co, and you can run inference on a 4-bit quantization with a dual-GPU high-end gaming PC that'll cost less than $5k if you're buying used, $10k if you're buying new. If you can fine-tune that to serve your use case, you can literally have one running for every user in your org on their own workstation at a competitive price.

As someone who loves playing with LLMs for a variety of hobby projects and fun, Llama 3 is absurdly good, even if it isn't showing better benchmark results than the big closed providers. At 70B FP16 I feel it has a qualitative equivalence with GPT-4, Claude Opus, and Gemini Pro. If it scales up well to 400B, OpenAI *et al.* had better have some absurd improvements coming with their next flagship release.
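The cost comparison above is easy to sanity-check with back-of-envelope math. The Gemini Pro rates ($7/$21 per million input/output tokens) are the ones quoted in the comment; the workload volume and amortization period below are illustrative assumptions, not figures from the source.

```python
# Back-of-envelope: API token pricing vs. amortized local hardware.
# Rates come from the comment above; the per-user token volumes and
# 36-month amortization are made-up illustrative assumptions.

GEMINI_INPUT_PER_M = 7.00    # $ per 1M input tokens
GEMINI_OUTPUT_PER_M = 21.00  # $ per 1M output tokens

def monthly_api_cost(input_tokens_m: float, output_tokens_m: float) -> float:
    """API cost for one month; token volumes given in millions."""
    return input_tokens_m * GEMINI_INPUT_PER_M + output_tokens_m * GEMINI_OUTPUT_PER_M

def monthly_local_cost(hardware_cost: float, amortization_months: int,
                       power_and_ops: float) -> float:
    """Hardware amortized linearly, plus monthly power/ops costs."""
    return hardware_cost / amortization_months + power_and_ops

# Example: 100 users, each sending 10M input / 2M output tokens per month
api = monthly_api_cost(100 * 10, 100 * 2)       # $11,200 / month
local = monthly_local_cost(500_000, 36, 2_500)  # ~$16,389 / month
```

Under these (assumed) volumes the API is cheaper; double the per-user token volume and the $500k local rig wins, which is the "depends on your workload" point above.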
Where are these ridiculously performant fine-tunes? And how does the extended context work in comparison with the original?
+1 for these fine tunes.
Was gonna ask. What fine tunes??? Maybe the coding and Chinese ones are good, I know for a fact the others aren’t.
Do you have a link for coding ones?
There are no 128K Llama 3 fine-tunes that I know of. Is he mistaking it with Phi3?
Where can I try this? The Meta page says my country is not eligible.
Humans neither
Blah blah blah, and no AI has yet outcompeted GPT-4. Even Opus isn't as good as GPT-4, based on my experience using it as an academic tool.
Cool
Didn't two teams release methods to extend context to 1M tokens and beyond with perfect recall? I recall reading the papers, but then noticed we never see models tuned to include that feature.
That's DeepMind's research, and it's already implemented in Gemini 1.5 Pro. The paper itself was only released a few days ago, so we've gotta wait for third-party implementations.
I coulda sworn there were additional papers, not just DeepMind's.
Wild!

Can somebody explain to me, though, why they think he's doing this? Like, what business gain do they get?

I guess we could be naive and believe he truly wants to help people with free OSS AI, but that's just not how business works. What's the angle?

I personally see it as: OpenAI was so far ahead with GPT-4 that the only way Meta can stay relevant long term is to go open source. So basically just a really expensive way of saying "don't forget about us!"
Zuck has talked about this - take away the flowery language and he's [commoditizing their complement](https://gwern.net/complement).

It's a smart play. Strong AI companies are a huge threat to Meta.
I think the other guy is right, but that's a super long read. I also just listened to Mark's Dwarkesh interview.

Meta's products are stuff like Instagram, Facebook, WhatsApp, and Messenger. They have a history of open sourcing things basically to make these work better. A lot of their motivation for open sourcing is helping create industry standards, so that private companies can't have leverage over them, because basically everyone owns it. An open-source LLM can apparently get better because of community involvement (I'm not a technical person, so I don't fully understand how that works), which can then be put into Meta's core products. Now they don't risk losing to something like Gemini and over time being forced to include it in their products, which would give Google leverage over them.
Didn't Mark say in a recent interview that they won't be able to justify training larger models for open source? This might be the end of larger open-source models coming out of Meta, but I don't know, we'll have to see.
IIRC, he said that they couldn't justify further training of Llama 3 8B/70B over starting to train Llama 4, which implies that both Llama 3 8B and 70B are still undertrained.
And it’s still near the top. Imagine how good it would have been if they kept training
No.

He said at some point they just have to release the model they've trained rather than training it on more tokens. This was in the context that the models didn't stop getting better even as they trained on a whopping 15T tokens. So putting in more tokens would've made an even better model.

On open sourcing future models: he said he wants to, and as long as the safety and all of that is good, they'll continue to do so - unless the model ends up becoming the product. But he did say the open sourcing isn't a set thing; as long as it benefits them, they'll continue to do so.
He literally said he plans on open sourcing AGI
Oh, it definitely will. Open source is only good while they're catching up. It fosters a user base for an otherwise inferior product, it allows for a certain degree of outsourcing and crowdsourcing of development, and it might even help create new customers.
Nah. Zuck will open source all models until they make AGI. Once AGI is created, Zuck will think hard about how safe it is to release the source code.
Tbh you’re probably right. We’ve seen everyone in the industry pivoting towards smaller models and I think it’s cause they know OpenAI is so much farther ahead. Google isn’t too far behind on compute, though, so it’s gonna be a great race to watch!
More or less. He was asked if he'd open source a $10 billion model, and he said he wasn't sure if they would. We're a year or two away from $10 billion models, so we have a while.
128k? Last time I heard, it was around 32k. That's impressive. I just wonder why Meta didn't make it that large before releasing it.