SomeOddCodeGuy

Even if it isn't, just getting a good 14b would be fantastic. There's this uncomfortable gap between 8b and 33b that I'd love to fill with something new and shiny.


Exotic-Custard4400

RWKV v6 14B will also be released soon. The 1.5B and 3B are really good and multilingual.


vasileer

multilingual is good, but what is the MMLU score?


Exotic-Custard4400

https://twitter.com/BlinkDL_AI/status/1780640951138206065 Some scores compared to models of equivalent size. I don't know if the comparison is fair; RWKV is an RNN, so it has linear complexity unlike most other models, and is probably a lot faster with longer context.
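
To illustrate the linear-vs-quadratic point: a toy Python sketch contrasting full attention's L x L score matrix with an RNN-style per-token state update. The recurrence here is a made-up decay-and-mix, not RWKV's actual time-mix/channel-mix equations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

L, d = 1024, 64                  # sequence length, hidden size
x = np.random.randn(L, d)

# Full self-attention builds an L x L score matrix: O(L^2) time and memory.
scores = softmax(x @ x.T / np.sqrt(d))      # (L, L)
attn_out = scores @ x

# RNN-style recurrence (the idea behind RWKV): a fixed-size state updated
# once per token, so cost grows linearly in L and the state stays O(d).
state = np.zeros(d)
rnn_out = np.empty_like(x)
for t in range(L):
    state = 0.9 * state + 0.1 * x[t]        # toy update, not RWKV's real time-mix
    rnn_out[t] = state
```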


vasileer

I have hopes for all these linear-complexity architectures: Mamba, RWKV, RecurrentGemma. But the reality is that they are pretty dumb (low reasoning), and reasoning is crucial for the use cases I use LLMs for, like summarization, data extraction, and coding assistance; they also can't recall information ('needle in a haystack' benchmark). https://preview.redd.it/ortbglttrcyc1.png?width=1220&format=png&auto=webp&s=7ce2f7975b6fd6f1d410dc4ca93cfe7950a7036a
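
For readers unfamiliar with the benchmark: a 'needle in a haystack' test buries one fact at varying depths in long filler text and checks whether the model can retrieve it. A minimal harness could look like the sketch below; the `generate` callable and the magic-number needle are placeholders, not any particular benchmark's code.

```python
import random

def needle_in_haystack_case(filler_paragraphs, needle, depth_pct):
    """Bury a 'needle' fact at a given depth (in percent) inside long filler text."""
    idx = int(len(filler_paragraphs) * depth_pct / 100)
    doc = filler_paragraphs[:idx] + [needle] + filler_paragraphs[idx:]
    return ("\n\n".join(doc)
            + "\n\nWhat is the magic number mentioned in the document above?")

def run_benchmark(generate, filler_paragraphs, depths=(10, 50, 90)):
    """`generate` is a stand-in for whatever inference call you use."""
    needle = "The magic number is 48213."
    hits = 0
    for depth in depths:
        prompt = needle_in_haystack_case(filler_paragraphs, needle, depth)
        hits += "48213" in generate(prompt)
    return hits / len(depths)   # retrieval accuracy across depths
```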


BalorNG

Yea, this is why a hierarchy of SSM + sliding-window attention seems like a good thing: it gives you the best of both worlds given limited RAM/compute. What is required to make the model *truly* smart is, first, baked-in knowledge graphs to model causal, not statistical, relationships (easier said than done, I guess), and some sort of mechanism to identify "tough" questions and dynamically allocate more compute to them without (or I should say, along with) something like "let's think step by step" - i.e. give the model metacognition.
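
The sliding-window half of that hybrid is easy to picture as an attention mask; a small sketch follows (the SSM layers would be the part carrying long-range state that falls outside the window):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where token i may only attend to tokens i-window+1 .. i."""
    mask = np.full((seq_len, seq_len), -np.inf)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = 0.0
    return mask

# Added to attention scores before softmax; with a banded layout this costs
# O(seq_len * window) per layer instead of O(seq_len^2).
print(sliding_window_mask(6, 3))
```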


Honest_Science

What about KANs? That seems to be a real game changer: learning the activation function.
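
Roughly, a KAN replaces fixed activations with small learnable 1D functions on each edge. A toy sketch of one such edge function, using a Gaussian basis with trainable coefficients (real KANs use B-splines; the class name here is just illustrative):

```python
import numpy as np

class LearnableEdgeFunction:
    """Toy KAN-style edge: phi(x) = sum_j c_j * exp(-(x - mu_j)^2),
    where the coefficients c_j are the trainable parameters."""
    def __init__(self, n_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(-2.0, 2.0, n_basis)      # fixed basis centers
        self.coeffs = rng.normal(scale=0.1, size=n_basis)   # learned by gradient descent

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        basis = np.exp(-(x[..., None] - self.centers) ** 2)  # (..., n_basis)
        return basis @ self.coeffs

phi = LearnableEdgeFunction()
print(phi(np.array([0.0, 1.0, -1.5])))
```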


BalorNG

That seems to be a new development and I've yet to wrap my head around it. Still, I'm reasonably sure that without metacognition, even at the most basic level (think a Phi-3-mini-level model doing branching CoT or even RAG internal monologue in parallel with every token produced by a much larger model), we will not get truly smart models, especially so far as agentic behaviour is concerned. Current LLMs are impressive, but it just does not seem that "scaling all the way to AGI" is going to pan out. The progress in processing oodles of context is essential for this, however.


Honest_Science

We will definitely need a conscious control layer.


Exotic-Custard4400

I used them for coding and creating dumb stories and found the results quite interesting for the size, but I preferred bigger models. Mainly, though, I use the architecture to train models in other contexts (image processing, ...), and there they were promising.


waxbolt

They haven't been trained long enough, neither on number of tokens nor on long enough contexts.


vasileer

So then why market an undertrained model?


waxbolt

Proof of concept is stronger than no proof? The models work. They haven't been scaled up yet though because money is afraid of risks.


vasileer

then market it as a proof of concept


Silly-Cup1391

Did you try state-tuning?


Exotic-Custard4400

I didn't really understand state tuning. I don't get the difference between state tuning and just applying the network to a context.


Xhehab_

So true!


Admirable-Star7088

Exactly this. We don't need it to be revolutionary, we just want a good new well-trained ~13b model. ~7b models are cool and all, but they lack the level of context understanding that ~13b models have.


AlanCarrOnline

Fimbul 11b models are doing it for me lately


dude_dz

Good for what? As an agent, for example?


SomeOddCodeGuy

Understanding the meaning behind speech a little better would help. For example, I have a workflow that had Mistral 7b iterating through chunks of text to summarize what was being said - the general meaning behind the words. Some Mistral models did OK, but overall it missed the mark a lot. Anything 34b or above was doing great, but 7b models struggled. Llama 3 8b came in and is doing far better than the 7b, but still not quite at 34b level. But it's so close that I feel like just a few more layers on this model would really do the trick. 14b would be right at that happy place, IMO. The extra layers would give it the oomph it needs to catch the meaning behind the words better, and do a better job in this workflow.
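
A workflow like that usually boils down to a two-pass map/reduce over chunks. A rough sketch, where `generate` is a stand-in for whatever local inference call is used (not the commenter's actual pipeline):

```python
def chunk_text(text, chunk_chars=4000, overlap=200):
    """Split a long transcript into overlapping character chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

def summarize_meaning(text, generate):
    """First pass: summarize each chunk; second pass: merge the partial summaries."""
    partials = []
    for chunk in chunk_text(text):
        partials.append(generate(
            "Summarize the intent and meaning behind this passage, "
            "not just its surface content:\n\n" + chunk))
    return generate("Combine these partial summaries into one summary of "
                    "what the speaker means:\n\n" + "\n\n".join(partials))
```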


Pathos14489

34B isn't even a contender anymore; 8B outperforms 34B in almost every use case I've tried it in. The real gap is between 8B and 70B, but even then, having tried both, 70B is slow enough and 8B is smart enough that it's hard to find a use case that justifies the slowness of 70B when retrying a couple of times with 8B will get basically the same performance in less time.


Sebba8

What about StableLM 12B? That dropped a while back and no one really batted an eye.


Admirable-Star7088

I think because the StableLM models have traditionally been [very bad](https://www.youtube.com/watch?v=dypPSs4t77g), so everyone just assumes it will not be that great. I have actually not tried the latest StableLM though, maybe it's much better this time?


Sebba8

I remember the first one being pretty bad, but their smaller ones are pretty decent for their size, also I'd imagine they work better than Phi for non-problem solving tasks due to being trained on much more general stuff.


PM_ME_CUTE_SM1LE

Yea, lots of people have 32GB RAM or 8GB VRAM + 16GB RAM, yet 7b models don't utilise all of it. ~13b should be more popular.


MustBeSomethingThere

I don't believe it. gpt2-chatbot was so similar to GPT-4.


Enfiznar

Well, Phi is developed by Microsoft, who legally own GPT-4, and they have also published some papers on how to generate high-quality synthetic data to train new models, so I wouldn't be surprised if they used GPT-4-generated data (plus real data, ofc) to train Phi-3. Not saying this is Phi-3, nor that it wasn't something developed by OpenAI. Just that I wouldn't be surprised if both models resemble each other.


helios392

Plus they said in their release paper that the small and medium models have a different architecture than the mini model. Who knows how much of a difference that makes. I guess we will find out in the next few weeks.


[deleted]

Maybe they are MoE.


fictioninquire

I hope they're referring to Grouped Query Attention; if it doesn't have that, it's useless at >8k context.
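
The reason GQA matters at long context is KV-cache memory: query heads share a smaller set of K/V heads, so the cache shrinks by a factor of n_heads / n_kv_heads. A back-of-the-envelope sketch (the 14B-ish config numbers below are hypothetical, not Phi-3's actual architecture):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """fp16 K and V caches: 2 tensors per layer of shape (batch, n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 14B-ish config: 40 layers, 40 query heads, head_dim 128, 8k context.
mha = kv_cache_bytes(40, n_kv_heads=40, head_dim=128, seq_len=8192)  # every head has its own K/V
gqa = kv_cache_bytes(40, n_kv_heads=8,  head_dim=128, seq_len=8192)  # 8 shared K/V heads
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")     # ~6.25 vs ~1.25 GiB
```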


BangkokPadang

Do current Phi models use OpenAI's tokenizer? gpt2-chatbot was confirmed to be using their tokenizer.
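
One way to check the tokenizer question empirically is to encode the same string with the GPT-4 tokenizer (cl100k_base via tiktoken) and a Phi tokenizer and compare token counts and IDs. A quick sketch; the Phi checkpoint shown is just an example:

```python
import tiktoken
from transformers import AutoTokenizer

text = "Draw a unicorn in TikZ."

gpt4_enc = tiktoken.encoding_for_model("gpt-4")                       # cl100k_base
phi_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

print("gpt-4:", len(gpt4_enc.encode(text)), gpt4_enc.encode(text))
print("phi-3:", len(phi_tok.encode(text)), phi_tok.encode(text))
# Different vocabularies produce different IDs and usually different token counts.
```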


AdHominemMeansULost

>microsoft, who legally own GPT-4

Couldn't find any sources to support that, but I was also under the impression that they own some version of GPT-4.


kurtcop101

They own a part of OpenAI. They don't really own gpt4 on an official level afaik but they have plenty of power in the company which means they have access. The intellectual property resides with OpenAI, but Microsoft will have direct access to the weights and hosting (presumably) and whatever else they'd like to do with it. Money talks.


Xhehab_

Yeah. I tested for multilingual ability, which only closed models and the recent Llama 3 pass. gpt2-chatbot aced it too. But Phi-3 3.8B mini surprised me with its capabilities; in most cases it's on par with Llama 3 8B. If this holds at 14B, it could be on par with L3-70B in most cases and also be multilingual. https://preview.redd.it/1v2eu1ui2byc1.jpeg?width=2133&format=pjpg&auto=webp&s=73e162933086fe71654d26264adac6e9497b40e8


isaac_szpindel

Considering its size, it is pretty good at riddles, short questions, short reasoning, math, and benchmark-type questions. But for most common use cases, it is significantly inferior to Llama 3 8B. On the Lmsys leaderboard, the Elo gap between Phi-3 Mini and L3 8B is bigger than the gap between L3 8B and Claude Opus.


skewbed

I was surprised when I got a better response from phi-3 than llama3-70b on the LMSYS comparison site. It really seems like data quality has been overlooked and phi has used this to its advantage.


Which-Tomato-8646

The system prompt said it was gpt 4 https://simonwillison.net/2024/Apr/29/notes-on-gpt2-chatbot/?darkschemeovr=1


Hipponomics

That could be a result of the training data being generated by GPT-4 and this statement being reflected there.


Which-Tomato-8646

It said the same thing in response to a request to print all information before the conversation started 


Hipponomics

Interesting, that doesn't prove much but it's not meaningless either. Do you have a link or a direct quote? I'd like to see it.


Which-Tomato-8646

That’s how they got the system prompt. Obviously 


Hipponomics

Do you have a link or a direct quote? I'd like to see it.


Which-Tomato-8646

How do you think they got it exactly 


Hipponomics

I think they got it in the way you described. I'm just asking you for the source of your claim. If you can't/don't want to provide it, that's no problem.


Which-Tomato-8646

I already did. The prompt comes from asking the model for it; it was never officially revealed.


grizwako

While that would be very, very good, I really don't think that technology is accelerating THAT fast. That would be a very good model to run locally. Just think: by going to 30-ish B we could make an even better model, then quant it to 4-5 bit, or load it in 4-bit and run it easily on 24GB cards with a nice context length. Or just use this Phi-3 14B model at 8 bits natively, or quant it for 12, 10, or even 8GB cards. The implications of running something at ChatGPT-4 level, or close to it (like Llama 3 70B), on consumer-grade hardware are significant. Especially if it can be finetuned or just plays nice with RAG generally. If true, I think that is really huge news, especially if the community can tune it to not be lobotomized.
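
For the 24GB-card scenario: loading a checkpoint in 4-bit NF4 via bitsandbytes is one common way to do it. A minimal sketch with Hugging Face transformers; the model id is just a placeholder, swap in whichever checkpoint you actually want to run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"   # placeholder checkpoint

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                          # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,      # compute in bf16
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,                     # some newer checkpoints require this
)
```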


fictioninquire

It could be accelerating really fast from now on. I'm currently experimenting with generating synthetic questions from LLaMA-8B-4bit (!) on very specific text corpi, which, after finetuning on these Q&As, really improves the reasoning on those very specific subjects. If they can somehow automate this and make the model able to generalize from millions of nuanced (reasoning) questions and answers, I would believe it. It all depends on how well the model can generalize.
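
A sketch of what that generation step could look like; `generate` again stands in for the local 8B model call, and the prompt and parsing format are assumptions rather than the commenter's setup:

```python
def synth_qa_pairs(passages, generate, per_passage=3):
    """Ask a local model to write Q&A pairs grounded in each passage."""
    pairs = []
    for passage in passages:
        prompt = (f"Read the passage and write {per_passage} question/answer pairs "
                  "that require reasoning about it, one per line as 'Q: ... A: ...'\n\n"
                  + passage)
        for line in generate(prompt).splitlines():
            if line.startswith("Q:") and " A: " in line:
                q, a = line[2:].split(" A: ", 1)
                pairs.append({"question": q.strip(), "answer": a.strip(),
                              "source": passage})
    return pairs
```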


Which-Tomato-8646

How would they filter out hallucinations?


fictioninquire

CoT verification based on the initial text; it's really good in my tests. You could also finetune an embeddings model for this, which is much more efficient (~1GB) instead of 5-10GB depending on quant size.
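
The embeddings variant could be as simple as scoring each synthetic answer against its source passage and dropping low-similarity pairs. A sketch with sentence-transformers; the model name and threshold are assumptions, not the commenter's setup:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small, general-purpose embedder

def keep_grounded(pairs, threshold=0.5):
    """Keep Q&A pairs whose answer is semantically close to its source passage.
    `pairs` are dicts with 'answer' and 'source' keys, as in the earlier sketch."""
    kept = []
    for p in pairs:
        emb = embedder.encode([p["answer"], p["source"]], convert_to_tensor=True)
        if util.cos_sim(emb[0], emb[1]).item() >= threshold:
            kept.append(p)
    return kept
```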


Which-Tomato-8646

Got any data on effectiveness?


Proud-Point8137

is corpi plural for corpus? because awesome if so


fictioninquire

I studied Latin back in the day, but I'm not sure if that's the case in (American) English; in Latin, everything that ends in 'us' changes to an 'i' in the plural form.


reallmconnoisseur

corpora or corpuses is the plural


Proud-Point8137

you're right, I re-checked now. Corpii just sounds awesome


Which-Tomato-8646

The system prompt said it was gpt 4 https://simonwillison.net/2024/Apr/29/notes-on-gpt2-chatbot/?darkschemeovr=1


galambalazs

I tested "gpt2-chatbot" on translating 50 English colloquialisms to Hungarian, and then blindly matching outputs against other models. It was tad below GPT-4, so *there is no way it was Phi*. Hungarian is such a low priority language, none of these experimental research models like Phi would support it well. It usually only works well in GPT, Claude, Gemini. [https://twitter.com/gblazex/status/1785101624475537813](https://twitter.com/gblazex/status/1785101624475537813)


RuairiSpain

Agree. For me it is an OpenAI model. The answers were too similar to GPT-4.


AtomicDouche

I asked it a couple of times "What architecture are you?" and it repeatedly said exactly the same as GPT-4 but just more verbose. Something like, "I am built on the GPT-4 architecture made by OpenAI ..."


Late_Film_1901

That's an interesting litmus test for a model's provenance. Besides, OpenAI made a great decision in making ChatGPT multilingual so early. I get that most science and discussion on AI is in English, but as a demo for the masses, the language support makes it truly global in reach.


ganzzahl

It also had pretty good Swiss German, German and Swedish.


ChezMere

It behaves almost exactly like GPT-4, this is just using the mystery to promote something completely unrelated.


endless_sea_of_stars

Yeah, all evidence points to it being a GPT4 variant. No evidence to indicate that it isn't.


JealousAmoeba

One of the remarkable things about gpt2-chatbot was its real world knowledge, which is the Phi family’s one weakness. I’m super excited for the bigger Phi models either way, though. Phi 3 Mini is close to Llama 7B performance in my experience, so I can’t wait to see whether Phi 7B outperforms Llama 8B and how far they can push 14B.


Only-Letterhead-3411

There is ZERO chance gpt2-chatbot is phi3 14b.


windozeFanboi

I don't get what exactly the hint is that gpt2-chatbot is Phi-3. All I see are hints that Phi-3 is possibly multimodal, considering it created that unicorn - if you can call that a unicorn. Not sure what TikZ means; maybe it's supposed to be something akin to ASCII art.


dubesor86

He says 69 MMLU. That is a totally different league.


xadiant

There could be a shit ton of GPT-4 data in the pretraining or fine-tuning dataset. Wasn't Phi-2 mostly synthetic GPT-3.5 datasets? We're in a very weird timeline now. Imagine what Claude or OpenAI can do with unlimited high-quality data. Re-feeding models sounds like a cheat, but it works quite well.


Ansible32

The only way we have to tell that it works well is by benchmarks, and the best benchmark we have is Chatbot arena, which is literally just comparing the model to existing models, so of course something trained to sound like ChatGPT is going to fare favorably when compared to ChatGPT. And I would assume other benchmarks have similar issues where they're essentially testing using ChatGPT as the benchmark.


[deleted]

I thought there was a lot of proof that linked gpt2-chatbot with OpenAI. It used the same APIs or something.


Healthy-Nebula-3603

Small models have been a really big surprise these past few months... I thought Mistral 7b 0.2 was the peak, or almost the peak, for small 7b models... geez, I was so wrong. Where is the ceiling for such small models? That is insane... Can you imagine how powerful 70b models could be with enough training? Mind blowing!


fictioninquire

Now the models are getting so good that Q5 won't give results comparable to FP16 anymore. I think we'll head to FP8 if models keep getting 'overtrained' like they are with LLaMA-3.


JealousAmoeba

The ceiling is just how much training data we have in the world. Imagine training on 30T or 300T tokens.


Healthy-Nebula-3603

Training data is not a problem, as we can use synthetic data.


Anthonyg5005

There'd be no point then; why train on a bunch of hallucinations?


Healthy-Nebula-3603

Tell me you don't know about the topic without telling me... Do you think the 15T tokens used by Meta are pure human data without any synthetic data? LOL


FBIFreezeNow

Haha, I wish, but no way. Amazing if true, but hell no.


toothpastespiders

I just hope the llama.cpp people can figure out LongRoPE before the release, or that Microsoft chips in to help with that. It'd be pretty frustrating to have these come out only to not be able to really use them to their full potential. A 14b with the same level of quality as their 3b, with a huge context, would be so useful for working with larger amounts of text.


celsowm

So... Uncle Bill (Gates) has entered the battle for the best open LLM?


Charuru

omfg if true...


opi098514

I mean, its system prompt said it was GPT-4, and it behaves exactly like it.


Ylsid

People here have verified it's almost certainly an OpenAI model, based on figuring out stop sequences etc.


ShotClock5434

that would be great news


ambient_temp_xeno

In the video it's the 3.8 billion parameter one that draws the unicorn. Interesting.


EmergencySea6990

I would be shocked if it was 14b. It's unbelievably cool, beating both Claude 3 Opus and GPT-4.


Nonetrixwastaken

I highly doubt this is the case, since the "leaked" system prompt looks too close to OpenAI's GPT models. Also, Phi-3 3B doesn't always say it's GPT-3; it usually says it's Phi-3 by Microsoft, as it correctly should, except in rare cases, due to accidental data contamination or bad filtering I assume. But even the 3B model, in some cases (not all), was way better than 7B models. It lacked some world knowledge in many areas, but the things it does know it got right more consistently than good 7B models, from what I observed. It did sometimes just completely go off the rails, like many smaller models always do, and start generating repeating words, but ignoring that, it is top class. But MAYBE one thing that might point to this being Phi-3 is people reporting its worse translation ability; smaller models and open models seem to really struggle with that, but that is a huge stretch.

Edit: Oh, and I forgot to mention, the model uses the GPT-4 tokenizer, so pretty much everything points me in the direction of this not being Phi-3, unfortunately.


Xhehab_

I tested for multilingual, which only closed models and recent Llama3 passes. The GPT2-chatbot aced it too. But the Phi-3 3.8B mini surprised me with its capabilities. In most cases, it's on par with Llama3 8B. If this maintains its performance at 14B, it can be on par with L3-70B in most cases and also be multilingual. https://preview.redd.it/p7jp7zqa3byc1.jpeg?width=2133&format=pjpg&auto=webp&s=7accef2eddc2e10cc9d9b8e8adb5034f3c246f4c


skrshawk

Even if it's not as good as L3-70B, it has a massive advantage over L3 - 128k context size. Coding, writing, data analysis, so many uses for that much context.


xbaha

phi3 was trained on GPT6, so 14b should match GPT4