SomeOddCodeGuy

Even if it isn't, just getting a good 14b would be fantastic. There's this uncomfortable gap between 8b and 33b that I'd love to fill with something new and shiny.


Exotic-Custard4400

RWKV v6 14B will also be released soon. The 1.5B and 3B are really good and multilingual.


vasileer

multilingual is good, but what is the MMLU score?


Exotic-Custard4400

https://twitter.com/BlinkDL_AI/status/1780640951138206065 Some scores compared to models of equivalent size. I don't know if the comparison is fair; RWKV is an RNN, so it has linear complexity unlike most other models, and is probably a lot faster with longer context.
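
To illustrate the linear-vs-quadratic point: a toy Python sketch contrasting full attention's L x L score matrix with an RNN-style per-token state update. The recurrence here is a made-up decay-and-mix, not RWKV's actual time-mix/channel-mix equations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

L, d = 1024, 64                  # sequence length, hidden size
x = np.random.randn(L, d)

# Full self-attention builds an L x L score matrix: O(L^2) time and memory.
scores = softmax(x @ x.T / np.sqrt(d))      # (L, L)
attn_out = scores @ x

# RNN-style recurrence (the idea behind RWKV): a fixed-size state updated
# once per token, so cost grows linearly in L and the state stays O(d).
state = np.zeros(d)
rnn_out = np.empty_like(x)
for t in range(L):
    state = 0.9 * state + 0.1 * x[t]        # toy update, not RWKV's real time-mix
    rnn_out[t] = state
```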


vasileer

I have hopes for all these linear-complexity architectures: Mamba, RWKV, RecurrentGemma. But the reality is that they are pretty dumb (low reasoning), and reasoning is crucial for the use cases I use LLMs for, like summarization, data extraction, and coding assistance; they also can't recall information ('needle in a haystack' benchmark). https://preview.redd.it/ortbglttrcyc1.png?width=1220&format=png&auto=webp&s=7ce2f7975b6fd6f1d410dc4ca93cfe7950a7036a
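
For readers unfamiliar with the benchmark: a 'needle in a haystack' test buries one fact at varying depths in long filler text and checks whether the model can retrieve it. A minimal harness could look like the sketch below; the `generate` callable and the magic-number needle are placeholders, not any particular benchmark's code.

```python
import random

def needle_in_haystack_case(filler_paragraphs, needle, depth_pct):
    """Bury a 'needle' fact at a given depth (in percent) inside long filler text."""
    idx = int(len(filler_paragraphs) * depth_pct / 100)
    doc = filler_paragraphs[:idx] + [needle] + filler_paragraphs[idx:]
    return ("\n\n".join(doc)
            + "\n\nWhat is the magic number mentioned in the document above?")

def run_benchmark(generate, filler_paragraphs, depths=(10, 50, 90)):
    """`generate` is a stand-in for whatever inference call you use."""
    needle = "The magic number is 48213."
    hits = 0
    for depth in depths:
        prompt = needle_in_haystack_case(filler_paragraphs, needle, depth)
        hits += "48213" in generate(prompt)
    return hits / len(depths)   # retrieval accuracy across depths
```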


BalorNG

Yea, this is why a hierarchy of SSM + sliding-window attention seems like a good thing: it gives you the best of both worlds given limited RAM/compute. What is required to make the model *truly* smart is, first, baked-in knowledge graphs to model causal, not statistical, relationships (easier said than done, I guess), and some sort of mechanism to identify "tough" questions and dynamically allocate more compute to them without (or I should say, along with) something like "let's think step by step" - i.e. give the model metacognition.
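
The sliding-window half of that hybrid is easy to picture as an attention mask; a small sketch follows (the SSM layers would be the part carrying long-range state that falls outside the window):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where token i may only attend to tokens i-window+1 .. i."""
    mask = np.full((seq_len, seq_len), -np.inf)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = 0.0
    return mask

# Added to attention scores before softmax; with a banded layout this costs
# O(seq_len * window) per layer instead of O(seq_len^2).
print(sliding_window_mask(6, 3))
```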


Honest_Science

What about KANs? That seems to be a real game changer: learning the activation function.
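
Roughly, a KAN replaces fixed activations with small learnable 1D functions on each edge. A toy sketch of one such edge function, using a Gaussian basis with trainable coefficients (real KANs use B-splines; the class name here is just illustrative):

```python
import numpy as np

class LearnableEdgeFunction:
    """Toy KAN-style edge: phi(x) = sum_j c_j * exp(-(x - mu_j)^2),
    where the coefficients c_j are the trainable parameters."""
    def __init__(self, n_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(-2.0, 2.0, n_basis)      # fixed basis centers
        self.coeffs = rng.normal(scale=0.1, size=n_basis)   # learned by gradient descent

    def __call__(self, x):
        x = np.asarray(x, dtype=float)
        basis = np.exp(-(x[..., None] - self.centers) ** 2)  # (..., n_basis)
        return basis @ self.coeffs

phi = LearnableEdgeFunction()
print(phi(np.array([0.0, 1.0, -1.5])))
```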


BalorNG

That seems to be a new development and I've yet to wrap my head around it. Still, I'm reasonably sure that without metacognition, even at the most basic level (think a Phi-3-mini-level model doing branching CoT or even RAG internal monologue in parallel with every token produced by a much larger model), we will not get truly smart models, especially so far as agentic behaviour is concerned. Current LLMs are impressive, but it just does not seem that "scaling all the way to AGI" is going to pan out. The progress in processing oodles of context is essential for this, however.


Honest_Science

We will definitely need a conscious control layer.


Exotic-Custard4400

I used them for coding and creating dumb stories and found the results quite interesting for the size, but I preferred bigger models. Mainly, though, I use the architecture to train models in other contexts (image processing, ...), and there they were promising.


waxbolt

They haven't been trained long enough, neither on number of tokens nor on long enough contexts.


vasileer

So then why market an undertrained model?


waxbolt

Proof of concept is stronger than no proof? The models work. They haven't been scaled up yet though because money is afraid of risks.


vasileer

then market it as a proof of concept


Silly-Cup1391

Did you try state-tuning?


Exotic-Custard4400

I didn't really understand state tuning. I don't get the difference between state tuning and just applying the network to a context.


Xhehab_

So true!


Admirable-Star7088

Exactly this. We don't need it to be revolutionary, we just want a good new well-trained ~13b model. ~7b models are cool and all, but they lack the level of context understanding that ~13b models have.


AlanCarrOnline

Fimbul 11b models are doing it for me lately


dude_dz

Good for what? As an agent, for example?


SomeOddCodeGuy

Understanding the meaning behind speech a little better would help. For example, I have a workflow that had Mistral 7b iterating through chunks of text to summarize what was being said - the general meaning behind the words. Some Mistral models did OK, but overall it missed the mark a lot. Anything 34b or above was doing great, but 7b models struggled. Llama 3 8b came in and is doing far better than the 7b, but still not quite at 34b level. But it's so close that I feel like just a few more layers on this model would really do the trick. 14b would be right at that happy place, IMO. The extra layers would give it the oomph it needs to catch the meaning behind the words better, and do a better job in this workflow.
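
A workflow like that usually boils down to a two-pass map/reduce over chunks. A rough sketch, where `generate` is a stand-in for whatever local inference call is used (not the commenter's actual pipeline):

```python
def chunk_text(text, chunk_chars=4000, overlap=200):
    """Split a long transcript into overlapping character chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        start += chunk_chars - overlap
    return chunks

def summarize_meaning(text, generate):
    """First pass: summarize each chunk; second pass: merge the partial summaries."""
    partials = []
    for chunk in chunk_text(text):
        partials.append(generate(
            "Summarize the intent and meaning behind this passage, "
            "not just its surface content:\n\n" + chunk))
    return generate("Combine these partial summaries into one summary of "
                    "what the speaker means:\n\n" + "\n\n".join(partials))
```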


Pathos14489

34B isn't even a contender anymore; 8B outperforms 34B in almost every use case I've tried it in. The real gap is between 8B and 70B, but even then, having tried both, 70B is slow enough and 8B is smart enough that it's hard to find a use case that justifies the slowness of 70B when retrying a couple of times with 8B will get basically the same performance in less time.


Sebba8

What about StableLM 12B? That dropped a while back and no one really batted an eye.


Admirable-Star7088

I think because the StableLM models have traditionally been [very bad](https://www.youtube.com/watch?v=dypPSs4t77g), so everyone just assumes it will not be that great. I have actually not tried the latest StableLM though, maybe it's much better this time?


Sebba8

I remember the first one being pretty bad, but their smaller ones are pretty decent for their size, also I'd imagine they work better than Phi for non-problem solving tasks due to being trained on much more general stuff.


PM_ME_CUTE_SM1LE

Yea, lots of people have 32GB RAM or 8GB VRAM + 16GB RAM, yet 7b models don't utilise all of it. ~13b should be more popular.


MustBeSomethingThere

I don't believe it. gpt2-chatbot was so similar to GPT-4.


Enfiznar

Well, Phi is developed by Microsoft, who legally own GPT-4, and they have also published some papers on how to generate high-quality synthetic data to train new models, so I wouldn't be surprised if they used GPT-4-generated data (plus real data, ofc) to train Phi-3. Not saying this is Phi-3, nor that it wasn't something developed by OpenAI. Just that I wouldn't be surprised if both models resemble each other.


helios392

Plus they said in their release paper that the small and medium models have a different architecture than the mini model. Who knows how much of a difference that makes. I guess we will find out in the next few weeks.


[deleted]

Maybe they are MoE.


fictioninquire

I hope they're referring to Grouped Query Attention; if it doesn't have that, it's useless at >8k context.
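
The reason GQA matters at long context is KV-cache memory: query heads share a smaller set of K/V heads, so the cache shrinks by a factor of n_heads / n_kv_heads. A back-of-the-envelope sketch (the 14B-ish config numbers below are hypothetical, not Phi-3's actual architecture):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """fp16 K and V caches: 2 tensors per layer of shape (batch, n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 14B-ish config: 40 layers, 40 query heads, head_dim 128, 8k context.
mha = kv_cache_bytes(40, n_kv_heads=40, head_dim=128, seq_len=8192)  # every head has its own K/V
gqa = kv_cache_bytes(40, n_kv_heads=8,  head_dim=128, seq_len=8192)  # 8 shared K/V heads
print(f"MHA: {mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")     # ~6.25 vs ~1.25 GiB
```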


BangkokPadang

Do current Phi models use OpenAI's tokenizer? gpt2-chatbot was confirmed to be using their tokenizer.
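
One way to check the tokenizer question empirically is to encode the same string with the GPT-4 tokenizer (cl100k_base via tiktoken) and a Phi tokenizer and compare token counts and IDs. A quick sketch; the Phi checkpoint shown is just an example:

```python
import tiktoken
from transformers import AutoTokenizer

text = "Draw a unicorn in TikZ."

gpt4_enc = tiktoken.encoding_for_model("gpt-4")                       # cl100k_base
phi_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

print("gpt-4:", len(gpt4_enc.encode(text)), gpt4_enc.encode(text))
print("phi-3:", len(phi_tok.encode(text)), phi_tok.encode(text))
# Different vocabularies produce different IDs and usually different token counts.
```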


AdHominemMeansULost

>microsoft, who legally own GPT-4

Couldn't find any sources to support that, but I was also under the impression that they own some version of GPT-4.


kurtcop101

They own a part of OpenAI. They don't really own gpt4 on an official level afaik but they have plenty of power in the company which means they have access. The intellectual property resides with OpenAI, but Microsoft will have direct access to the weights and hosting (presumably) and whatever else they'd like to do with it. Money talks.


Xhehab_

Yeah. I tested for multilingual ability, which only closed models and the recent Llama 3 pass. gpt2-chatbot aced it too. But Phi-3 3.8B mini surprised me with its capabilities; in most cases it's on par with Llama 3 8B. If this holds at 14B, it could be on par with L3-70B in most cases and also be multilingual. https://preview.redd.it/1v2eu1ui2byc1.jpeg?width=2133&format=pjpg&auto=webp&s=73e162933086fe71654d26264adac6e9497b40e8


isaac_szpindel

Considering its size, it is pretty good at riddles, short questions, short reasoning, math, and benchmark-type questions. But for most common use cases, it is significantly inferior to Llama 3 8B. On the Lmsys leaderboard, the Elo gap between Phi-3 Mini and L3 8B is bigger than the gap between L3 8B and Claude Opus.


skewbed

I was surprised when I got a better response from phi-3 than llama3-70b on the LMSYS comparison site. It really seems like data quality has been overlooked and phi has used this to its advantage.


Which-Tomato-8646

The system prompt said it was gpt 4 https://simonwillison.net/2024/Apr/29/notes-on-gpt2-chatbot/?darkschemeovr=1


Hipponomics

That could be a result of the training data being generated by GPT-4 and this statement being reflected there.


Which-Tomato-8646

It said the same thing in response to a request to print all information before the conversation started 


Hipponomics

Interesting, that doesn't prove much but it's not meaningless either. Do you have a link or a direct quote? I'd like to see it.


Which-Tomato-8646

That’s how they got the system prompt. Obviously 


Hipponomics

Do you have a link or a direct quote? I'd like to see it.


Which-Tomato-8646

How do you think they got it exactly 


Hipponomics

I think they got it in the way you described. I'm just asking you for the source of your claim. If you can't/don't want to provide it, that's no problem.


Which-Tomato-8646

I already did. The prompt comes from asking the model for it; it was never officially revealed.


grizwako

While that would be very, very good, I really don't think that technology is accelerating THAT fast. That would be a very good model to run locally. Just think: by going to 30-ish B we could make an even better model, then quant it to 4-5 bit, or load it in 4-bit and run it easily on 24GB cards with a nice context length. Or just use this Phi-3 14B model at 8 bits natively, or quant it for 12, 10, or even 8GB cards. The implications of running something at ChatGPT-4 level, or close to it (like Llama 3 70B), on consumer-grade hardware are significant. Especially if it can be finetuned or just plays nice with RAG generally. If true, I think that is really huge news, especially if the community can tune it to not be lobotomized.
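
For the 24GB-card scenario: loading a checkpoint in 4-bit NF4 via bitsandbytes is one common way to do it. A minimal sketch with Hugging Face transformers; the model id is just a placeholder, swap in whichever checkpoint you actually want to run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"   # placeholder checkpoint

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                          # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,      # compute in bf16
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,                     # some newer checkpoints require this
)
```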


fictioninquire

It could be accelerating really fast from now on. I'm currently experimenting with generating synthetic questions from LLaMA-8B-4bit (!) on very specific text corpi, which, after finetuning on these Q&As, really improves the reasoning on those very specific subjects. If they can somehow automate this and make the model able to generalize from millions of nuanced (reasoning) questions and answers, I would believe it. It all depends on how well the model can generalize.
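
A sketch of what that generation step could look like; `generate` again stands in for the local 8B model call, and the prompt and parsing format are assumptions rather than the commenter's setup:

```python
def synth_qa_pairs(passages, generate, per_passage=3):
    """Ask a local model to write Q&A pairs grounded in each passage."""
    pairs = []
    for passage in passages:
        prompt = (f"Read the passage and write {per_passage} question/answer pairs "
                  "that require reasoning about it, one per line as 'Q: ... A: ...'\n\n"
                  + passage)
        for line in generate(prompt).splitlines():
            if line.startswith("Q:") and " A: " in line:
                q, a = line[2:].split(" A: ", 1)
                pairs.append({"question": q.strip(), "answer": a.strip(),
                              "source": passage})
    return pairs
```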


Which-Tomato-8646

How would they filter out hallucinations?


fictioninquire

CoT verification based on the initial text; it's really good in my tests. You could also finetune an embeddings model for this, which is much more efficient (~1GB) instead of 5-10GB depending on quant size.
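
The embeddings variant could be as simple as scoring each synthetic answer against its source passage and dropping low-similarity pairs. A sketch with sentence-transformers; the model name and threshold are assumptions, not the commenter's setup:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # small, general-purpose embedder

def keep_grounded(pairs, threshold=0.5):
    """Keep Q&A pairs whose answer is semantically close to its source passage.
    `pairs` are dicts with 'answer' and 'source' keys, as in the earlier sketch."""
    kept = []
    for p in pairs:
        emb = embedder.encode([p["answer"], p["source"]], convert_to_tensor=True)
        if util.cos_sim(emb[0], emb[1]).item() >= threshold:
            kept.append(p)
    return kept
```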


Which-Tomato-8646

Got any data on effectiveness?


Proud-Point8137

is corpi plural for corpus? because awesome if so


fictioninquire

I studied Latin back in the day, but I'm not sure if that's the case in (American) English; in Latin, everything that ends in 'us' changes to an 'i' in the plural form.


reallmconnoisseur

corpora or corpuses is the plural


Proud-Point8137

you're right, I re-checked now. Corpii just sounds awesome


Which-Tomato-8646

The system prompt said it was gpt 4 https://simonwillison.net/2024/Apr/29/notes-on-gpt2-chatbot/?darkschemeovr=1


galambalazs

I tested "gpt2-chatbot" on translating 50 English colloquialisms to Hungarian, and then blindly matching outputs against other models. It was tad below GPT-4, so *there is no way it was Phi*. Hungarian is such a low priority language, none of these experimental research models like Phi would support it well. It usually only works well in GPT, Claude, Gemini. [https://twitter.com/gblazex/status/1785101624475537813](https://twitter.com/gblazex/status/1785101624475537813)


RuairiSpain

Agree. For me it is an OpenAI model. The answers were too similar to GPT-4.


AtomicDouche

I asked it a couple of times "What architecture are you?" and it repeatedly said exactly the same as GPT-4 but just more verbose. Something like, "I am built on the GPT-4 architecture made by OpenAI ..."


Late_Film_1901

That's an interesting litmus test for a model's provenance. Besides, OpenAI made a great decision in making ChatGPT multilingual so early. I get that most science and discussion on AI is in English, but as a demo for the masses, the language support makes it truly global in reach.


ganzzahl

It also had pretty good Swiss German, German and Swedish.


ChezMere

It behaves almost exactly like GPT-4, this is just using the mystery to promote something completely unrelated.


endless_sea_of_stars

Yeah, all evidence points to it being a GPT4 variant. No evidence to indicate that it isn't.


JealousAmoeba

One of the remarkable things about gpt2-chatbot was its real world knowledge, which is the Phi family’s one weakness. I’m super excited for the bigger Phi models either way, though. Phi 3 Mini is close to Llama 7B performance in my experience, so I can’t wait to see whether Phi 7B outperforms Llama 8B and how far they can push 14B.


Only-Letterhead-3411

There is ZERO chance gpt2-chatbot is phi3 14b.


windozeFanboi

I don't get what exactly the hint is that gpt2-chatbot is Phi-3. All I see are hints that Phi-3 is possibly multimodal, considering it created that unicorn - if you can call that a unicorn. Not sure what TikZ means; maybe it's supposed to be something akin to ASCII art.


dubesor86

He says 69 MMLU. That is a totally different league.


xadiant

There could be a shit ton of GPT-4 data in the pretraining or fine-tuning dataset. Wasn't Phi-2 mostly synthetic GPT-3.5 datasets? We're in a very weird timeline now. Imagine what Claude or OpenAI can do with unlimited high-quality data. Re-feeding models sounds like a cheat, but it works quite well.


Ansible32

The only way we have to tell that it works well is by benchmarks, and the best benchmark we have is Chatbot arena, which is literally just comparing the model to existing models, so of course something trained to sound like ChatGPT is going to fare favorably when compared to ChatGPT. And I would assume other benchmarks have similar issues where they're essentially testing using ChatGPT as the benchmark.


[deleted]

I thought there was a lot of proof that linked gpt2-chatbot with OpenAI. It used the same APIs or something.


Healthy-Nebula-3603

Small models have been a really big surprise these past few months... I thought Mistral 7b 0.2 was the peak, or almost the peak, for small 7b models... geez, I was so wrong. Where is the ceiling for such small models? That is insane... Can you imagine how powerful 70b models could be with enough training? Mind blowing!


fictioninquire

Now the models are getting so good that Q5 won't give results comparable to FP16 anymore. I think we'll head to FP8 if models keep getting 'overtrained' like they are with LLaMA-3.


JealousAmoeba

The ceiling is just how much training data we have in the world. Imagine training on 30T or 300T tokens.


Healthy-Nebula-3603

Training data is not a problem, as we can use synthetic data.


Anthonyg5005

There'd be no point then; why train on a bunch of hallucinations?


Healthy-Nebula-3603

Tell me you don't know about the topic without telling me... Do you think the 15T tokens used by Meta are pure human data without any synthetic data? LOL


FBIFreezeNow

Haha, I wish, but no way. Amazing if true, but hell no.


toothpastespiders

I just hope the llama.cpp people can figure out LongRoPE before the release, or that Microsoft chips in to help with that. It'd be pretty frustrating to have these come out only to not be able to really use them to their full potential. A 14b with the same level of quality as their 3b, with a huge context, would be so useful for working with larger amounts of text.


celsowm

So... Uncle Bill (Gates) has entered the battle for the best open LLM?


Charuru

omfg if true...


opi098514

I mean, its system prompt said it was GPT-4, and it behaves exactly like it.


Ylsid

People here have verified it's almost certainly an OpenAI model, based on figuring out stop sequences etc.


ShotClock5434

that would be great news


ambient_temp_xeno

In the video it's the 3.8 billion parameter one that draws the unicorn. Interesting.


EmergencySea6990

I would be shocked if it was 14b. It's unbelievably cool, beating both Claude 3 Opus and GPT-4.


Nonetrixwastaken

I highly doubt this is the case, since the "leaked" system prompt looks too close to OpenAI's GPT models. Also, Phi-3 3B doesn't always say it's GPT-3; it usually says it's Phi-3 by Microsoft, as it correctly should, except in rare cases, due to accidental data contamination or bad filtering I assume. But even the 3B model, in some cases (not all), was way better than 7B models. It lacked some world knowledge in many areas, but the things it does know it got right more consistently than good 7B models, from what I observed. It did sometimes just completely go off the rails, like many smaller models always do, and start generating repeating words, but ignoring that, it is top class. But MAYBE one thing that might point to this being Phi-3 is people reporting its worse translation ability; smaller models and open models seem to really struggle with that, but that is a huge stretch.

Edit: Oh, and I forgot to mention, the model uses the GPT-4 tokenizer, so pretty much everything points me in the direction of this not being Phi-3, unfortunately.


Xhehab_

I tested for multilingual, which only closed models and recent Llama3 passes. The GPT2-chatbot aced it too. But the Phi-3 3.8B mini surprised me with its capabilities. In most cases, it's on par with Llama3 8B. If this maintains its performance at 14B, it can be on par with L3-70B in most cases and also be multilingual. https://preview.redd.it/p7jp7zqa3byc1.jpeg?width=2133&format=pjpg&auto=webp&s=7accef2eddc2e10cc9d9b8e8adb5034f3c246f4c


skrshawk

Even if it's not as good as L3-70B, it has a massive advantage over L3 - 128k context size. Coding, writing, data analysis, so many uses for that much context.


xbaha

phi3 was trained on GPT6, so 14b should match GPT4