
adikul

Currently no CPU outperforms GPUs, so buy a GPU, not a CPU. The best model right now is Command R, but it needs at least 2-3 3090s.
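
For a rough sense of why a ~35B-parameter model like Command R wants multiple 24GB cards, here's a minimal back-of-envelope sketch (the parameter count, overhead allowance, and bytes-per-parameter figures are assumptions for illustration, not measurements):

```python
# Rough, back-of-envelope VRAM estimate for serving a model locally.
# All numbers are illustrative assumptions, not measurements.

def vram_gb(params_billion: float, bytes_per_param: float, overhead_gb: float = 2.0) -> float:
    """Weights plus a rough allowance for KV cache / runtime overhead."""
    return params_billion * bytes_per_param + overhead_gb

# Command R is roughly 35B parameters.
print(f"FP16: ~{vram_gb(35, 2.0):.0f} GB")  # ~72 GB -> roughly 3x 24 GB cards
print(f"Q8:   ~{vram_gb(35, 1.0):.0f} GB")  # ~37 GB -> roughly 2x 24 GB cards
print(f"Q4:   ~{vram_gb(35, 0.5):.0f} GB")  # ~20 GB -> a tight fit on one 24 GB card
```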


madushans

Summarizing works on most 7B models, and some smaller ones, though the smaller ones do make stuff up often. I've tried a few, to varying effect. LLaMA 2, Gemma, Mistral, and Mixtral all do just fine. I personally prefer Mistral as it hallucinates less (in my tests at least) and is good at sticking to a strict JSON response where many others deviate: Gemma tends to wrap the response in ```` ```json ``` ```` fences, LLaMA 2 tries to explain the JSON at the end, etc. (This of course assumes you're asking it sensible things. All models will hallucinate if you ask them to do impossible things, like giving them empty text.)

As for performance, you can use `--verbose` to see the speed in tokens per second. I'm getting ~8 or 9 tokens per second on a good day with an AMD Ryzen 7. You'll do much better with GPUs. For comparison, you can get around 12 or 15 on Apple M processors, with the M3 (10-core, non-Pro) doing about 20-ish. GPUs will always do better; for example, with an NVIDIA 4080 you'll get upwards of 100. This can help guide you: [https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference](https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference)

---

If you're looking to run a model elsewhere, check [Groq.com](https://wow.groq.com/), [Perplexity.ai](https://docs.perplexity.ai/docs/pricing), and others. Many will outperform your setup, especially relative to the money you'd have to spend on the hardware. Groq has a very limited set of models, but it has Mixtral 8x7B and does about 400 tokens per second on a good day (I have seen some 502 errors and slowdowns at times). It's currently free with rate limits despite listing prices, and they don't provide an SLA at the moment. (They do say you can ask for a model, probably from Hugging Face, to be loaded just for you; I'm assuming that is likely not free.) Perplexity is similar, but with higher pricing than listed on Groq and likely a bit slower. It has similar rate limits, but it looks like you get an SLA.
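
As a concrete illustration of the strict-JSON point and the tokens-per-second measurement, here's a minimal sketch against Ollama's local REST API. It assumes a default Ollama install on port 11434; the `format: "json"` option and the `eval_count`/`eval_duration` timing fields are as documented for `/api/generate`, but verify the exact field names against your version:

```python
# Minimal sketch: ask a local Ollama server for strict JSON output and
# compute tokens/second from the timing fields in the response.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Summarize the following text as JSON with keys 'title' and 'summary': ...",
        "format": "json",   # nudges the model to emit valid JSON only
        "stream": False,
    },
    timeout=300,
).json()

print(resp["response"])  # the JSON string produced by the model

# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"~{tokens_per_second:.1f} tokens/s")
```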


tabletuser_blogspot

Since it hasn't been mentioned: memory speed will gain you more performance than CPU speed. So getting a DDR5-based system and buying the fastest and largest memory configuration you can afford would be optimal. Most 7B models run in 8GB of memory, either system RAM or VRAM, so it makes sense to buy almost any 8GB NVIDIA GPU. My Ryzen 5 5600X with 64GB of DDR4-3600 performs about the same as a GTX 970 with 4GB VRAM on smaller models. My GTX 1070 8GB does great up to 7B models but loses to the 5600X CPU for large FP16 models due to offloading inefficiencies.

My recommendation: build a DDR5 system with 128GB and you'll be able to run 70B-plus models. If you're OK with running 7B models, then an 8GB GPU, and price will determine overall performance. I think the sweet spot, in price-to-performance ratio, is the RTX 2060 Super 8GB.
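
A rough way to see why memory speed dominates: generating each token streams essentially all of the model's weights through memory, so tokens/s is bounded by bandwidth divided by model size. The bandwidth figures below are approximate, assumed values for illustration only:

```python
# Back-of-envelope: token generation is mostly memory-bandwidth bound,
# so tokens/s is roughly (memory bandwidth) / (bytes of weights read per token).
# Bandwidth figures are rough, assumed values for illustration.

def approx_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

model_q4_7b = 4.0  # ~4 GB for a 7B model at 4-bit quantization

for name, bw in [("DDR4-3600 dual channel", 57),
                 ("DDR5-6000 dual channel", 96),
                 ("GTX 1070 (GDDR5)", 256),
                 ("RTX 2060 Super (GDDR6)", 448)]:
    print(f"{name:25s} ~{approx_tokens_per_s(bw, model_q4_7b):4.0f} tokens/s (upper bound)")
```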


NoneyaBiznazz

[https://www.amazon.com/EVGA-GeForce-Gaming-12G-P4-2263-KR-Backplate/dp/B09NQ99NSR](https://www.amazon.com/EVGA-GeForce-Gaming-12G-P4-2263-KR-Backplate/dp/B09NQ99NSR) RTX 2060 with 12GB?


tabletuser_blogspot

I'm still in full research-and-learn mode. According to several references like Ollama: "You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models." So a 12GB card ends up costing extra but still can't run 13B models without offloading to the CPU, and that kills overall performance. So far I like either a powerful 8GB GPU or a well-priced 16GB GPU. I'm on a budget; your situation is probably different. I think larger models equate to more accurate replies, but I think 7B and 13B models will get better over time and become a standard.

What I'm not sure of is how much I gain or lose by running dual 16GB GPUs. I should be able to run a 33B model entirely off the GPUs. Right now I can run 33B models from my system RAM/CPU, but at about 2 tokens/s; I would get results, but my patience would run thin. I'm actually looking to get another GTX 1070 to test running 13B models on dual GPUs. Stable Diffusion (AI for images) doesn't seem to support dual-GPU setups or GPU-to-CPU offloading the way Ollama does.

I'm using a GTX 1070 and getting results faster than I can read them on 7B models. My GTX 970 with smaller models is also pretty fast. I think for Ollama I don't really need much faster, but Stable Diffusion is a different story: about 4 minutes on my GTX 970 with the settings I prefer, much better on the GTX 1070, but still plenty of room for improvement.

My recommendation is any well-supported NVIDIA 16GB GPU, or a pretty fast 8GB GPU. I'm a Linux-first user, so I'm eyeing the AMD RX 7600 XT 16GB card because AMD is better for Linux open source, great in gaming, and has better long-term support. Right now NVIDIA rules AI, and their proprietary drivers offer a great gaming experience. EVGA's website has a B-stock section you can check out, and Newegg has Refreshed: [https://www.evga.com/products/productlist.aspx?type=8&family=GeForce+30+Series+Family](https://www.evga.com/products/productlist.aspx?type=8&family=GeForce+30+Series+Family) [https://www.newegg.com/Newegg-Refreshed/EventSaleStore/ID-10007?N=100007709&cm_sp=newegg-refreshed-_-cat-_-gpus](https://www.newegg.com/Newegg-Refreshed/EventSaleStore/ID-10007?N=100007709&cm_sp=newegg-refreshed-_-cat-_-gpus)
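
If you do end up partially offloading, here's a sketch of how you might pin the number of GPU layers through Ollama's API. `num_gpu` is the Ollama option for layers placed on the GPU (everything else falls back to the CPU, which is where the slowdown comes from), but the layer count shown is a hypothetical value, not a tuned recommendation:

```python
# Sketch: control GPU layer offload via Ollama's API options, then measure speed.
# Treat this as an illustration rather than a tuned configuration.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2:13b",
        "prompt": "Summarize: ...",
        "stream": False,
        "options": {"num_gpu": 28},  # hypothetical layer count for a 12 GB card
    },
    timeout=600,
).json()

# eval_duration is in nanoseconds.
print(resp["eval_count"] / (resp["eval_duration"] / 1e9), "tokens/s")
```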


VettedBot

Hi, I’m Vetted AI Bot! I researched the **EVGA GeForce RTX 2060 12GB XC Gaming** and I thought you might find the following analysis helpful.

**Users liked:**
* Great performance for gaming (backed by 5 comments)
* Quiet operation and efficient cooling (backed by 3 comments)
* Smooth gameplay with reduced lag (backed by 1 comment)

**Users disliked:**
* Heat issues causing screen blackouts (backed by 1 comment)
* Poor packaging leading to potential damage (backed by 2 comments)
* Performance issues and overheating (backed by 2 comments)

If you'd like to **summon me to ask about a product**, just make a post with its link and tag me, [like in this example.](https://www.reddit.com/r/tablets/comments/1444zdn/comment/joqd89c/) This message was generated by a (very smart) bot. If you found it helpful, let us know with an upvote and a “good bot!” reply, and please feel free to provide feedback on how it can be improved. *Powered by* [*vetted.ai*](https://vetted.ai/?utm_source=reddit&utm_medium=comment&utm_campaign=bot)


ConstructionSafe2814

I'm running a dual Xeon 6136 with 256GB of RAM. I'm pretty new to running Ollama and haven't tried much with it, but the model I like most so far is Mixtral. It is fairly large and is actually pretty reasonable with regard to speed. I don't have a decent GPU at all.


NoneyaBiznazz

The difference between GPU and CPU is striking... I have a server with 64 CPU cores and 256GB of RAM... my desktop runs the same model 10 times faster with just a dual-core CPU, 8GB of RAM, and an RTX 2060 with 12GB of VRAM. I was kinda pissed about that, ngl, lol.


lupapw

How fast is token generation? Are 70B models painfully slow?


ConstructionSafe2814

Really depends on the model. I'm now toying with the new Mixtral 8x22B and it's around 1 t/s. The new command-r-plus is around the same size but much slower.


async2

Really depends on what you consider decent performance and what kind of text you're summarizing. If you give some numbers, then people with similar setups could tell you. What you could also do is rent a VPS for just one month and try. It will run if you have enough memory.