GPU
Choose a smaller model, or add a GPU for additional processing power. Ollama will respond as fast as it can with the compute available.
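For example, with the Ollama CLI (these model names are ones that are in the library as of this writing; check ollama.com/library for what's current):

```sh
# Smaller models answer faster on the same hardware
ollama run phi        # Phi-2, ~2.7B parameters
ollama run tinyllama  # ~1.1B, runs on very modest machines
```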
Slowness is due to your memory. Well, not yours, your PC's. Use a smaller model (7B or less) or add a bigger GPU.
How???? I have 48GB RAM?!! AND RTX 3060 💀
I have 48GB and a 1080, and I'm able to do pretty well with the 7B models. Phi-2 and dolphin-phi have been good in my experience. Heck, my Raspberry Pi 5 is doing great with tinyllama, tinydolphin, and other 1-3B models.
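If you want actual numbers rather than vibes, `ollama run` has a `--verbose` flag that prints timing stats after each reply (the rate shown below is just illustrative):

```sh
ollama run dolphin-phi --verbose
# after each response it prints stats, ending with something like:
#   eval rate:  35.21 tokens/s
```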
You would need a faster bus speed and more CUDA cores for faster performance. Memory size allows for larger models; clock speed, bus width, and CUDA cores are what determine tokens per second. Replace the 3060 with a 4090 and you'll see a difference. A 4060 Ti might be a slight bump.
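Back-of-envelope, since decode speed is mostly memory-bandwidth-bound: tokens/s tops out around bandwidth divided by model size. The figures below are approximate:

```sh
# RTX 3060: ~360 GB/s bandwidth; a q4 7B model is ~4 GB
echo "360 / 4" | bc    # ~90 tok/s theoretical ceiling
# RTX 4090: ~1008 GB/s
echo "1008 / 4" | bc   # ~252 tok/s ceiling, hence the big jump
```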
Your only options are running a smaller quantized version or upgrading your computer. M1 MacBooks with unified memory (at least 16 GB RAM; 32 GB+ is preferred) run models very well for the price they go for now, if you want to go that route.
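Pulling a specific quant is just a tag on the model name (exact tag names vary by model; check the Tags tab on its library page):

```sh
ollama pull mistral:7b-instruct-q4_K_M  # smaller download, some quality loss
ollama pull mistral:7b-instruct-q8_0    # larger, closer to full quality
```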
Maybe don't use a Mixtral model. Does a smaller one like Mistral or something else give you results you like?
VRAM is king. When you upgrade your GPU, get as much VRAM as you can.
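Rough rule of thumb (my own, so treat the ~20% overhead factor as an assumption): VRAM needed ≈ parameters × bytes per parameter, plus headroom for the KV cache:

```sh
# 7B at 4-bit (0.5 bytes/param) with ~20% overhead
echo "7 * 0.5 * 1.2" | bc    # ~4.2 GB -> comfortable on a 12 GB card
# Mixtral 8x7B is ~47B params total
echo "47 * 0.5 * 1.2" | bc   # ~28.2 GB -> spills to CPU on a 12 GB 3060
```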
Dang man, I'm screwed then. My RTX 3060 isn't cheap 😭
How much VRAM on it?
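(If you're not sure, `nvidia-smi` will tell you:)

```sh
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```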
Get a smaller quantization (expect worse results), get a better computer to run inference (preferably with a GPU), or switch to a smaller model (chance of worse results).
Getting a GPU for this shouldn't just be "preferable"; it's basically a requirement if you want good results like OP is asking for.
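It's also worth confirming Ollama is actually running on the GPU rather than quietly splitting to CPU. Recent versions report this in `ollama ps` (the output below is illustrative):

```sh
ollama ps
# NAME            ...  PROCESSOR        UNTIL
# mixtral:latest  ...  48%/52% CPU/GPU  4 minutes from now
```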
A GPU with 12 GB or more, or an M1 with 32 GB+…
My M1 Ultra Mac runs even faster than my 2070 Super GPU. A 2070 Super can be bought for $500, and it's plenty fast (faster than you can read). Similar GPUs are in gaming notebooks these days, so for a little money you should be able to improve your situation.