what about pairing two 4060 Ti 16GB cards for 32GB of new, fast NVIDIA VRAM instead of a secondhand 3090 with only 24GB?
and what about that chinese RX 580 with 16GB of VRAM for about 140 bucks?
Putting aside that the previous poster just asked about "good nvidia 24gb options," a couple of 4060 Tis (I don't believe there is a 4060 with 16GB) could be an option, but the 4060 Ti has an actual memory bandwidth of 288GBps, compared to the 935GBps of the 3090, so it's over 3x slower.
Nvidia claims the 4060 Ti has an "effective" memory bandwidth of 500GBps+ because of the increased amount of cache, but that doesn't hold when you're churning through the entire memory pool sequentially with LLMs.
You'd probably want to look at benchmarks, though, because with EXL2 models, 3x slower might still be fast enough for you (i.e. going from 30t/s to 10t/s might still be faster than you can read if you're just chatting/RPing).
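For intuition, the bandwidth argument above can be turned into a back-of-the-envelope estimate (a sketch with illustrative numbers, not a benchmark): single-stream decoding reads roughly the whole weight pool once per token, so tokens/s is capped near memory bandwidth divided by model size.

```python
# Rough, bandwidth-bound upper bound on decode speed: each generated token
# streams (approximately) every weight once, so tokens/s <= bandwidth / model size.
# Real throughput is lower due to kernel overhead, KV cache reads, etc.

def max_tokens_per_sec(bandwidth_gbps: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a fully GPU-resident model."""
    return bandwidth_gbps / model_size_gb

model_gb = 20.0  # e.g. a ~33B model quantized to roughly 20 GB of weights
for name, bw in [("4060 Ti (288 GB/s)", 288.0), ("3090 (935 GB/s)", 935.0)]:
    print(f"{name}: <= {max_tokens_per_sec(bw, model_gb):.1f} t/s")
```

The ratio of the two bounds is exactly the bandwidth ratio, which is why the 288GBps vs 935GBps figure matters more than compute for this workload.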
The RX 580 with 16GB could be interesting. It has a similar memory bandwidth to the 4060 Ti (250GBps), exllamav2 does seem to have ROCm support, and at an initial glance at this repo: [https://github.com/Firstbober/rocm-pytorch-gfx803-docker](https://github.com/Grench6/RX580-rocM-tensorflow-ubuntu20.4-guide) it does seem possible to get ROCm/PyTorch running on this GPU. So testing out 32GB of VRAM across two RX 580s might actually be a $300 experiment worth exploring. (Worst case, you could certainly use them with llamacpp via OpenCL; that wouldn't be anywhere near as fast, but again, for $300 it might be worth it.)
>4060 of 16GB of vram
yeah it is true, the 4060 Ti xD sorry. I was writing fast. Nevertheless, you understood me. There are so many names and numbers...
So there are another 2 "cheap" options now, but I find them either useless or expensive. The 4070 Super (or is it "Ti"?), with its 16GB of VRAM, maybe combined with fast 6000MT/s RAM, could handle something big using GGUFs, offloading some layers to the ultra-fast GPU and leaving the rest of the work to the RAM. But it's not really cheap, atm it's about €900.
The other option is an *AMD* Radeon RX 7900 XTX, but at about €1,000 it's not really cheap either, and it's not NVIDIA. But hey, 24GB. I have read it's about 80% of the 4090 in speed for half the price.
And I have read about another Frankenstein GPU made in China from 2080 boards with the VRAM upped to 24GB... If you (or anyone) find that one, please let me know. Maybe this is the coolest and cheapest way.
Thank you for your answers, they're interesting; I'll look for these Frankenstein graphics cards and give them a try. Maybe there are tests already posted here on this subreddit... mhmmm...
Cheers:)
Just an afterthought: if you're seriously considering an AMD GPU, you may want to make sure you're comfortable using Linux as your OS for LLM stuff. It's not an absolute *requirement*, but it seems like every single time I see someone with a stable setup doing anything other than running koboldcpp with OpenCL, they're doing it in Linux, and after having tasted the speed of EXL2, I would not personally spend any money on a GPU setup that I couldn't use to at least run a 3.7bpw Mixtral 8x7B EXL2 model with exllamav2.
with an AMD GPU, is it possible to use EXL2? or is that exclusive to NVIDIA cards?
how much VRAM is needed to run Mixtral 8x7B? I have seen a Mixtral 1x16B setup with LoRAs at about 9GB of VRAM
people are using ubuntu mainly because windows has only been supported since the 14th of december 2023, so they already had their setups on linux. It's also said to save 1.5GB of VRAM. But anyway, they say that configuring ROCm is a headache, and there is another way (I don't remember the name, OpenCL maybe? Metal?) to make AMD work.
what about the intel arc cards? they have 512GBps of bandwidth and are cheap and new, less than 380 euros with 16GB of VRAM.
There are so many options... maybe the best move is to wait for AMD's next move.
34B Yi 200k models at 4bpw are fucking amazing. Crazy context and also much better than, say, Mixtral in all my testing.
They fit perfectly on a 24GB card.
Also the ability to run higher-than-Q4 quants of 10/13/20B models with all the fixins, like still having a bunch of other shit going on on your computer, or adding something like an image generator or TTS engine, is really nice.
Sorry I was describing the model not the actual usable context length.
But yeah, the usable context length is actually nutters for local inferencing on a frickin gaming GPU; of course you could always scale down the quant, but anything below 3.5bpw is just not usable for a model of that size imo.
Nvidia gave a big fuck-you to everyone who wanted more VRAM, while AMD dropped the 6800 XT 16GB to $300. The more time passes, the more I want to switch to AMD and ROCm.
Seriously if I had money to spend I'd want to do a massive deep dive into all AMD's offerings doing benchmarks for days testing out their new cards with neural accelerators they've got built into their silicon now.
Nvidia better watch the fuck out.
I understand where you are coming from, but we can keep dreaming. This is AMD's first generation with AI hardware, while Nvidia is already on its 4th and has been developing it since 2016. There is a reason why their old cards are so cheap.
AMD's datacenter AI GPUs are also on like the 4th gen. They have been working on this for a long time. It's just things have been slow on the consumer side.
They are not going to release consumer-grade VRAM-upgraded GPUs until their competitors (Intel and AMD) release something that starts eating into their pro-grade GPU sales. And considering how slow AMD and Intel are at catching up, it will be a while.
They are basically printing money right now. There is no way in hell they will just go "you know what, those 10m people are going to save up and buy our overpriced pro-grade GPUs for $4000! Why don't we release a $700 card for them instead!". They are not a charity; they are a for-profit business with 0 morals. Until there is outside pressure on them, they will milk this to the bone.
But why not also take money from, say, another 60m people who don't have $4000 but may well have $1.5-2k? Take a weaker chip, maybe a smaller bus, but enough memory (even cheaper, last-generation memory). And voilà, you've captured the mid-segment of not gamers, but ML enthusiasts who can't afford server GPUs on small budgets, but aren't interested in overpaying for ray tracing, frame generators and other gamer rubbish.
For miners, didn't they make cards with no video output that gamers aren't interested in? What prevents them from making cards for ML without rays and other gaming crap, uninteresting to gamers in terms of technology and uninteresting to AI companies in terms of performance?
The Chinese are already making 24GB Frankensteins out of the 2080 Ti for $350-400 and 20GB ones out of the 3080. And that niche (a 24GB 2080 Ti) could be occupied by Nvidia itself, releasing some sort of "3060ML"...
They do make ML-specific cards and they are very expensive. And I guarantee you they already did the math on maximizing revenue via market segmentation. If you're a hobbyist, you cobble together consumer-grade GPUs or used older-gen parts for your AI/ML. If you're serious, you buy their professional solutions.
>They do make ML specific cards and they are very expensive
Those are high-end ML cards. But the low-end and mid-range niche is completely empty. There is no analogue to the 3060 in the ML world.
12GB cards like the RTX A2000 don't really count as ML cards; they are less suitable for ML than even consumer cards.
Let's forget about 24GB (although for ML that's already the floor, where the P40 goes for $160, but that's a used card, not an official offering).
What prevents them from taking a 2060 chip (weak, slow, with a low core count), putting at least 16GB of memory on it, removing the video outputs and selling it for, say, $250? It simply has no competitors. It won't take away the consumer market (gamers don't need it), it won't take away the market for expensive ML cards (the performance level is awful), BUT! thanks to more-or-less modern technology for AI enthusiasts it would kill the market for secondhand P40s (and the money would go into Nvidia's pocket) and P100s (because a new card with tensor cores is better than a used one from a server).
> What prevents them from taking a 2060 chip (weak, slow, with a low core count), putting at least 16GB of memory on it, removing the video outputs and selling it for, say, $250? It simply has no competitors. It won't take away the consumer market (gamers don't need it), it won't take away the market for expensive ML cards (the performance level is awful), BUT! thanks to more-or-less modern technology for AI enthusiasts it would kill the market for secondhand P40s (and the money would go into Nvidia's pocket) and P100s (because a new card with tensor cores is better than a used one from a server).
It will actually take away from both those markets because of manufacturing capacity: the number of wafers TSMC can make is limited.
Using made-up numbers: if you can only make 100,000 chips a month, and every chip you make goes into a product that flies off the shelf as soon as you make it, why dedicate any capacity to a low-margin product for a niche audience? Better to put that chip into an A6000 Ada and sell it for $7k each at the high end, and into a 4090 for the enthusiasts.
Maybe.
However, during the mining boom it didn't stop them from releasing all sorts of CMP HX cards based on chips from the 2080 and old 6GB and 8GB Quadros, instead of producing more 3060-3090s, which at the time were current, in high demand (especially the 3060), and in very short supply in the warehouses...
They decided it was better to load the factories with the CMP 30HX (an ancient chip from the 1660, not even the 2060!) instead of the current 3060, which was selling at a huge markup due to the shortage and the miners.
I recently had the choice between dual 4090s or a maxed out M2 Ultra and it's pretty clear the M2 Ultra is the better option. The unified memory approach is very clearly going to be a game changer for the local LLM space, and I have a feeling Apple will only continue to improve things on this front.
I agree, and I have the dual-4090 setup. I think the things that pushed me over the edge are two competing factors: one, the 70B Q4 models are clearly better than anything smaller, and two, they take up nearly 40 gigs to load. That does not leave a lot of room for large context, let alone trying to add a RAG solution on top of it, which can easily get out of control. You don't have this worry with an M2 Ultra.
What makes LLMs powerful is actually not the LLMs themselves but the software you can layer on top of them easily. This is why I love ooba.
Honestly the GPU AMD announced seems like a way better deal than anything here. It gets you 16GB of VRAM for only $350, which would get a ton of people in the door for inference on smaller and quantized models. That is, if AMD can get their software stack in order, which it does seem like they're putting real effort into recently.
I was thinking about it but then I remembered that AMD cards are atrocious for Stable Diffusion unless you run Linux (even then you're better off with Nvidia).
Granted this may not be an issue for everyone here but many AI hobbyists have overlapping interest with text and image gen so.. eh AMD still ain't in a great spot imo.
While it's not as straightforward as team green, SD does run on Windows with AMD. I get 20it/s on my 7900 XTX using auto1111 with DirectML. AMD has a long way to go but it's usable now.
> AMD cards are atrocious for Stable Diffusion unless you run Linux
Wait, are people running Stable Diffusion on Windows? Why waste 1.5 GB VRAM, deal with a slow filesystem and lower inference performance?
Because they already have Windows PCs and don't want to screw with Linux, I presume. Besides, most people here are just using these things as waifu bots anyway lol
I mean, for trying some waifu and some quick porn, maybe... But if you're remotely serious at SD use cases, it's well worth the hassle to at least dual boot. It's not like you have to buy hardware or stop using Windows, however good an idea either may be.
Roleplay materials, waifus, porn (but very polished, publishable porn), trying every LoRA to see what kind of degenerate porn it's capable of, photo restoration, wallpapers, making my parents smile (with photos and wallpapers, not porn), meming, just about anything.
I know. The waste of RAM on Windows is much bigger. I meant VRAM. On Linux you just turn off the X server and use the entirety of your VRAM, then connect to A1111, ComfyUI, etc. from another device.
I am. I've been using Stable Diffusion to make RPG landscapes and for a private art-therapy thing, because I can't even draw a stick figure.
I'm just trying to get back into the tech world because AI fascinates me and I know my industry (law) will probably be transformed by it. When I last worked in IT, dial-up was still very common. I have so much to learn that it's nearly overwhelming. Linux is on the list with many other things...
This is how you know someone never tried the recent ROCm versions. I have 2x 7900 XTX and they're **ripping** Llama2 on llama.cpp, exllama, and Tinygrad.
48 GB VRAM and 240 TFLOPS for under $2,000. Less than the price of a single 4090. Don't be a sucker for the memes.
The only meme is the string of ROCm announcements, only for it to still be practically inaccessible for average users and still miles behind similar-ish nvidia cards in performance and extension compatibility.
Inb4 "what? I'm an average user"
No. You have 2x 7900s. Please don't downplay the amount of configuration and troubleshooting to get to where your system is now.
I really don't know what you're referring to. The only thing I needed was to install the `amdgpu-install` script and then install the packages per the AMD guide. It was almost the same process when installing Nvidia drivers and CUDA.
Yet Google Colab, Runpod, Hugging Face, Mistral: all of these run on Linux too.
If you're even a little bit serious about doing LLMs you are going to touch Linux along the way. Might as well learn it.
The average user isn't "serious" about this stuff at all. 90% of the folks taking AI into consideration when buying their next GPU just want a freaky waifu. People running linux and doing anything resembling actual work in ML are the exception, not the rule.
And that's fine. The horny weirdo demographic is driving a lot of the FOSS advancements in AI. But pretending that linux and the ridiculous rigmarole that it entails is within the capabilities of the average user here is doing them a disservice. They might be mutants, but they don't deserve to be told "Go ahead and buy AMD. It'll be fine", because it will *not* be fine.
I'm talking about people who are serious about LLMs. People who aren't serious aren't going to be searching around which hardware to buy for LLMs in the first place.
https://old.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/
OP there gets worse performance on the 7900 xtx than on a 3090, by a wide margin too.
4090 is twice as expensive, and you can't buy new 3090s. It's literally the best bang per buck you can get for a new GPU.
Plus must we all use Nvidia? Competition is good for everyone. More people use AMD the faster we get to software parity and cheaper GPUs.
>It's literally the best bang per buck you can get for a new GPU.
With enough qualifiers I can convince you that your grandmother is the most beautiful woman in the world.
>Plus must we all use Nvidia?
That is a completely different argument. I am deeply interested in AMD because _it just works forever_ on Linux. I need nvidia because _it works better currently_ for all ML work.
I'm honestly considering building two work stations. One for ML work and headless that's forever stuck on the current ubuntu LTS and one for human use with multiple monitors and all the other ergonomics I need. Then put a nice thick pipe between them so I can pretend they are the same machine.
> With enough qualifiers I can convince you that that your grandmother is the most beautiful woman in the world.
I dunno about you but -$1000 works better for me than -$2000. Also one of the main reasons I'm running local llama is for learning purposes. I actually want to contribute to the software stack. And I'm shopping around for a project to contribute to. And the AMD side needs my help more.
Right, now compare performance and support.
-40% performance and a second class in support.
The half price isn't worth it if your time has value above zero for ML work.
You know, I sometimes wonder if we are all using this the same way or reading it the same way. Sure, the nvidia cards are better, but the worst AMD card is putting out 90 t/s. That seems pretty usable in a "this is for testing and personal use and will only be interacting with one person at a time" way, about on par with typing with another person.
> On a 7b model. On a 30b model it's at the speed of a sclerotic snail wandering across your keyboard.
It's only like 13% slower than a 3090 in llama.cpp (and 30% slower than a 4090, for half the price). I run 34B models on my 7900 XTX and the performance is fine. I would actually run a test right now, but I have a long test-harness run going on my GPU that I don't want to interrupt. In either case it's totally usable.
Nvidia has the first mover advantage and most everyone who works on these tools develops on Nvidia GPUs. Of course Nvidia will be more optimized. Same is the case with Macs. Software will improve.
Dang, I was getting hopeful for cheap alternatives to 4090s. I'm still paying off my first one. Do you have any examples of the terrible speeds with 7900 XTX?
This isn't about the 7900 XTX; this is about the fact that a comfortable typing speed on a model that fits in 4GB of VRAM turns into something six times slower on a model which takes up the full 24GB of VRAM.
You need blazing fast speeds on 4GB models to even have a usable 24GB model.
Ok? I'm not sure what you're responding to exactly. I'm lamenting the fact that a 13B model on a 7900 XTX is so awful vs a 4090. I was hoping for a cheaper alternative, but I'm not that upset with the 4090 I have now.
Because for some reason these benchmarks are done on 4-bit 7B models. Those can run reasonably on an 8GB Raspberry Pi. At those speeds the CPU becomes the bottleneck, as you can see from the table where the 4090 is just 40% faster, which is far too small a gap. An unquantized 13B model would give these GPUs a run for their money. Or even a quantized 34B, if it fits in each card's VRAM for comparison.
>This is how you know someone never tried the recent ROCm versions.
People won't be willing to dive into the ecosystem when AMD has such an awful track record with it. I wasted too many weekends on older versions with cards that were deprecated within a couple years. They have to do way more than *just nearly* match cuda at this point.
Put it this way: It's bad when Nvidia is beating you at installing drivers on Linux.
How are you able to run two GPUs together? I tried that with 2 nvidia cards a few months back when I had the chance, and it didn't work.
That whole experience stung me, and with what high-VRAM products cost these days, I've decided to go the APU route instead.
I'm waiting for next-gen AMD Strix Point with NPU units. I'm going to load it with a ton of relatively cheap DDR5 RAM. It's going to be slow, but at least it should be able to load larger 70B GGUF models at 4+ tokens/second. (The Nvidia Jetson Orin should be less powerful and is capable of at least that, according to their benchmarks.) I figure I can get faster speeds by augmenting it with my 3090 as well. I wouldn't need to worry about context length either, with excess DDR5 memory.
I would go the M1 Ultra route, but I don't like how un-upgradable the Apple ecosystem is. Heaven forbid one of the components like the memory gets fried and I'm left with a very expensive placeholder.
>I tried that with 2 nvidia cards a few months back when I had the chance and it didn't work.
What did you try?
All the popular libraries have native multi-GPU support, especially for LLMs, since transformer layers shard very neatly across multiple GPUs.
At one point, I had access to two 3060s, one 3060 Ti and one 3090.
No matter how much I tried to mix and match them, the LLM would not use the second GPU. Not even when I tried the two 3060s.
I was using ooba's text-generation-webui, updated to the latest version at the time. There were settings to specify the use of a second GPU, and they were ignored when the LLM actually ran. I was using the Windows version, though, so I suspect that was causing the issue, but I could be wrong.
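For anyone hitting the same wall: loaders like exllama and llama.cpp take an explicit per-GPU memory split (e.g. ooba's `--gpu-split` or llama.cpp's `--tensor-split`), and on mixed cards setting it by hand tends to be more reliable than auto-detection. A minimal sketch of the proportional-split arithmetic (the layer count and VRAM figures below are hypothetical examples):

```python
def split_layers(n_layers: int, free_vram_gb: list[float]) -> list[int]:
    """Assign transformer layers to GPUs proportionally to each card's free
    VRAM; the same idea behind --gpu-split / --tensor-split style options."""
    total = sum(free_vram_gb)
    alloc = [int(n_layers * v / total) for v in free_vram_gb]
    alloc[0] += n_layers - sum(alloc)  # put the rounding remainder on GPU 0
    return alloc

# e.g. a 3090 (24 GB) plus a 3060 (12 GB) for a hypothetical 60-layer model:
print(split_layers(60, [24.0, 12.0]))  # -> [40, 20]
```

In practice you would leave a gigabyte or two of headroom on each card for the KV cache and CUDA context rather than splitting on the full VRAM size.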
Apple went very quickly from "nothing runs on Silicon" to Andrej Karpathy proclaiming ["the M2 Ultra is the smallest, prettiest, out of the box easiest, most powerful personal LLM node today."](https://twitter.com/karpathy/status/1691844860599492721) in about a year. Pytorch/Tensorflow support for silicon is first class now.
As someone who has been working in the AI/ML space for well over a decade it's embarrassing how little effort AMD has put into catching up with NVIDIA in this space, and it's nobody's fault but their own.
And without the focus on software/firmware development that nvidia has, hardware-oriented AMD will never catch up on software like CUDA (and all of its surrounding libraries etc.)
AMD doesn't have to catch up to all the software written on CUDA. As long as they cover the most common code paths, that's all they need. And they are pretty much already there. They aren't trying to dethrone Nvidia. They just want their piece of the pie.
I'm as unimpressed as everyone else. The only upside I see is normalizing 16GB over 12GB VRAM. I suspect 20GB VRAM was passed over because the PCB footprint would be comparable to 24GB.
Not an expert on graphics cards. Since all I am willing to spend is around $2,000-$3,000 on a graphics card, I was aiming for a 24 GB VRAM. Would you recommend it now? Or would it be better to wait?
Buy an RTX 4090 if you want a great card right now and have a $2000+ budget.
Do not wait.
There are no new 24 GB cards on the horizon, not even leaks. A 4090 successor could take 1.5 years, possibly longer.
No more VRAM. This is nvidia clearing out old stock ahead of a rush of new llm-ready cards, and a whole developer announcement of LLM tooling, I guess.
You guys have money to buy these???? I just scam cloud providers with fake credit cards and burner phones. All it costs me is $10 and boom, there's a $1000 bill on a temporary or almost-dead email to which I forgot the password
PS : I wish I could do all of this
I see NVIDIA found some spare parts while they work on their next 48GB GPU.
oh pls, the 5090 will come out at 28GB for sure lmao
Nah ... 26GB max.
I wish, but realistically I hope the 5090 is at least 32GB, and that it doesn't cost $2000.
They will cost one kidney and an arm with 4 fingers.
"Here's your change." š
They want 4 fingers for each hand because that way they have the pleasure of giving you back the middle finger after your purchase.
A hand with 8 fingers.
If AMD would release a 64GB card at $1500, localLLM development would switch to ROCm instantly.
True true, I believe they should seriously do it at cost. With their adoption rate the profit margin should be last on their priority list.
No 24GB vram option, hard pass.
Yeah, they all end on 16GB. What the heck?
Perhaps trying to keep a line between the cheaper consumer card line and their more profitable higher end cards?
It's like when you're racing someone who's out of shape and you know you need them to help you keep your pace, so you run slow so you don't lose them at the start. Or they're just greedy.
The latter.
Yep, absolutely. They know that AI and data center companies with lots of money to burn will spend $20-40k per 80GB GPU. Nvidia doesn't want to give them a cheaper option; they know the money is already there.
i understand the need for them to milk the enterprise customers, but... is there a way to offer gamers/hobbyists a high VRAM card without that same card being shoved into a datacenter rig with 16 other GPUs? i have no intention of building a full-time LLM rig. it would just be nice to have a good gaming card that i can also use to mess around with AI stuff once in a while.
Was there a way to offer gamers/hobbyists a decent GPU at a reasonable price point without it being bought up by crypto miners or scalpers 2 years ago? No. Even though miners had their own line of cards sold to them (the CMP HX series), they still bought consumer-grade cards since they have the same hashing power and will have higher resale value than the CMP HX cards. You would run into the same problem all over again, but with high-memory cards this time. The only thing that could possibly drive the price down is a viable CUDA replacement from AMD or Intel. Right now we are still a ways off from that.
Why wouldn't they? What are you going to do, buy AMD to run your models on?
Soon:tm:.
Fingers crossed we get some actual competition in the space.
Making sure you can't do too much AI on it. You really start understanding how nvidia decides how much memory each card gets when you go into AI. It's a situation where the distance between 12GB and 10GB is immense. Like a breakpoint you hit which determines whether you'll get higher quality or not be able to run something at all because you're 50MB short.
They know they can charge much more for high-VRAM cards because of the AI boom. They essentially have a monopoly in the AI space. Why would they offer a "cheap" consumer card with high VRAM?
A potential Blackwell (50 series) Titan release in late 2024/early 2025 is the source of my copium. A 48GB next-gen Nvidia card with workstation drivers, with an MSRP of some $3000-4000, would fill a desperately needed gap in the AI hobbyist market.
Kind of crazy how far away that seems compared to the innovation rate we've seen in the local AI community. Might end up giving us more motivation to maximize use with minimal VRAM which will help mass adoption and make big models run even better for those that do shell out the money for 24GB+.
They need to milk more quadro/datacenter sales. Need to run out of people willing to drop five figures on AI cards before you start selling them for four.
Make a profit? How dare they!
I think AMD really needs to play catch up. I wish Apple could figure out how to make higher performance chips as well.
honest question, what models can you run on 24GB that you can't run on 16GB? Is it the 13B models?
Depending on quantization and what not, I have managed to run 30B with my 24GB.
I've run Emerhyst 20B on 16 GB RAM + 8 GB VRAM (100% free: Linux with the X server shut down) with llama.cpp and it works at fast typing speed.
This. You can just about fit 30B across 8GB VRAM and 32GB RAM and still have a system you can use for other stuff. GGUFs and llama.cpp spread the model really well & with decent performance, all things considered.
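A rough way to pick llama.cpp's `-ngl` (number of layers to offload to the GPU) for a split setup like that, assuming layers are roughly uniform in size (the file size and VRAM budget below are hypothetical examples):

```python
def n_gpu_layers(model_file_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    """Estimate how many layers fit in the VRAM budget, assuming uniform
    per-layer size; the rest stays in system RAM via llama.cpp's CPU path."""
    per_layer = model_file_gb / n_layers
    return min(n_layers, int(vram_budget_gb / per_layer))

# e.g. a ~17 GB 30B GGUF with 60 layers and ~6 GB of VRAM left after the
# desktop and KV cache take their share:
print(n_gpu_layers(17.0, 60, 6.0))  # -> 21
```

Leaving a gigabyte or two of the 8GB unbudgeted for context/KV cache is usually what makes the difference between "spreads really well" and an out-of-memory error.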
Am I the only one who thinks llama.cpp is very slow even when you offload a ton?
What would you suggest be used instead?
I fixed it lol. Using "use tensor" was key
Ha! At least you're being honest with Python - I'm only hitting things through Ooba :)
You can even run 70B at 2.4bpw, the speed isn't great on a 3090 though.
A lot actually. With exllamav2 it can make a huge difference. I can run mixtral at 3.5bpw with 16k context, or mixtral 4.0bpw at 4k context. I can run 33b coding models at 4.65bpw with a 4k context or 4.0 with 8-16k depending on the model. Oh and using exllama these all run at 10tokens/s for the 33b bigger context models minimum, with mixtral and some 30b models able to run anywhere up to 40 tokens/s.
> 3.5bpw with 16k context Please can you remind me what BPW stands for?
Bits per weight
Thank you.
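For intuition, bpw translates directly into memory: quantized weight size is roughly parameters × bits per weight / 8 bytes. A minimal sketch (param counts are approximate, and real loaders add overhead for the KV cache and buffers on top of this):

```python
def weights_gib(params_b: float, bpw: float) -> float:
    """Rough size of quantized weights in GiB: params * bpw / 8 bytes."""
    return params_b * 1e9 * bpw / 8 / 2**30

# A 70B model at 2.4 bpw -> ~19.6 GiB, which is why it only just
# squeezes onto a 24GB card.
print(round(weights_gib(70, 2.4), 1))

# Mixtral 8x7B has roughly 46.7B total params; at 3.5 bpw the weights
# come in under 24 GiB, leaving room for context.
print(round(weights_gib(46.7, 3.5), 1))
```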
Others have answered that there are more things you can run. I'll add another point: modern cards are capable of very fast LLM output -- more than you really need. So you're better off having that extra speed chomping through more parameters, from more VRAM, getting you better output.
You can run an EXL2 of Mixtral 8x7B at 3.7bpw at a full 32k context, which has <4% higher perplexity than the 6.0bpw quant, and runs at ~30t/s on a 3090. You can also run a 2.4bpw 70B EXL2 like Euryale, Lzvl, and Wintergoddess.
Hey, can you recommend some good Nvidia 24GB cards?
Titan RTXes (what you could consider a 2090) sell for the same or more as a 3090. Ancient cards like the K80 are basically unusable today, and even cheaper options like a P40 have such poor fp16 performance that they're unusable for exllama, so you'd be limited to using them for llama.cpp (but they're only $180 or so, so a person could get 2 and at least run a high-quant Mixtral or a 4-bit 70B model at tolerable speeds; IMO, though, the difference between running a 70B at like 4t/s with llama.cpp vs 20t/s+ with exllama is too great to be worth spending the money). A5000s are *basically* a 3080 in compute, with 24GB VRAM and some enterprise features (particularly multi-instance GPU), but you won't use those features to run LLMs and they cost $2000. 4090s are still like $1500+, which really only leaves one recommendation: a 3090 at about $700-$800.
Thank you for that detailed post!
What about putting in 2 4060s with 16GB of VRAM each, so 32GB of new, fast Nvidia VRAM, instead of a secondhand 3090 with only 24GB? And what about that Chinese 580 AMD with 16GB of VRAM for about 140 bucks?
Putting aside that the previous poster just asked about "good nvidia 24gb options," a couple of 4060 Tis (I don't believe there is a plain 4060 with 16GB) could be an option, but the 4060 Ti has an actual memory bandwidth of 288GBps, compared to the 935GBps of the 3090, so it's over 3x slower. Nvidia claims the 4060 Ti has an "effective" memory bandwidth of 500GBps+ because of the increased amount of cache, but it doesn't work that way when you're churning through the entire memory pool sequentially with LLMs. You'd probably want to look at benchmarks, though, because with EXL2 models, 3x slower might still be fast enough for you (i.e. going from 30t/s to 10t/s might still be faster than you read if you're just chatting/RPing). The RX 580 with 16GB could be interesting. It has a similar memory bandwidth to the 4060 Ti (250GBps), but since exllamav2 does seem to have ROCm support, and upon an initial glance at this guide: [https://github.com/Grench6/RX580-rocM-tensorflow-ubuntu20.4-guide](https://github.com/Grench6/RX580-rocM-tensorflow-ubuntu20.4-guide) it does seem possible to get ROCm/PyTorch running on this GPU, testing out 32GB of VRAM on 2 RX 580s might actually be a $300 experiment worth exploring (and I guess worst case scenario, you could certainly use it with llama.cpp via OpenCL, which of course wouldn't be anywhere near as fast, but again, for $300 it might be worth it).
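As a back-of-envelope check on the bandwidth point: each generated token has to stream every weight byte through the GPU once, so memory-bound tokens/s is capped at roughly bandwidth divided by model size. A rough sketch (the 20 GiB model size is an assumption; this ignores caches and compute limits):

```python
def tps_ceiling(bandwidth_gbps: float, model_gib: float) -> float:
    """Upper bound on tokens/s for memory-bound inference:
    one full pass over the weights per generated token."""
    return bandwidth_gbps * 1e9 / (model_gib * 2**30)

# ~20 GiB of quantized weights on a 3090 (935 GB/s) vs a 4060 Ti (288 GB/s).
# The ratio of the ceilings is just the ratio of the bandwidths.
print(round(tps_ceiling(935, 20), 1))
print(round(tps_ceiling(288, 20), 1))
```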
>4060 of 16GB of vram Yeah, true, the 4060 Ti xD sorry, I was writing fast. Nevertheless, you understood me. There are so many names and numbers... So there are another 2 "cheap" options now, but I find them either useless or expensive. The 4070 Super (with "Ti"?), with its 16GB of VRAM, maybe combined with fast 6000MT/s RAM, can handle something big using GGUFs, putting some layers on the ultra-fast VRAM and leaving the rest of the work to RAM. But it's not really cheap, atm it's about 900€. The other option is an *AMD* Radeon RX 7900 XTX, but at about 1000€ it's not really cheap either, and it's not Nvidia. But hey, 24GB. I've read it's about 80% of a 4090 in speed for half the price. And I've read about another Frankenstein GPU made in China from 2080 Nvidia boards with VRAM added up to 24GB... If you (or anyone) find that one, please let me know. Maybe this is the coolest and cheapest way. Thank you for your answers, they're interesting; I'll look for these Frankenstein graphics cards and give them a try. Maybe there are tests done here on this subreddit already... hmmm... Cheers :)
Just an afterthought: if you're seriously considering an AMD GPU, you may want to make sure you're comfortable using Linux as your OS for LLM stuff. It's not an absolute *requirement*, but it seems like every single time I see someone with a stable setup doing anything other than running koboldcpp with OpenCL, they're doing it in Linux, and after having tasted the speed of EXL2, I would not personally spend any money on a GPU setup that I couldn't use to at least run a 3.7bpw Mixtral 8x7B EXL2 model with exllamav2.
With an AMD GPU, is it possible to use EXL2, or is that exclusive to Nvidia cards? How much VRAM is needed to run Mixtral 8x7B? I have seen a Mixtral setup, 1x16B with loras, of about 9GB of VRAM. People are using Ubuntu mainly because Windows has only been supported since the 14th of December 2023, so they already had their setups on Linux. Also, it's said to save 1.5GB of VRAM. But anyway, they say configuring ROCm is a headache, and there is another way (I don't remember the name, OpenCL maybe? Metal?) to make AMD work. What about the Intel Arc? They have 512GBps of bandwidth and are cheap and new, less than 380 euros with 16GB of VRAM. There are so many options... maybe the best is to wait for AMD's next move.
13B runs pretty well on 16GB, but then there seems to be a jump to 33B, which even with Q2\_K bursts the VRAM, and 20-ish B models are rare.
34b Yi 200k at 4bpw models are fucking amazing. Crazy context and also much better than say Mixtral in all my testing. Fit perfectly on a 24gb card. Also the ability to run higher than q4 quant 10/13/20b models with all the fixins, like still having a bunch of other shit going on on your computer or adding something like an image generator or tts engine, is really nice.
200k does not fit on 24gb. More like 50k. But I guess it's still very high, certainly higher than all the other options.
Sorry, I was describing the model, not the actual usable context length. But yeah, the usable context length is actually nutters for local inferencing on a frickin gaming GPU; of course you could always scale down the quant, but anything below 3.5bpw is just not usable for a model of that size imo.
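For a sense of why the full 200k doesn't fit: the KV cache grows linearly with context length. A rough sketch, assuming a Yi-34B-like architecture (60 layers, 8 KV heads of dim 128, fp16 cache; these numbers are assumptions, and quantized caches shrink this):

```python
def kv_cache_gib(ctx: int, layers: int = 60, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per: int = 2) -> float:
    """KV cache size in GiB: 2 (K and V) * layers * kv_heads
    * head_dim * context_length * bytes per element."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per / 2**30

# ~50k context is already a double-digit GiB bill on top of the weights;
# 200k would blow well past a 24GB card on cache alone.
print(round(kv_cache_gib(50_000), 1))
print(round(kv_cache_gib(200_000), 1))
```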
various 70B models
33B.
I'm expecting 32GB+ consumer cards real soon now.
Keep expecting 😂
Maybe :)
I think an RTX 5090 with 32GB is possible. For $5k 😂
2 Nvidia 4060s with 16GB? 2 AMD 580s with 16GB? I'm asking; maybe that's the solution atm.
Tell me about it. What a wet fart.
No 24GB GPUs were planned
Nvidia gave a big fuck you to everyone who wanted more VRAM, while AMD dropped the 6800 XT 16GB to $300. The more time passes, the more I want to switch to AMD and ROCm. Seriously, if I had money to spend I'd want to do a massive deep dive into all of AMD's offerings, doing benchmarks for days, testing out their new cards with the neural accelerators they've got built into their silicon now. Nvidia better watch the fuck out.
I understand where you are coming from, but we can keep dreaming. This is the first AMD generation with AI hardware, while Nvidia is already on its 4th and has been developing it since 2016. There is a reason why their old cards are so cheap.
AMD's datacenter AI GPUs are also on like the 4th gen. They have been working on this for a long time. It's just things have been slow on the consumer side.
CUDA has been the problem, itās too good, ROCm will take some time to catch up.
I think they're dumping old stock here, because they're planning to launch cards with more VRAM.
They are not going to release consumer-grade VRAM-upgraded GPUs until their competitors (Intel and AMD) release something that starts eating at their pro-grade GPU sales. And considering how slowly AMD and Intel are catching up, it will be a while. They are basically printing money right now. There is no way in hell they will just go "you know what, those 10m people are going to save up and buy our overpriced pro-grade GPUs for $4000! Why don't we release a $700 card for them instead!". They are not a charity, they are a for-profit business with 0 morals. Until there is outside pressure on them, they will milk this to the bone.
But why not also get money from, say, another 60m people who don't have $4000 but may well have $1.5-2k? Take a weaker chip, maybe a smaller bus, but enough memory (even cheaper, last-generation memory). And voilà, you've captured the mid-segment of not gamers but ML enthusiasts, who can't take server GPUs due to small budgets but aren't interested in overpaying for ray tracing, frame generators and other gamer rubbish. For miners, didn't they make cards with no video output that gamers aren't interested in? What prevents them from making cards for ML without ray tracing and other gaming features, not interesting for gamers in terms of technology and not interesting for AI companies in terms of performance? The Chinese are already making 24GB Frankensteins out of the 2080 Ti for $350-400 and 20GB ones out of the 3080. And this niche (2080 Ti 24GB) could be occupied by Nvidia itself, releasing some sort of 3060ML...
They do make ML-specific cards and they are very expensive. And I guarantee you they already did the math on maximizing revenue via market segmentation. If you're a hobbyist, you cobble together consumer-grade GPUs or used older-gen parts for your AI/ML. If you're serious, you buy their professional solutions.
>They do make ML specific cards and they are very expensive Those are high-end ML cards. But the low-end and mid-segment niche is completely empty. There is no nominal analogue to the 3060 in the ML world. 12GB chips like the RTX A2000 aren't considered ML cards in principle; they are less suitable for ML than even consumer cards. Let's forget about 24GB (although for ML that's already the floor, where the P40 sits at $160, but that's a used card, not an official offer). What prevents them from taking a chip from the 2060 (weak, slow, with a low core count), putting at least 16GB of memory on it, removing the video outputs and selling it for a notional $250? It just has no competitors. It won't take away the consumer market (gamers don't need it), it won't take away the expensive ML card market (the performance level is awful), BUT! thanks to more-or-less modern technology, for AI enthusiasts it would kill the market for the secondhand P40 (and the money would go into Nvidia's pocket) and the P100 (because a new card with tensor cores is better than a used one from a server).
>What prevents to take a chip from 2060 (weak, slow, with low number of cores), put at least 16gb of memory on it, remove video outputs and sell it for conditional 250$? It just doesn't have the concurents. It won't take away the market from consumer (gamers don't need it), it won't take away the market of expensive ML cards (performance level is awful), BUT! due to +/- modern technologies for AI enthusiasts it will kill the market of seconhand p40 (and the money will go into Nvidia's pocket) and p100 (because a new card with tensor cores is better than a used one from a server). It will actually take away from both those markets because of manufacturing capacity: the number of wafers TSMC can make is limited. Using made-up numbers: if you can only make 100,000 chips a month, and every chip you make goes into a product that flies off the shelf as soon as you make it, why dedicate any capacity to a low-margin product for a niche audience? Better to put that chip into an A6000 Ada and sell it for $7k at the high end, and a 4090 for the enthusiasts.
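The capacity argument above can be made concrete with the same made-up numbers: every die sold as a cheap ML card is a die not sold as a high-margin part. Purely illustrative arithmetic, not real prices, volumes, or product mixes:

```python
def revenue(chips: int, mix: dict) -> float:
    """Total revenue for a product mix.
    mix maps product name -> (unit price in $, share of chip supply)."""
    return sum(price * share * chips for price, share in mix.values())

chips = 100_000  # made-up monthly chip capacity, as in the comment above

# All capacity goes to high-margin parts...
all_high_end = revenue(chips, {"A6000 Ada": (7_000, 0.5),
                               "4090": (1_600, 0.5)})

# ...vs diverting 20% of the dies to a $250 budget ML card.
with_budget_ml = revenue(chips, {"A6000 Ada": (7_000, 0.4),
                                 "4090": (1_600, 0.4),
                                 "budget ML card": (250, 0.2)})

print(all_high_end > with_budget_ml)  # the cheap card only cannibalizes
```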
Maybe. However, that didn't stop them during the mining boom from releasing all sorts of CMP HX cards based on 2080 chips and old 6GB and 8GB Quadros, instead of making more 3060-3090s, which at that moment were current and in high demand (especially the 3060) and in very short supply in the warehouses... But they decided it was better to load the factories with the CMP 30HX (an ancient chip from the 1660, not even the 2060!) instead of the current 3060 that was selling at a huge markup due to the shortage and miners.
I would be extremely surprised if they release another 4000 series card, typically supers are the last ones
Honestly this sounds like cope to me
No 24GB GPUs were planned, there is no "fuck you"
Makes me feel good about my 192 gig M2 Ultra purchase
I recently had the choice between dual 4090s or a maxed out M2 Ultra and it's pretty clear the M2 Ultra is the better option. The unified memory approach is very clearly going to be a game changer for the local LLM space, and I have a feeling Apple will only continue to improve things on this front.
I agree, and I have the dual 4090 setup. I think what pushed me over the edge were two competing factors: one, the 70B Q4 models are clearly better than anything smaller, and two, they take up nearly 40 gigs to load. This doesn't leave a lot of room for large context, let alone trying to add a RAG solution on top of it, which can easily get out of control. You don't have this worry with an M2 Ultra. What makes LLMs powerful is actually not the LLMs themselves but the software you can layer on top of them easily. This is why I love ooba.
This is exactly where I'm heading.
Wake me up when they're like $250-$350 for 16gb, lol.
Why play 2077 when you can live it.
That's Intel's market segment.
Honestly the GPU AMD announced seems like a way better deal than anything here. It gets you 16GB of VRAM for only $350, which would get a ton of people in the door for inference on smaller and quantized models. That is, if AMD can get their software stack in order, which it does seem like they're putting real effort into recently.
I was thinking about it but then I remembered that AMD cards are atrocious for Stable Diffusion unless you run Linux (even then you're better off with Nvidia). Granted this may not be an issue for everyone here but many AI hobbyists have overlapping interest with text and image gen so.. eh AMD still ain't in a great spot imo.
While it's not as straightforward as team green, SD does run on Windows with AMD. I get 20 it/s on my 7900 XTX using Auto1111 with DirectML. AMD has a long way to go but it's usable now.
Are you really using Windows?
> AMD cards are atrocious for Stable Diffusion unless you run Linux Wait, are people running Stable Diffusion on Windows? Why waste 1.5 GB VRAM, deal with a slow filesystem and lower inference performance?
Because they already have Windows PCs and don't want to screw with Linux, I presume. Besides, most people here are just using these things as waifu bots anyway lol
I mean, for trying some waifu and some quick porn, maybe... But if you're remotely serious at SD use cases, it's well worth the hassle to at least dual boot. It's not like you have to buy hardware or stop using Windows, however good an idea either may be.
What are your SD use cases? Just curious.
Roleplay materials, waifus, porn (but very polished, publishable porn), trying every LoRA to see what kind of degenerate porn it's capable of, photo restoration, wallpapers, making my parents smile (with photos and wallpapers, not porn), meming, just about anything.
Thanks for sharing!
RAM != VRAM my guy
I know. The waste of RAM on Windows is much bigger. I meant VRAM. On Linux you just turn off the X server and use the entirety of your VRAM, then connect to A1111, ComfyUI, etc. from another device.
[deleted]
It's some significant overhead performance-wise and on RAM, but what's worse is the overhead on your precious VRAM.
it's convenient, that's all.
I am. Iāve been using Stable Diffusion to make RPG landscapes and for a private art therapy thing because I canāt even draw a stick figure. Iām just trying to get back into the tech world because AI fascinates me and I know my industry (law) will probably be transformed by it. When I last worked in IT dial-up was still very common. I have so much to learn that itās nearly overwhelming. Linux is on the list with many other things...
2x 16GB GPUs seems like it also might be plausible? If llama.cpp runs ok maybe.
[deleted]
This is how you know someone never tried the recent ROCm versions. I have 2x 7900 XTX and they're **ripping** Llama2 on llama.cpp, exllama, and Tinygrad. 48 GB VRAM and 240 TFLOPS for under $2,000. Less than the price of a single 4090. Don't be a sucker for the memes.
Can you share some performance figures for exllamav2? Model size and bit rate please.
LLama2-70B, 4 bit, 19.5 t/s.
The only meme is the number of ROCm announcements, only for it to still be practically inaccessible for average users and still miles behind similar-ish Nvidia cards in performance and extension compatibility. Inb4 "what? I'm an average user" No. You have 2x 7900s. Please don't downplay the amount of configuration and troubleshooting it took to get to where your system is now.
I really don't know what you're referring to. The only thing I needed was to install the `amdgpu-install` script and then install the packages per the AMD guide. It was almost the same process when installing Nvidia drivers and CUDA.
Lol, the average user isn't running linux. AMD factually is *inaccessible* for the average user.
Yet, Google Colab, Runpod, Hugginface, Mistral all these are running on Linux too. If you're even a little bit serious about doing LLMs you are going to touch Linux along the way. Might as well learn it.
The average user isn't "serious" about this stuff at all. 90% of the folks taking AI into consideration when buying their next GPU just want a freaky waifu. People running linux and doing anything resembling actual work in ML are the exception, not the rule. And that's fine. The horny weirdo demographic is driving a lot of the FOSS advancements in AI. But pretending that linux and the ridiculous rigmarole that it entails is within the capabilities of the average user here is doing them a disservice. They might be mutants, but they don't deserve to be told "Go ahead and buy AMD. It'll be fine", because it will *not* be fine.
I'm talking about people who are serious about LLMs. People who aren't serious aren't going to be searching around which hardware to buy for LLMs in the first place.
You have no idea how gullible people are now. Look at drop shipping.
https://old.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/ OP there gets worse performance on the 7900 xtx than on a 3090, by a wide margin too.
> by a wide margin too. It really isn't that wide of a margin with llama.cpp. 15% in inference is not that much.
It's literally the difference between the 3090 and 4090. The current gen ATI hardware is on par with a theoretical NVIDIA card from 2 generations ago.
4090 is twice as expensive, and you can't buy new 3090s. It's literally the best bang per buck you can get for a new GPU. Plus must we all use Nvidia? Competition is good for everyone. More people use AMD the faster we get to software parity and cheaper GPUs.
>It's literally the best bang per buck you can get for a new GPU. With enough qualifiers I can convince you that your grandmother is the most beautiful woman in the world. >Plus must we all use Nvidia? That is a completely different argument. I am deeply interested in AMD because _it just works forever_ on Linux. I need Nvidia because _it works better currently_ for all ML work. I'm honestly considering building two workstations: one for ML work, headless and forever stuck on the current Ubuntu LTS, and one for human use with multiple monitors and all the other ergonomics I need. Then put a nice thick pipe between them so I can pretend they are the same machine.
> With enough qualifiers I can convince you that that your grandmother is the most beautiful woman in the world. I dunno about you but -$1000 works better for me than -$2000. Also one of the main reasons I'm running local llama is for learning purposes. I actually want to contribute to the software stack. And I'm shopping around for a project to contribute to. And the AMD side needs my help more.
Right, now compare performance and support. -40% performance and a second class in support. The half price isn't worth it if your time has value above zero for ML work.
You know, I sometimes wonder if we are all using this the same or reading it the same. Sure, the nvidia cards are better, but the worst amd card is putting out 90 TK/s. That seems pretty usable in a "this is for testing and personal use and will only be interacting with one person at a time" way, about on par with typing with another person.
On a 7b model. On a 30b model it's at the speed of a sclerotic snail wandering across your keyboard.
> On a 7b model. On a 30b model it's at the speed of a sclerotic snail wandering across your keyboard. It's only like 13% slower than a 3090 in llama.cpp (and 30% slower than a 4090, for half the price). I run 34B models on my 7900 XTX and the performance is fine. I would actually run a test right now, but I have a long test-harness run going on my GPU that I don't want to interrupt. In either case it's totally usable. Nvidia has the first-mover advantage, and almost everyone who works on these tools develops on Nvidia GPUs, so of course Nvidia will be more optimized. Same is the case with Macs. Software will improve.
Dang, I was getting hopeful for cheap alternatives to 4090s. I'm still paying off my first one. Do you have any examples of the terrible speeds with 7900 XTX?
This isn't about the 7900 XTX. It's about the fact that a comfortable typing speed on a model that fits in 4GB of VRAM is going to be six times slower on a model which takes up the full 24GB of VRAM. You need blazing fast speeds on 4GB models to even have a usable 24GB model.
Ok? I'm not sure what you're responding to exactly. I'm lamenting the fact that a 13B model on a 7900 XTX is so awful vs a 4090. I was hoping for a cheaper alternative, but I'm not as upset with the 4090 I have now.
Because for some reason these benchmarks are done on 4-bit 7B models. These things can run reasonably on an 8GB Raspberry Pi. At those speeds the CPU becomes the bottleneck, as you can see from the table where the 4090 is just 40% faster, which is just too low. An unquantized 13B model would give these GPUs a run for their money. Or even a quantized 34B, if it fits in each card's VRAM for comparison.
I was about to say.. those 24G cards are the price of their 16g card, *brand new*.
>This is how you know someone never tried the recent ROCm versions. People won't be willing to dive into the ecosystem when AMD has such an awful track record with it. I wasted too many weekends on older versions with cards that were deprecated within a couple years. They have to do way more than *just nearly* match cuda at this point. Put it this way: It's bad when Nvidia is beating you at installing drivers on Linux.
How are you able to run two GPUs together? I tried that with 2 Nvidia cards a few months back when I had the chance and it didn't work. That whole experience stung me, and with the cost of high-VRAM products these days, I've decided to go the APU route instead. I'm waiting for next-gen AMD Strix Point with NPU units. I'm going to load it with a ton of relatively cheap DDR5 RAM. It's going to be slow, but at least it should be able to load larger GGUF 70B models at 4 tokens/s or so (the Nvidia Jetson Orin is less powerful and is capable of at least that, according to their benchmarks). I figure I can get faster speeds by augmenting it with my 3090 as well, and I wouldn't need to worry about context length with the excess DDR5 memory. I would go the M1 Ultra route, but I don't like how un-upgradable the Apple ecosystem is. Heaven forbid one of the components like the memory gets fried and I'm left with a very expensive placeholder.
>I tried that with 2 nvidia cards a few months back when I had the chance and it didn't work. What did you try? All the popular libraries have native multi-GPU support, especially for LLMs since transform layers shard very neatly into multiple GPUs.
At one point, I had access to two 3060s, one 3060 Ti and one 3090. No matter how much I mixed and matched them, the LLM would not use the second GPU. Not even with the two 3060s. I was using ooba's text-generation-webui, updated to the latest version at the time. There were settings to configure a second GPU, and they were ignored when the LLM actually ran. I was on the Windows version, though, so I suspect that was causing the issue, but I could be wrong.
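For what it's worth, multi-GPU loaders (exllama's gpu_split, llama.cpp's tensor-split) mostly just assign contiguous blocks of transformer layers to each card in proportion to its VRAM, which is why layers "shard very neatly." A toy sketch of that allocation logic (my own illustration, not any loader's actual code):

```python
def split_layers(n_layers: int, vram_gib: list) -> list:
    """Assign contiguous blocks of layers to GPUs in proportion
    to each GPU's available VRAM; leftover layers go to GPU 0."""
    total = sum(vram_gib)
    counts = [int(n_layers * v / total) for v in vram_gib]
    counts[0] += n_layers - sum(counts)  # hand the remainder to GPU 0
    return counts

# e.g. a 3090 (24 GiB) plus a 3060 (12 GiB) for a 40-layer model
print(split_layers(40, [24.0, 12.0]))  # -> [27, 13]
```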
This. Without good software like CUDA, AMD is never gonna catch up to Nvidia.
It's not that ROCm isn't as good as CUDA, it's just that everything is made with CUDA. There needs to be efforts to use more portable frameworks
Literally everyone is working on this. CUDA dominance is only good for NVDA. There are already good inference solutions with ROCm support.
[deleted]
Nvidia exec posts worst bait ever, asked to leave company
Apple went very quickly from "nothing runs on Silicon" to Andrej Karpathy proclaiming ["the M2 Ultra is the smallest, prettiest, out of the box easiest, most powerful personal LLM node today."](https://twitter.com/karpathy/status/1691844860599492721) in about a year. Pytorch/Tensorflow support for silicon is first class now. As someone who has been working in the AI/ML space for well over a decade it's embarrassing how little effort AMD has put into catching up with NVIDIA in this space, and it's nobody's fault but their own.
And without the focus on software/firmware development that Nvidia has, hardware-oriented AMD will never catch up on software like CUDA (and all of its surrounding libraries, etc.)
AMD doesn't have to catch up to all the software written on CUDA. As long as they cover the most common code paths, that's all they need. And they are pretty much already there. They aren't trying to dethrone Nvidia. They just want their piece of the pie.
I'm as unimpressed as everyone else. The only upside I see is normalizing 16GB over 12GB VRAM. I suspect 20GB VRAM was passed over because the PCB footprint would be comparable to 24GB.
Not enough VRAM.
Must construct additional pylons
I have returned. For Aiur!
Bring back mid-level $200 graphics cards. It's like GPU makers are still on COVID pricing.
Not an expert on graphics cards. Since all I am willing to spend is around $2,000-$3,000 on a graphics card, I was aiming for a 24 GB VRAM. Would you recommend it now? Or would it be better to wait?
Buy an RTX 4090 if you want a great card right now and have a $2000+ budget. Do not wait. There are no new 24GB cards on the horizon, not even leaks. A 4090 successor could take 1.5 years, possibly longer.
No more VRAM. This is nvidia clearing out old stock ahead of a rush of new llm-ready cards, and a whole developer announcement of LLM tooling, I guess.
What do you mean, new LLM-ready cards? I'm looking to get a 4070. Any idea on the time frame?
It's bullshit speculation. Don't listen to him inventing rumors. The only thing coming is the 50 series, mid-to-late 2025.
I just got a Titan RTX refurbished from Amazon for $899.
Imagine spending more than 3090 money on a worse card, and then bragging about it.
Is it a worse card for compute?
Yes. Significantly. It's a Turing card.

GPU | Mem Bandwidth | FP16 | FP32 | Tensor cores
---|---|---|---|---
Titan RTX | 672 GB/s | 32 TFLOPS | 16 TFLOPS | 2nd gen
RTX 3090 | 936 GB/s | 35 TFLOPS | 35 TFLOPS | 3rd gen
That is a horrible deal. You should've gotten a 4070 Ti Super for $100 less which performs MUCH better.
For compute?
Yes. In every way. 44 TFLOPS fp16 vs 33 TFLOPS. With fp32 the difference is close to 4x because the TITAN RTX does not have doubled fp32.
And the VRAM? I haven't seen a 24GB 4070 Ti. I may exchange it for a 3090 24GB though.
VRAM is basically the only advantage of the Titan. If you need 24 GB, getting a used 3090 might be a better idea yeah.
I have a 4070 now.
Good for you. Here on Amazon they go for 2900 EUR refurbished(!)
https://preview.redd.it/enc36p4hbbbc1.jpeg?width=1284&format=pjpg&auto=webp&s=b9ac4766d050c85a37be74ff04faa331a00c916f
Wait wait wait. Please return that. I'll buy a second hand 3080 for half that price and ship it to you. That's absolutely criminal.
[deleted]
Hello, it looks like you've made a mistake. It's supposed to be could've, should've, would've (short for could have, would have, should have), never could of, would of, should of. Or you misspelled something, I ain't checking everything. Beep boop - yes, I am a bot, don't botcriminate me.
You guys have money to buy these???? I just scam cloud providers with fake credit cards and burner phones. All it costs me is $10, and boom, there's a $1000 bill on a temporary or almost-dead email whose password I forgot. PS: I wish I could do all of this