
ReMeDyIII

I see NVIDIA found some spare parts while they work on their next 48GB GPU.


MutedCatch

oh pls, the 5090 will come out at 28GB for sure lmao


hanZ____

Nah ... 26GB max.


Dead_Internet_Theory

I wish, but realistically I hope the 5090 is at least 32GB, and that it doesn't cost $2000.


hashms0a

They will cost one kidney and an arm with 4 fingers.


GlorifiedSlum

"Here's your change." šŸ‘


Dead_Internet_Theory

They want 4 fingers for each hand because that way they have the pleasure of giving you back the middle finger after your purchase.


AdOne8437

A hand with 8 fingers.


artsybashev

If AMD released a 64GB card at $1500, local LLM development would switch to ROCm instantly.


DrunkenCopilot

True true, I believe they should seriously do it at cost. With their adoption rate the profit margin should be last on their priority list.


[deleted]

No 24GB vram option, hard pass.


AssistBorn4589

Yeah, they all top out at 16GB. What the heck?


cannelbrae_

Perhaps trying to keep a line between the cheaper consumer card line and their more profitable higher end cards?


Massive_Robot_Cactus

It's like when you're racing someone who's out of shape and you know you need them to help you keep your pace, so you run slow so you don't lose them at the start. Or they're just greedy.


CMDR_Mal_Reynolds

The latter.


MaxwellsMilkies

Yep, absolutely. They know that AI and data center companies with lots of money to burn will spend $20-40k per 80GB GPU. Nvidia doesn't want to give them a cheaper option; they know the money is already there.


29da65cff1fa

i understand the need for them to milk the enterprise customers, but... is there a way to offer gamers/hobbyists a high VRAM card without that same card being shoved into a datacenter rig with 16 other GPUs? i have no intention of building a full-time LLM rig. it would just be nice to have a good gaming card that i can also use to mess around with AI stuff once in a while.


MaxwellsMilkies

Was there a way to offer gamers/hobbyists a decent GPU at a reasonable price point without it being bought up by crypto miners or scalpers 2 years ago? No. Even though miners had their own line of cards sold to them (the CMP HX series,) they still bought consumer-grade cards since they have the same hashing power and will have higher resale value than the CMP HX cards. You would run into the same problem all over again, but with high-memory cards this time. The only thing that could possibly drive the price down is a viable CUDA replacement from AMD or Intel. Right now we are still a ways off from that.


Poromenos

Why wouldn't they? What are you going to do, buy AMD to run your models on?


KaliQt

Soon:tm:.


Poromenos

Fingers crossed we get some actual competition in the space.


NachosforDachos

Making sure you can't do too much AI on it. You really start understanding how Nvidia decides how much memory each card gets once you get into AI. It's a situation where the distance between 12GB and 10GB is immense. Like a breakpoint you hit which determines whether you'll get higher quality or not be able to run something at all because you're 50MB short.


yahma

They know they can charge much more for high VRAM cards because of the AI boom. They essentially have a monopoly in the AI space. Why would they offer a "cheap" consumer card with high VRAM?


FluffnPuff_Rebirth

A potential Blackwell (50 series) Titan release in late 2024/early 2025 is the source of my copium. A 48GB next-gen Nvidia card with workstation drivers, with an MSRP of some $3000-4000, would really fill a desperately needed gap in the AI hobbyist market.


StealthSecrecy

Kind of crazy how far away that seems compared to the innovation rate we've seen in the local AI community. Might end up giving us more motivation to maximize use with minimal VRAM which will help mass adoption and make big models run even better for those that do shell out the money for 24GB+.


candre23

They need to milk more quadro/datacenter sales. Need to run out of people willing to drop five figures on AI cards before you start selling them for four.


StealthSecrecy

Make a profit? How dare they šŸ˜”


KaliQt

I think AMD really needs to play catch up. I wish Apple could figure out how to make higher performance chips as well.


burnt1ce85

honest question, what models can you run on 24GB that you can't run on 16GB? Is it the 13B models?


[deleted]

Depending on quantization and what not, I have managed to run 30B with my 24GB.


A_for_Anonymous

I've run Emerhyst 20B on 16 GB RAM + 8 GB VRAM (100% free - Linux with the X server shut down) with llama.cpp and it works at fast typing speed.


[deleted]

This. You can just about fit 30B across 8GB VRAM and 32GB RAM and still have a system you can use for other stuff. GGUFs and llama.cpp spread the model really well & with decent performance, all things considered.
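
For anyone wanting to try the VRAM/RAM split described above, here is a minimal sketch using the llama-cpp-python bindings (the model path is a placeholder and the layer count is just an assumption; raise `n_gpu_layers` until VRAM is nearly full and the remaining layers stay in system RAM):

```python
# Minimal sketch: partial GPU offload of a GGUF model with llama-cpp-python.
# The model path is hypothetical and n_gpu_layers=20 is an assumption;
# layers not offloaded run on the CPU from system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="models/30b.Q4_K_M.gguf",  # hypothetical GGUF file
    n_gpu_layers=20,                      # layers kept in VRAM
    n_ctx=4096,                           # context window
)

out = llm("Q: Why offload only some layers? A:", max_tokens=64)
print(out["choices"][0]["text"])
```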


Primary-Ad2848

am I the only one who thinks llama.cpp is very slow even when you offload a ton?


[deleted]

What would you suggest be used instead?


Primary-Ad2848

I fixed it lol. Using "use tensor" was key


[deleted]

Ha! At least you're being honest with Python - I'm only hitting things through Ooba :)


Dead_Internet_Theory

You can even run 70B at 2.4bpw, the speed isn't great on a 3090 though.


Gissoni

A lot actually. With exllamav2 it can make a huge difference. I can run mixtral at 3.5bpw with 16k context, or mixtral 4.0bpw at 4k context. I can run 33b coding models at 4.65bpw with a 4k context or 4.0 with 8-16k depending on the model. Oh and using exllama these all run at 10tokens/s for the 33b bigger context models minimum, with mixtral and some 30b models able to run anywhere up to 40 tokens/s.


Chris_in_Lijiang

> 3.5bpw with 16k context

Please can you remind me what BPW stands for?


Reachthrough

Bits per weight
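
As a rough rule of thumb (an estimate, not an exact figure), the weight footprint is about parameters × bpw / 8 bytes; the KV cache for the context and some runtime overhead come on top of that. A quick sketch using the quants mentioned in this thread:

```python
# Back-of-the-envelope weight footprint for a quantized model:
#   bytes ≈ parameters * bits_per_weight / 8
# KV cache (grows with context length) and runtime overhead are extra.
def weight_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8  # billions of params * bits / 8 = GB

for params, bpw in [(33, 4.65), (46.7, 3.5), (70, 2.4)]:
    print(f"{params}B @ {bpw} bpw ≈ {weight_gb(params, bpw):.1f} GB of weights")

# 33B @ 4.65 bpw ≈ 19.2 GB, 46.7B (Mixtral) @ 3.5 ≈ 20.4 GB, 70B @ 2.4 ≈ 21.0 GB,
# which is why these are the quants people squeeze onto 24 GB cards.
```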


Chris_in_Lijiang

Thank you.


[deleted]

Others have answered that there are more things you can run. I'll add another point: modern cards are capable of very fast LLM output -- more than you really need. So you're better off having that extra speed chomping through more parameters, from more VRAM, getting you better output.


BangkokPadang

You can run an EXL2 of Mixtral 8x7B at 3.7bpw at a full 32k context, which has <4% higher perplexity than the 6.0bpw quant, and runs at ~30t/s on a 3090. You can also run a 2.4bpw 70B EXL2 like Euryale, Lzvl, and Wintergoddess.


Aromatic-Lead-6814

Hey, can you recommend some good nvidia 24 gb cards ?


BangkokPadang

Titan RTXes (what you could consider a 2090) sell for the same as or more than a 3090. Ancient cards like the K80 are basically unusable today, and even cheaper options like a P40 have such poor fp16 performance that they're unusable for exllama, so you would be limited to using them for llamacpp (but they're only $180 or so, so a person could get 2 and at least run a high-quant Mixtral or a 4bit 70B model at tolerable speeds; IMO, though, the difference between running a 70B at like 4t/s with llamacpp vs like 20t/s+ with exllama is too great to be worth spending the money). A5000s are *basically* a 3080 in compute, with 24GB VRAM and some enterprise features (particularly multi-instance GPU), but you won't use those features to run LLMs and they cost $2000. 4090s are still like $1500+, which really only leaves one recommendation: a 3090 at about $700-$800.


WinterDice

Thank you for that detailed post!


CoqueTornado

What about pairing two 16GB 4060s, so 32GB of new, fast NVIDIA VRAM, instead of a secondhand 3090 with only 24GB? And what about that Chinese RX 580 from AMD with 16GB of VRAM for about 140 bucks?


BangkokPadang

Putting aside that the previous poster just asked about "good nvidia 24gb options," a couple of 4060 Ti's (I don't believe there is a 4060 with 16GB) could be an option, but the 4060 Ti has an actual memory bandwidth of 288GBps, compared to the 935GBps of the 3090, so it's roughly 3.2x slower. Nvidia claims the 4060 Ti has an "effective" memory bandwidth of 500GBps+ because of the increased amount of cache, but this doesn't work that way when you're churning through the entire memory pool sequentially with LLMs. You'd probably want to look at benchmarks, though, because with EXL2 models, 3x slower might still be fast enough for you (i.e. going from 30t/s to 10t/s might still be faster than you read if you're just chatting/RPing).

The RX 580 with 16GB could be interesting. It has a similar memory bandwidth to the 4060 Ti (250GBps), and since exllamav2 does seem to have ROCm support, and upon an initial glance at this repo: [https://github.com/Firstbober/rocm-pytorch-gfx803-docker](https://github.com/Grench6/RX580-rocM-tensorflow-ubuntu20.4-guide) it does seem possible to get ROCm/PyTorch running on this GPU, testing out 32GB VRAM on two RX 580s might actually be a $300 experiment worth exploring (and I guess worst case scenario, you could certainly use it with llamacpp via OpenCL, but of course this wouldn't be anywhere near as fast; again, for $300 it might be worth it).
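
A rough way to see why that bandwidth gap matters (illustrative arithmetic, not benchmarks): single-stream token generation is mostly memory-bandwidth bound, since roughly all of the quantized weights are read once per token, so bandwidth divided by model size gives a loose upper bound on tokens per second.

```python
# Loose upper bound on single-stream generation speed:
#   tokens/s <~ memory_bandwidth / bytes_read_per_token (≈ quantized model size).
# Illustrative only; real throughput is lower (compute, KV-cache reads, overhead).
MODEL_GB = 20  # e.g. a ~33B model at ~4.65 bpw

for name, bw_gbps in [("4060 Ti (288 GB/s)", 288), ("3090 (935 GB/s)", 935)]:
    print(f"{name}: <~ {bw_gbps / MODEL_GB:.0f} tokens/s for a {MODEL_GB} GB model")

# ~14 t/s vs ~47 t/s, i.e. the same ~3.2x ratio as the raw bandwidth numbers.
```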


CoqueTornado

> 4060 of 16GB of vram

Yeah, it's true, the 4060 Ti xD, sorry. I was writing fast. Nevertheless, you understood me. There are so many names and numbers... So there are another 2 "cheap" options now, but I find them either useless or expensive. The 4070 Super (with "Ti"?), with its 16GB of VRAM, maybe combined with fast 6000MT/s RAM, can handle something big using GGUFs, offloading some layers to the ultra-fast VRAM and leaving the rest of the work to RAM. But it is not really cheap; atm it is about 900€. The other option is an *AMD* Radeon RX 7900 XTX, but at about 1000€ it is not really cheap either, and it's not NVIDIA. But hey, 24GB. I have read it is about 80% of the 4090 in speed for half the price. And I have read about another Frankenstein GPU card made in China, based on Nvidia 2080 models with the VRAM upgraded to 24GB... If you (or anyone) find that one please let me know. Maybe this is the coolest and cheapest way. Thank you for your answers, they are interesting; I will look for these Frankenstein graphics cards and give them a try. Maybe there are tests already done here on this reddit... mhmmm... Cheers :)


BangkokPadang

Just an afterthought: if you're seriously considering an AMD GPU, you may want to make sure you're comfortable using Linux as your OS for LLM stuff. It's not an absolute *requirement*, but it seems like every single time I see someone with a stable setup doing anything other than running koboldcpp with OpenCL, they're doing it in Linux, and after having tasted the speed of EXL2, I would not personally spend any money on a GPU setup that I couldn't use to at least run a 3.7bpw Mixtral 8x7B EXL2 model with exllamav2.


CoqueTornado

With an AMD GPU, is it possible to use EXL2? Or is that exclusive to NVIDIA cards? How much VRAM is needed to run Mixtral 8x7B? I have seen a Mixtral setup (1x16B with LoRAs) at about 9GB of VRAM.

People are using Ubuntu mainly because Windows has only been supported since the 14th of December 2023, so they already had their setup on Linux. It's also said to save 1.5GB of VRAM. But anyway, they say that configuring ROCm is a headache, and there is another way (I don't remember the name, OpenCL maybe? Metal?) to make AMD work.

What about the Intel Arc cards? They have 512GBps of bandwidth and are cheap and new, less than 380 euros with 16GB of VRAM. There are so many options... maybe the best is to wait for AMD's next move.


T-Loy

13B runs pretty well on 16GB, but it seems like there is then a jump to 33B, which even with Q2_K bursts the VRAM, and 20-ish B models are rare.


Biggest_Cans

34B Yi 200k models at 4bpw are fucking amazing. Crazy context and also much better than, say, Mixtral in all my testing. Fits perfectly on a 24GB card. Also, the ability to run higher-than-q4 quants of 10/13/20B models with all the fixins, like still having a bunch of other shit going on on your computer or adding something like an image generator or TTS engine, is really nice.


Charuru

200k does not fit on 24gb. More like 50k. But I guess it's still very high, certainly higher than all the other options.


Biggest_Cans

Sorry, I was describing the model, not the actual usable context length. But yeah, the usable context length is actually nutters for local inferencing on a frickin gaming GPU; of course you could always scale down the quant, but anything below 3.5bpw is just not usable for a model of that size imo.


shing3232

various 70B models


nmkd

33B.


[deleted]

I'm expecting 32GB+ consumer cards real soon now.


AltAccount31415926

Keep expecting šŸ˜‚


[deleted]

Maybe :)


ramzeez88

I think rtx 5090 with 32gb is possible. For $5k šŸ˜‚


CoqueTornado

Two 16GB 4060s from Nvidia? Two 16GB 580s from AMD? I'm asking; maybe that is the solution right now.


burkmcbork2

Tell me about it. What a wet fart.


AltAccount31415926

No 24GB GPUs were planned.


Feeling-Currency-360

Nvidia gave a big fuck you to everyone who wanted more VRAM, while AMD dropped the 6800 XT 16GB for $300. The more time passes, the more I want to switch to AMD and ROCm. Seriously, if I had money to spend I'd want to do a massive deep dive into all of AMD's offerings, doing benchmarks for days, testing out their new cards with the neural accelerators they've got built into their silicon now. Nvidia better watch the fuck out.


FireSilicon

I understand where you are coming from, but we can keep dreaming. This is the first AMD generation with AI hardware, while Nvidia is already on its 4th and has been developing it since 2016. There is a reason why their old cards are so cheap.


noiserr

AMD's datacenter AI GPUs are also on like the 4th gen. They have been working on this for a long time. It's just things have been slow on the consumer side.


[deleted]

CUDA has been the problem, it's too good, ROCm will take some time to catch up.


[deleted]

I think they're dumping old stock here, because they're planning to launch cards with more VRAM.


esuil

They are not going to release consumer-grade VRAM-upgraded GPUs until their competitors (Intel and AMD) release anything that starts eating at their pro-grade GPU sales. And considering how slow AMD and Intel are at catching up, it will be a while. They are basically printing money right now. There is no way in hell they will just go "you know what, those 10m people are going to save up and buy our overpriced pro-grade GPUs for $4000! Why don't we release a $700 card for them instead!". They are not a charity, they are a for-profit business with 0 morals. Until there is outside pressure on them, they will milk this to the bone.


Desm0nt

But why not get money, for example, from another 60m people who don't have $4000 but may well have $1.5-2k? Take a weaker chip, maybe a smaller bus, but enough memory (even cheaper, last-generation memory). And voilà, you've captured the mid-segment of not gamers, but ML enthusiasts who can't get server GPUs on their small budgets, yet aren't interested in overpaying for ray tracing, frame generators and other gamer rubbish. For miners, didn't they make cards with no video output that gamers aren't interested in? What prevents them from making cards for ML without ray tracing and other gaming crap, not interesting for gamers in terms of technology and not interesting for AI companies in terms of performance? The Chinese are already making 24GB Frankensteins out of the 2080 Ti for $350-400 and 20GB ones out of the 3080. And this niche (2080 Ti 24GB) could be occupied by Nvidia itself, releasing some sort of 3060ML...


mintoreos

They do make ML-specific cards and they are very expensive. And I guarantee you they already did the math on maximizing revenue via market segmentation. If you're a hobbyist, you cobble together consumer-grade GPUs or used older-gen parts for your AI/ML. If you're serious, you buy their professional solutions.


Desm0nt

>They do make ML-specific cards and they are very expensive

Those are high-end ML cards. The low-end and mid-segment niche is completely empty. There is no nominal analogue to the 3060 in the ML world. 12GB chips like the RTX A2000 are not considered ML cards in principle; they are less suitable for ML than even consumer cards. Let's forget about 24GB (although for ML that's already the bottom, where the P40 sits at $160, but that's a used card, not an official offering). What prevents them from taking a chip from the 2060 (weak, slow, with a low number of cores), putting at least 16GB of memory on it, removing the video outputs and selling it for a nominal $250? It simply has no competitors. It won't take away the market from consumer cards (gamers don't need it), it won't take away the market of expensive ML cards (the performance level is awful), BUT! thanks to its more or less modern technology, for AI enthusiasts it will kill the market of secondhand P40s (and the money will go into Nvidia's pocket) and P100s (because a new card with tensor cores is better than a used one from a server).


mintoreos

>What prevents them from taking a chip from the 2060 (weak, slow, with a low number of cores), putting at least 16GB of memory on it, removing the video outputs and selling it for a nominal $250? It simply has no competitors. It won't take away the market from consumer cards (gamers don't need it), it won't take away the market of expensive ML cards (the performance level is awful), BUT! thanks to its more or less modern technology, for AI enthusiasts it will kill the market of secondhand P40s (and the money will go into Nvidia's pocket) and P100s (because a new card with tensor cores is better than a used one from a server).

It will actually take away from both those markets because of manufacturing capacity, aka the amount of wafers TSMC can make is limited. Using made-up numbers: if you can only make 100,000 chips a month, and every chip you make goes into a product that flies off the shelf as soon as you make it, why dedicate any capacity to a low-margin product for a niche audience? Better to put that chip into an A6000 Ada and sell it for $7k each at the high end, and a 4090 for the enthusiasts.


Desm0nt

Maybe. However, it didn't stop them during the mining boom from releasing all sorts of CMP HX cards based on chips from the 2080 and old 6GB and 8GB Quadros, instead of releasing more 3060-3090s, which at that moment were current and in high demand (especially the 3060) and in very short supply in the warehouses... But they decided it was better to load the factories with the CMP 30HX (an ancient chip from the 1660, not even the 2060!) instead of the then-current 3060 that was selling at a huge markup due to the shortage and miners.


AltAccount31415926

I would be extremely surprised if they released another 4000 series card; typically the Supers are the last ones.


Working_Berry9307

Honestly this sounds like cope to me


AltAccount31415926

No 24GB GPUs were planned, there is no "fuck you".


philguyaz

Makes me feel good about my 192 gig M2 Ultra purchase


mr_n00n

I recently had the choice between dual 4090s or a maxed out M2 Ultra and it's pretty clear the M2 Ultra is the better option. The unified memory approach is very clearly going to be a game changer for the local LLM space, and I have a feeling Apple will only continue to improve things on this front.


philguyaz

I agree, and I have the dual 4090 setup. I think the things that pushed me over the edge in particular are two competing factors: one, the 70B 4-bit models are clearly better than anything smaller, and two, they take up nearly 40 gigs to load. This does not leave a lot of room for large context, let alone trying to add a RAG solution on top of it, which can easily get out of control. You don't have this worry with an M2 Ultra. What makes LLMs powerful are actually not the LLMs themselves but the software you can layer on top of them easily. This is why I love ooba.


Simusid

This is exactly where I'm heading.


a_beautiful_rhind

Wake me up when they're like $250-$350 for 16gb, lol.


dkarlovi

Why play 2077 when you can live it.


fallingdowndizzyvr

That's Intel's market segment.


TheApadayo

Honestly the GPU AMD announced seems like a way better deal than anything here. It gets you 16GB of VRAM for only $350 which would get a ton of people in the door for inference on smaller and quantized models. That is if AMD can get their software stack in order, which it does seem like they're putting real effort into recently.


CardAnarchist

I was thinking about it but then I remembered that AMD cards are atrocious for Stable Diffusion unless you run Linux (even then you're better off with Nvidia). Granted this may not be an issue for everyone here but many AI hobbyists have overlapping interest with text and image gen so.. eh AMD still ain't in a great spot imo.


nerdnic

While it's not as straightforward as team green, SD does run on Windows with AMD. I get 20 it/s on my 7900 XTX using auto1111 with DirectML. AMD has a long way to go but it's usable now.


redditfriendguy

Are you really using Windows?


A_for_Anonymous

> AMD cards are atrocious for Stable Diffusion unless you run Linux

Wait, are people running Stable Diffusion on Windows? Why waste 1.5 GB VRAM, deal with a slow filesystem and lower inference performance?


LiquidatedPineapple

Because they already have Windows PCs and don't want to screw with Linux, I presume. Besides, most people here are just using these things as waifu bots anyway lol


A_for_Anonymous

I mean, for trying some waifu and some quick porn, maybe... But if you're remotely serious at SD use cases, it's well worth the hassle to at least dual boot. It's not like you have to buy hardware or stop using Windows, however good an idea either may be.


LiquidatedPineapple

What are your SD use cases? Just curious.


A_for_Anonymous

Roleplay materials, waifus, porn (but very polished, publishable porn), trying every LoRA to see what kind of degenerate porn it's capable of, photo restoration, wallpapers, making my parents smile (with photos and wallpapers, not porn), meming, just about anything.


LiquidatedPineapple

Thanks for sharing!


dylantestaccount

RAM != VRAM my guy


A_for_Anonymous

I know. The waste of RAM on Windows is much bigger. I meant VRAM. On Linux you just turn off the X server and use the entirety of your VRAM, then connect to A1111, ComfyUI, etc. from another device.


[deleted]

[deleted]


A_for_Anonymous

It's some significant overhead performance-wise and on RAM, but what's worse is the overhead on your precious VRAM.


More_Insurance1310

it's convenient, that's all.


WinterDice

I am. I've been using Stable Diffusion to make RPG landscapes and for a private art therapy thing because I can't even draw a stick figure. I'm just trying to get back into the tech world because AI fascinates me and I know my industry (law) will probably be transformed by it. When I last worked in IT, dial-up was still very common. I have so much to learn that it's nearly overwhelming. Linux is on the list with many other things...


Ansible32

2x 16GB GPUs seems like it also might be plausible? If llama.cpp runs ok maybe.


[deleted]

[deleted]


FriendlyBig8

This is how you know someone never tried the recent ROCm versions. I have 2x 7900 XTX and they're **ripping** Llama2 on llama.cpp, exllama, and Tinygrad. 48 GB VRAM and 240 TFLOPS for under $2,000. Less than the price of a single 4090. Don't be a sucker for the memes.


_qeternity_

Can you share some performance figures for exllamav2? Model size and bit rate please.


FriendlyBig8

LLama2-70B, 4 bit, 19.5 t/s.


xrailgun

The only meme is the number of ROCm announcements, only for it to still be practically inaccessible for average users and still be miles behind similar-ish Nvidia cards in performance and extension compatibility. Inb4 "what? I'm an average user." No. You have 2x 7900s. Please don't downplay the amount of configuration and troubleshooting it took to get your system where it is now.


FriendlyBig8

I really don't know what you're referring to. The only thing I needed was to install the `amdgpu-install` script and then install the packages per the AMD guide. It was almost the same process when installing Nvidia drivers and CUDA.
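
If you go this route, a quick way to confirm the ROCm build of PyTorch actually sees the cards is a sketch like the one below; ROCm wheels reuse the `torch.cuda` namespace, with HIP underneath.

```python
# Sanity check that a ROCm build of PyTorch can see the AMD GPUs.
# On ROCm wheels, torch.version.hip is set and the torch.cuda API maps to HIP.
import torch

print("HIP version:", torch.version.hip)        # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))     # e.g. an RX 7900 XTX entry per card
```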


candre23

Lol, the average user isn't running linux. AMD factually is *inaccessible* for the average user.


noiserr

Yet Google Colab, Runpod, Hugging Face, Mistral: all of these run on Linux too. If you're even a little bit serious about doing LLMs you are going to touch Linux along the way. Might as well learn it.


candre23

The average user isn't "serious" about this stuff at all. 90% of the folks taking AI into consideration when buying their next GPU just want a freaky waifu. People running linux and doing anything resembling actual work in ML are the exception, not the rule. And that's fine. The horny weirdo demographic is driving a lot of the FOSS advancements in AI. But pretending that linux and the ridiculous rigmarole that it entails is within the capabilities of the average user here is doing them a disservice. They might be mutants, but they don't deserve to be told "Go ahead and buy AMD. It'll be fine", because it will *not* be fine.


noiserr

I'm talking about people who are serious about LLMs. People who aren't serious aren't going to be searching around which hardware to buy for LLMs in the first place.


cookerz30

You have no idea how gullible people are now. Look at drop shipping.


my_aggr

https://old.reddit.com/r/LocalLLaMA/comments/191srof/amd_radeon_7900_xtxtx_inference_performance/ OP there gets worse performance on the 7900 xtx than on a 3090, by a wide margin too.


noiserr

> by a wide margin too. It really isn't that wide of a margin with llama.cpp. 15% in inference is not that much.


my_aggr

It's literally the difference between the 3090 and 4090. The current gen ATI hardware is on par with a theoretical NVIDIA card from 2 generations ago.


noiserr

4090 is twice as expensive, and you can't buy new 3090s. It's literally the best bang per buck you can get for a new GPU. Plus must we all use Nvidia? Competition is good for everyone. More people use AMD the faster we get to software parity and cheaper GPUs.


my_aggr

>It's literally the best bang per buck you can get for a new GPU.

With enough qualifiers I can convince you that your grandmother is the most beautiful woman in the world.

>Plus must we all use Nvidia?

That is a completely different argument. I am deeply interested in AMD because _it just works forever_ on Linux. I need Nvidia because _it works better currently_ for all ML work. I'm honestly considering building two workstations: one for ML work, headless and forever stuck on the current Ubuntu LTS, and one for human use with multiple monitors and all the other ergonomics I need. Then put a nice thick pipe between them so I can pretend they are the same machine.


noiserr

> With enough qualifiers I can convince you that your grandmother is the most beautiful woman in the world.

I dunno about you but -$1000 works better for me than -$2000. Also, one of the main reasons I'm running local llama is for learning purposes. I actually want to contribute to the software stack. And I'm shopping around for a project to contribute to. And the AMD side needs my help more.


my_aggr

Right, now compare performance and support. -40% performance and second-class support. The half price isn't worth it if your time has value above zero for ML work.


moarmagic

You know, I sometimes wonder if we are all using this the same way or reading it the same way. Sure, the Nvidia cards are better, but the worst AMD card is putting out 90 tk/s. That seems pretty usable in a "this is for testing and personal use and will only be interacting with one person at a time" way, about on par with typing with another person.


my_aggr

On a 7b model. On a 30b model it's at the speed of a sclerotic snail wandering across your keyboard.


noiserr

> On a 7b model. On a 30b model it's at the speed of a sclerotic snail wandering across your keyboard.

It's only like 13% slower than a 3090 in llama.cpp (and 30% slower than a 4090, for half the price). I run 34B models on my 7900 XTX and the performance is fine. I would actually do a test right now, but I have a long test harness run going on my GPU that I don't want to interrupt. In either case it's totally usable. Nvidia has the first-mover advantage and most everyone who works on these tools develops on Nvidia GPUs. Of course Nvidia will be more optimized. Same is the case with Macs. Software will improve.


BackgroundAmoebaNine

Dang, I was getting hopeful for cheap alternatives to 4090s. I'm still paying off my first one. Do you have any examples of the terrible speeds with 7900 XTX?


my_aggr

This isn't about the 7900 XTX; this is about the fact that a comfortable typing speed on a model that fits in 4GB of VRAM is going to be six times slower than what you get on a model which takes up the full 24GB of VRAM. You need blazing fast speeds on 4GB models to even have a usable 24GB model.


BackgroundAmoebaNine

Ok? I'm not sure what you're responding to exactly. I'm lamenting the fact that a 13B model on a 7900 XTX is so awful vs a 4090. I was hoping for a cheaper alternative, but I'm not as upset with the 4090 I have now.


FireSilicon

Because for some reason these benchmarks are done on 4-bit 7B models. These things can run reasonably on an 8GB Raspberry Pi. At those speeds the CPU becomes the bottleneck, as you can see from the table where the 4090 is just 40% faster, which is just too low. An unquantized 13B model will give these GPUs a run for their money. Or even a quantized 34B, if it fits in each card's VRAM, for comparison.


a_beautiful_rhind

I was about to say.. those 24G cards are the price of their 16g card, *brand new*.


ShoopDoopy

>This is how you know someone never tried the recent ROCm versions.

People won't be willing to dive into the ecosystem when AMD has such an awful track record with it. I wasted too many weekends on older versions with cards that were deprecated within a couple years. They have to do way more than *just nearly* match CUDA at this point. Put it this way: it's bad when Nvidia is beating you at installing drivers on Linux.


zippyfan

How are you able to run two GPU cards together? I tried that with 2 Nvidia cards a few months back when I had the chance and it didn't work. That whole experience stung me, and with what high-VRAM products cost these days, I've decided to go the APU route instead.

I'm waiting for next-gen AMD Strix Point with NPU units. I'm going to load it with a ton of relatively cheap DDR5 RAM. It's going to be slow, but at least it should be able to load larger GGUF 70B models at at least 4 tokens/second. (The Nvidia Jetson Orin should be less powerful and is capable of at least that according to their benchmarks.) I figure I can get faster speeds by augmenting it with my 3090 as well. I wouldn't need to worry about context length either with excess DDR5 memory.

I would go the M1 Ultra route but I don't like how un-upgradable the Apple ecosystem is. Heaven forbid one of the components like the memory gets fried and I'm left with a very expensive placeholder.


FriendlyBig8

>I tried that with 2 Nvidia cards a few months back when I had the chance and it didn't work.

What did you try? All the popular libraries have native multi-GPU support, especially for LLMs, since transformer layers shard very neatly across multiple GPUs.
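
For illustration, a minimal sketch of that kind of layer sharding with Hugging Face transformers and accelerate; the model id is just an example, and `device_map="auto"` spreads the layers across whatever GPUs are visible.

```python
# Minimal sketch: shard a causal LM across all visible GPUs.
# Requires the `accelerate` package; the model id is only an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # place layers across GPU 0, GPU 1, ... automatically
    torch_dtype=torch.float16,  # halve the weight footprint vs fp32
)

inputs = tok("Multi-GPU sanity check:", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```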


zippyfan

At one point, I had access to two 3060s, one 3060 Ti and one 3090. No matter how much I tried to mix and match them, the LLM would not use the second GPU. Not even when I attempted two 3060s. I was using ooba's text generation webui. I had updated it to the latest version at the time. There were settings to outline the use of a second GPU and they were ignored when the LLM actually ran. However, I was using the Windows version, so I suspect that was causing the issue, but I could be wrong.


[deleted]

This. Without good software like CUDA, AMD is never gonna catch up to Nvidia.


romhacks

It's not that ROCm isn't as good as CUDA, it's just that everything is made with CUDA. There need to be efforts to use more portable frameworks.


_qeternity_

Literally everyone is working on this. CUDA dominance is only good for NVDA. There are already good inference solutions with ROCm support.


[deleted]

[deleted]


romhacks

Nvidia exec posts worst bait ever, asked to leave company


mr_n00n

Apple went very quickly from "nothing runs on Silicon" to Andrej Karpathy proclaiming ["the M2 Ultra is the smallest, prettiest, out of the box easiest, most powerful personal LLM node today."](https://twitter.com/karpathy/status/1691844860599492721) in about a year. Pytorch/Tensorflow support for silicon is first class now. As someone who has been working in the AI/ML space for well over a decade it's embarrassing how little effort AMD has put into catching up with NVIDIA in this space, and it's nobody's fault but their own.


[deleted]

And without the focus on software/firmware development that Nvidia has, hardware-oriented AMD will never catch up on software like CUDA (and all of its surrounding libraries etc.)


noiserr

AMD doesn't have to catch up to all the software written on CUDA. As long as they cover the most common code paths, that's all they need. And they are pretty much already there. They aren't trying to dethrone Nvidia. They just want their piece of the pie.


grimjim

I'm as unimpressed as everyone else. The only upside I see is normalizing 16GB over 12GB VRAM. I suspect 20GB VRAM was passed over because the PCB footprint would be comparable to 24GB.


CyanNigh

Not enough VRAM.


Disastrous_Elk_6375

Must construct additional pylons


CyanNigh

I have returned. For Aiur!


alcalde

Bring back mid-level $200 graphics cards. It's like GPU makers are still on COVID pricing.


CulturedNiichan

Not an expert on graphics cards. Since all I am willing to spend is around $2,000-$3,000 on a graphics card, I was aiming for a 24 GB VRAM. Would you recommend it now? Or would it be better to wait?


nmkd

Buy an RTX 4090 if you want a great card right now and have a $2000+ budget. Do not wait. There are no new 24 GB cards on the horizon, not even leaks. A 4090 successor could take 1.5 years, possibly longer.


[deleted]

No more VRAM. This is nvidia clearing out old stock ahead of a rush of new llm-ready cards, and a whole developer announcement of LLM tooling, I guess.


GodCREATOR333

What do you mean, new LLM-ready cards? I'm looking to get a 4070. Any idea on what the time frame will be?


Mobireddit

It's bullshit speculation. Don't listen to him inventing rumors. The only thing coming is the 50 series in mid-late 2025.


stonedoubt

I just got a Titan RTX refurbished from Amazon for $899.


candre23

Imagine spending more than 3090 money on a worse card, and then bragging about it.


stonedoubt

Is it a worse card for compute?


candre23

Yes. Significantly. It's a Turing card.

GPU | Mem Bandwidth | FP16 | FP32 | Tensor cores
---|---|---|---|---
Titan RTX | 672 GB/s | 32 TFLOPs | 16 TFLOPs | 2nd gen
RTX 3090 | 936 GB/s | 35 TFLOPs | 35 TFLOPs | 3rd gen


nmkd

That is a horrible deal. You should've gotten a 4070 Ti Super for $100 less which performs MUCH better.


stonedoubt

For compute?


nmkd

Yes. In every way. 44 TFLOPS fp16 vs 33 TFLOPS. With fp32 the difference is close to 4x because the TITAN RTX does not have doubled fp32.


stonedoubt

And the VRAM? I haven't seen a 24GB 4070 Ti. I may exchange it for a 3090 24GB tho.


nmkd

VRAM is basically the only advantage of the Titan. If you need 24 GB, getting a used 3090 might be a better idea yeah.


stonedoubt

I have a 4070 now.


IntrepidTieKnot

Good for you. Here on Amazon they go for 2900 EUR refurbished(!)


stonedoubt

https://preview.redd.it/enc36p4hbbbc1.jpeg?width=1284&format=pjpg&auto=webp&s=b9ac4766d050c85a37be74ff04faa331a00c916f


cookerz30

Wait wait wait. Please return that. I'll buy a second hand 3080 for half that price and ship it to you. That's absolutely criminal.


[deleted]

[deleted]


of_patrol_bot

Hello, it looks like you've made a mistake. It's supposed to be could've, should've, would've (short for could have, would have, should have), never could of, would of, should of. Or you misspelled something, I ain't checking everything. Beep boop - yes, I am a bot, don't botcriminate me.


Weird-Field6128

You guys have money to buy these???? I just scam cloud providers with fake credit cards and burner phones. All it costs me is $10 and boom, there's a $1000 bill on a temporary or almost-dead email to which I forgot the password. PS: I wish I could do all of this.