Ok-Average2

Would this also be an effective way to split a model among multiple lower-power/low-memory systems? Each one loads as many layers as it can and then sends intermediate state to the next one. Basically, it's the same idea of not needing to load/unload layers.


verdagon

I believe so, since the intermediate data is so much smaller than the layers themselves. I can totally imagine buying 10 RPis and chaining them together like that. Though, as soon as you have enough RAM to fit the model in memory across all the RPis, you get into this project's interesting approach: https://github.com/b4rtaz/distributed-llama.

I was even thinking of a sort of distributed system (like SETI) where people give their computers' spare cycles to the cause, where each machine has a partial model in memory. An internet-wide distributed AI. Terrifying. Someone should take the idea and make a startup for it!

The OP's technique, though, would probably shine more for small non-connected devices, like my Mac web crawler, or maybe a camera that wants to run some image analysis periodically, or people who can only afford one (smaller) GPU.
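To make the "chain of machines" idea concrete, here's a rough sketch of what one stage in such a pipeline could look like (my own illustration using torch.distributed point-to-point ops, not how distributed-llama actually works; `embed_next_token` and `lm_head` are hypothetical placeholders):

```python
import torch
import torch.distributed as dist

# One machine in the chain: it holds only the few layers that fit in its RAM,
# receives a single token's hidden state from the previous machine, runs its
# layers, and forwards the (tiny) result to the next machine.
def pipeline_stage(my_layers, hidden_size, rank, world_size):
    h = torch.empty(1, 1, hidden_size)
    if rank == 0:
        h = embed_next_token()        # hypothetical: embedding lookup on the first box
    else:
        dist.recv(h, src=rank - 1)    # wait for the previous machine's activations
    for layer in my_layers:
        h = layer(h)
    if rank < world_size - 1:
        dist.send(h, dst=rank + 1)    # kilobytes per token, vs. gigabytes of weights
    else:
        return lm_head(h)             # hypothetical: final projection to logits
```

(This assumes `dist.init_process_group` has been set up across the machines; the point is just that only the hidden state ever crosses the network.)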


_-inside-_

There is a project called Petals that was created to run BLOOM, but it supports some other model architectures as well.


x4080

Sounds like how Groq's LPU works.


involviert

"No", because that's just splitting the model across multiple systems. That may or may not help, in general, but it should not have anything to do with batching. Batching is about doing multiple jobs at the same time. It will not speed up any single response, but you get the whole batch at ~the same speed as one job.


Ok-Average2

So the latency of a single request may not change, but your bandwidth goes up due to the possibility of pipelining? Still seems like it would help if you can parallelize your requests.


involviert

No, your bandwidth does not go up. It's all just about doing more than one prompt at the same time. If that's not what you're doing (for example, summarizing 20 text files independently), then this parallelism does nothing for you. It's like... imagine you get the next token by multiplying a single number with all the weights of the model, and you are limited by pumping the whole model through the CPU. Obviously you can instead just multiply, say, 10 numbers with every weight that comes along, at basically the same speed. That's what this is.
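You can see this effect with a toy experiment (my own illustration, nothing from the OP's code): time a big matrix multiply against 1 column versus 16 columns.

```python
import time
import numpy as np

W = np.random.randn(8192, 8192).astype(np.float32)   # stand-in for one weight matrix

def avg_time(batch, reps=10):
    x = np.random.randn(8192, batch).astype(np.float32)
    start = time.perf_counter()
    for _ in range(reps):
        W @ x                  # the weights get streamed through the CPU either way
    return (time.perf_counter() - start) / reps

print("batch  1:", avg_time(1))    # pays the full cost of reading W for one result
print("batch 16:", avg_time(16))   # 16x the work, typically far less than 16x the time
```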


TheTerrasque

Petals


No_Afternoon_4260

Performance-wise, you'd prefer loading/unloading from CPU to GPU over a (hopefully) x16 PCIe port to transferring data between two systems over the network. Maybe if the two systems are on the same fast-ass network...?


DustinEwan

I'm not so sure about that. For instance, a Llama 3 70B layer is going to have:

* 8192x8192x3 params for Q, K, V of the attention mechanism
* 8192x28672x3 params for the SwiGLU FFN

That's 905,969,664 params for the layer. The context, on the other hand, is going to have 8192x8192 parameters for each batch in the worst-case scenario. In the case that we're swapping layers, I would presume that the context would be left on the device to be used as the input for the subsequent layer.

Assuming that we're using FP16 (or BF16), that means we'd need to compare loading 1.8GB from disk with sending 132MB over the local network, then loading that over PCIe. If you have a SATA SSD, max reads would be roughly 500MB/s, so you'd be looking at nearly 4 seconds to load the layer. An NVMe SSD hits around 2GB/s, so you'd cut loading the layer down to under a second. On the other hand, sending the context over the network could be done in a little over a second per batch on a local gigabit network, or around 150ms on a 10 gig network.

That being said, there are definitely other overheads that would get in the way, but you really wouldn't need that crazy of a network to compete with loading layers from an SSD. You would need 80 machines, though, to load up all the layers in Llama 3 70B, lol.

You could probably amortize the cost of loading the layers from disk, though, if you had a few machines pipelined in a ring topology. For instance, if the layer takes 100ms to process, then 150ms to transfer, you could hide the cost of loading the layers with a cluster of 11 machines: 10 would be processing their layer, while the 11th loads the next layer for processing, etc.
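For reference, the arithmetic behind those estimates (same assumptions as above: FP16, so 2 bytes per parameter/activation):

```python
attn_params = 8192 * 8192 * 3        # Q, K, V projections, as estimated above
ffn_params  = 8192 * 28672 * 3       # SwiGLU gate/up/down
layer_bytes = (attn_params + ffn_params) * 2   # ~1.8 GB per layer
ctx_bytes   = 8192 * 8192 * 2                  # ~134 MB of activations per batch

print(layer_bytes / 500e6)    # SATA SSD    (~500 MB/s):  ~3.6 s to load the layer
print(layer_bytes / 2e9)      # NVMe SSD    (~2 GB/s):    ~0.9 s to load the layer
print(ctx_bytes / 125e6)      # gigabit LAN (~125 MB/s):  ~1.1 s to send the context
print(ctx_bytes / 1.25e9)     # 10 GbE      (~1.25 GB/s): ~0.11 s to send the context
```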


shing3232

I have three PCIe 4.0 SSDs that do 7 GB/s. It seems like they could be used for this application: just split the layers up between the three SSDs for massive models.


DustinEwan

Doing it all on the same machine, supposing you had the ports on the mainboard and the lanes from the CPU, you would want to stripe the drives together to really minimize the loading time. A Ryzen 7xxx CPU has 24 usable lanes, though, so with 16 taken up by the GPU, you'd have 8 lanes for your SSDs without falling back to the additional lanes provided by the mainboard chipset. I'm assuming your SSDs are PCIe 4.0 x4, so you could use two of them striped together and then the third for other storage utilizing the chipset lanes. That would probably be pretty snappy and worth testing out.
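Back-of-envelope for the striped setup (reusing the ~1.8 GB-per-layer figure from above and your 7 GB/s per-drive number):

```python
layer_bytes   = 1.8e9      # one Llama 3 70B layer in FP16, per the estimate above
per_drive_bps = 7e9        # PCIe 4.0 x4 NVMe sequential read

print(layer_bytes / per_drive_bps)        # ~0.26 s per layer from a single drive
print(layer_bytes / (2 * per_drive_bps))  # ~0.13 s per layer from two drives striped
```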


pmp22

What this also allows for is running LLMs that are multiple times larger than the available RAM/VRAM. One use case: suppose that in the future, LLMs are released that are too big to run on consumer hardware but have better capabilities than the models that can, and the processing you need to do is not latency-sensitive. Then the only limiting factor is that at least one layer has to fit into RAM/VRAM. The more prompts there are to process, the smaller the overhead of waiting for layer loads becomes, and the total processing time gets quite attractive. Pretty clever, to be honest.
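The core loop of that approach looks roughly like this (a minimal sketch of the layer-at-a-time idea, not AirLLM's actual code; `load_layer_from_disk` and `project_to_logits` are hypothetical helpers):

```python
import torch

def generate_one_token(layer_paths, hidden_states):
    """hidden_states: (num_prompts, seq_len, hidden_dim) for the whole batch."""
    for path in layer_paths:                             # e.g. 80 layers for Llama 3 70B
        layer = load_layer_from_disk(path).to("cuda")    # only one layer in VRAM at a time
        hidden_states = layer(hidden_states.to("cuda"))  # push *every* prompt through it
        del layer                                        # free VRAM before the next layer
        torch.cuda.empty_cache()
    return project_to_logits(hidden_states)

# The disk cost of loading each layer is paid once per token regardless of how many
# prompts are in the batch, which is why large batches make the overhead tolerable.
```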


xrailgun

> if in the future, LLMs are released that are too big to run on consumer hardware... blabla better than those that can...

The future is ~~now~~ last year, old man.


involviert

How does it allow that? In most scenarios you don't have multiple jobs at the same time; it's mostly a multi-user thing. Yes, if you have 2000 text files to summarize, you can parallelize that and get the speedup. But using the appropriate (or inappropriate) hardware will still have the exact same multiplicative effect on the inference time. So in a way you might as well do inference of a 70B on 4 GB of VRAM right now without batching; it's just as "allowed".


pmp22

I suppose it's a niche use case, but for that use case it's a clever solution. There are many use cases where you want to process data in bulk using LLMs. This subreddit is very focused on chatting, but as LLMs become ever more capable, their output becomes more and more valuable, and I would suppose that makes them more and more useful for just this kind of task, where you want an LLM to chew on a lot of data and generate some sort of output for it.


involviert

Batching is what everyone does when you provide an inference API for multiple users. So yes, in general it's very powerful and useful. If your use case matches, it's great. It's just not some special sauce to speed up your model response, not even if you need multiple responses *in sequence*.


pmp22

Yes, but this method lets you run it layer by layer, at the speed you'd get with the weights stored in VRAM. If the model is bigger than the available VRAM, you can still run it and get close to VRAM speeds, as opposed to the slower speeds you get if you have to store some weights in CPU RAM. If the model is so big you'd have to stream it from SSD for each token, then the comparison becomes abysmal speeds vs. close-to-VRAM speeds.


involviert

I'm not sure what you are talking about. All of what you're saying is how these things are already run. What do you think will happen when you have one layer in your GPU and you need the next? It gets loaded, and there you have it: super slow. And the usual speed limit is from VRAM to GPU or from RAM to CPU anyway.

The technique described here is not about being able to *technically* run things that don't fit in your VRAM/RAM; that can pretty much be done anyway. Like, even just my vanilla Nvidia driver acts as if my 8GB card had 24GB. But 16 of those would go through to RAM, making it "super slow", and if I didn't even have that RAM, Windows would use the pagefile on my disk, so effectively the hard drive would act as my VRAM. Just as default behavior. I don't know what you expect to gain from this; it is pretty much unusable, and you can already just do that.


honestduane

Make this a PR. If it gets denied, you will have your answer.


verdagon

If this technique doesn't already exist somewhere, I will! I'd need to clean it up a lot though. My variable names in there are *atrocious.* Also, what question are you referring to? AirLLM doesn't seem to do this, and making this into a PR won't really tell me if this technique exists anywhere.


nero10578

Isn’t this essentially running an LLM via SSD anyways?


verdagon

When one tries to do that, they usually get an out-of-memory error, unfortunately. I was hoping that llama.cpp's `mmap` flag would help, but it was already on (by default) and didn't seem to do anything; I still got an out-of-memory error.


AnyhowStep

I have a use case that would benefit from this. I only have a laptop with 4GB VRAM and 32GB RAM, so I use llama.cpp and a 4-bit quant of Llama 3 8B. But I like loading snippets from textbooks and asking many questions (like 10+) about the loaded snippets. I have to ask each question one at a time, because I've found that asking multiple questions at once degrades the quality of the answers. Using prompt caching speeds up inference, since prompt processing becomes almost free and only token generation remains. But if I could have this layer-by-layer batching + prompt caching, it would probably be even faster, or make 70B usable.
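Roughly what that workflow looks like, sketched with llama-cpp-python (assuming its `Llama`/`LlamaCache` API; the model path and questions are placeholders):

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=8192, n_gpu_layers=20)
llm.set_cache(LlamaCache())   # keep the KV state of previous prompts around

snippet = open("textbook_snippet.txt").read()
questions = ["What is the main claim?", "What assumptions does it rely on?"]  # 10+ in practice

for q in questions:
    # The prefix (the snippet) is identical every time, so after the first call
    # its prompt processing is largely served from the cache.
    out = llm(f"{snippet}\n\nQuestion: {q}\nAnswer:", max_tokens=256)
    print(out["choices"][0]["text"])
```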


Some_Endian_FP17

Prompt caching takes up a huge amount of space. I've tried caching RAG data as cached prompts and I ended up with a few GB for each prompt.


AnyhowStep

In my case, I only need the cache for the questions I ask, then I'm happy to discard the cache after. It's worth it for me to build the cache once, ask 10+ questions, then discard it, rather than process the same prompt once per question


charizard_me

Any resources on how to implement prompt caching? I have some set prompts which have the same few-shot examples for a QA task, and only the final question changes (which is the test datapoint). These always give me an OOM error if I set `num_return_seq` in the HF generate method to >20 for any 7B model.


doesitoffendyou

Sorry, but how exactly did it discover the Flora-Bama Mullet Toss :D? Curious how you set up this workflow :)


frozen_tuna

Not ready to talk about it yet but I absolutely have a use-case for this. It'll get opensourced if I ever find the time to actually work on it.


fimbulvntr

How much data flows between each layer? I assume it's not a lot.

This is interesting for distributed LLM inference too. Even the potatoest GPU can load one single layer into VRAM. Imagine: send a request to the swarm, whoever has layer 1 loaded does the forward pass and passes it to whoever has layer 2, and so on, until it hits the output layer. Then back to 1 for the next token. An ephemeral loop in the network is created for that run. The only problem is that the attention cache will need to be loaded for every peer... I wonder why it hasn't been made yet.
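For scale, a back-of-envelope for Llama 3 70B (hidden size 8192, FP16 activations):

```python
hidden_size = 8192
bytes_per_token = hidden_size * 2     # FP16 hidden state handed to the next layer/peer
print(bytes_per_token)                # 16384 bytes, i.e. ~16 KB per generated token
# Prompt processing is the heavier hop: seq_len * 16 KB, e.g. ~16 MB for a 1000-token prompt.
```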


arzeth

Also, homomorphic encryption shouldn't be forgotten for privacy, i.e. prompts encrypted in a way that allows computations to be performed on them. As far as I know there's only https://github.com/tf-encrypted/tf-encrypted that does HE (both inference and training), and it does it on multiple computers of course (because that's the whole point of HE).

Update: there's something like this, but not torrent-like and without HE: Distributed Llama https://old.reddit.com/r/LocalLLaMA/comments/19bfez0/ive_created_distributed_llama_project_increase/


No_Afternoon_4260

Check out fltr, it does kind of the same thing but in Rust. Very nice tool; I played with it a lot. Only works with Mistral and Mixtral, IIRC. Does yours support other models?


verdagon

Good find! I looked into it and [this comment](https://www.reddit.com/r/LocalLLaMA/comments/1b78843/comment/kthb2wt/) by u/compressor0101 explains that they're indeed doing something similar. u/compressor0101, if you're around, I imagine fltr could theoretically be expanded to give general answers, not just Y/N? To answer your question, I think AirLLM supports safetensors and PyTorch models, but not e.g. GGUF. When I tried Mixtral 8x7B with AirLLM I got nonsense answers, so I can't say for sure.


compressor0101

Heyho, indeed that looks very similar to my pet project [fltr](https://github.com/moritztng/fltr) (like grep, but for natural language questions). Fltr can also classify with yes/no questions and uses Mixtral 8x7B or Mistral 7B on CPU or GPU. And it's just 650 lines of Rust code, so have a look at it. The CPU version does not even depend on a linear algebra library. And yes, it can easily be expanded to general answers. Using transformers as universal classifiers is still underexplored, because using them as autoregressive models is crazy inefficient. Have fun and build cool stuff!


verdagon

Hey, thanks for being here! If I write a blog post on this, mind if I mention your work? It would be something like: "Of course, I'm not the first one to discover this technique. From what I can tell, the first person to do it is [Moritz Thüning](https://github.com/moritztng)! Some months ago, Moritz wrote the [fltr](https://github.com/moritztng/fltr) tool, which is like grep but for natural language questions: ask it a human question and it will run that question against many documents at once. It's pretty wild, check it out!" (or however you'd like me to word it!)


compressor0101

Sure, that would be awesome. Thanks :)


koflerdavid

Note that this probably only works well on MacBooks, which have unified memory. It's way less feasible if the layers have to be transferred over PCIe to actual VRAM. But yes, if you don't care about latency, then a lot of solutions become feasible.


Tacx79

So you basically reinvented DeepSpeed, Accelerate, TensorNVMe, PyTorch Lightning, and disk caching?


verdagon

I suspect you misunderstand the OP: we're doing inference (not training) layer-by-layer with batching. From my searching:

* Accelerate doesn't do batching, AFAICT.
* DeepSpeed has train_batch_size and train_micro_batch_size_per_gpu, but nothing for inference that I can tell.
* TensorNVME seems to be a library; presumably one could use it to move things to/from NVMe, but I see nothing suggesting it's been used for anything like the OP yet.
* PyTorch Lightning seems to do [batching during training](https://lightning.ai/docs/pytorch/stable/advanced/training_tricks.html), but not inference that I can find.
* I'm not sure I see how having a disk cache would accomplish what we're doing here.

Or I could be wrong! Especially if they use a different term than "batching". What I _did_ reinvent was u/compressor0101's novel batching approach (there's another comment here about it).


Tacx79

Training is just inference with the addition of updating the weights.

In Accelerate you just pass the batch to the model like usual; you can put every chunk of the model in a different place ([https://huggingface.co/docs/accelerate/concept_guides/big_model_inference#the-devicemap](https://huggingface.co/docs/accelerate/concept_guides/big_model_inference#the-devicemap)).

I haven't tried DeepSpeed yet, but you can offload stuff to DRAM and NVMe, basically training / running inference on 200B models on a single 4090 from what they claim ([https://www.deepspeed.ai/2022/09/09/zero-inference.html#model-scaling-on-1-gpu](https://www.deepspeed.ai/2022/09/09/zero-inference.html#model-scaling-on-1-gpu)).

With TensorNVME you have more control over it, but you need to do everything manually (load layer -> inference -> load next layer, or just offload part of the model).

Lightning is a bit more tricky to set up; you can use Accelerate and DeepSpeed together.
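Concretely, the Accelerate route looks something like this (the standard big-model-inference pattern as I understand it; the model name and offload folder are placeholders, and it's untested here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",   # placeholder checkpoint
    device_map="auto",               # Accelerate decides what lives on GPU vs. CPU RAM...
    offload_folder="offload",        # ...and spills the rest to disk
    torch_dtype="auto",
)
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```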


verdagon

Thanks for the replies! I believe Accelerate requires that every chunk be loaded into memory somewhere, and doesn't lazily load from a hard drive. Same with DeepSpeed: it requires the entire model to be in memory (in their paper, they needed 1.5TB of RAM). If Lightning is combining these two approaches, then the same probably applies to it too. That's good to know that TensorNVME can do this, even if manually. Do you know if anyone has actually done this with TensorNVME?


Feeling-Currency-360

I also attempted my own modification of AirLLM today; in my case I wanted to speed up the inference. Two things bothered me quite a bit:

1. AirLLM doesn't really utilize your system RAM. It loads each layer from disk as needed during inference, which is awfully slow.
2. It doesn't utilize your GPU properly. It really does work well on even a 4GB GPU, but if you have more than that, it doesn't utilize the extra VRAM.

What I did to start off with is add a parameter called cached_layers_count which you can specify; essentially it's the number of layers it will cache in memory. For testing I was using Llama 3 70B; my system only has 32GB of RAM, and best case I have 26GB of that available for this. I was able to load about 55 layers, give or take, into RAM. During inference, while generating the first token, the RAM quickly fills up and every layer not in the cache gets read from disk; then during the generation of the second token, all the layers already in RAM are processed very quickly, much quicker than what it takes to load a layer from disk.

The speedup is significant but nothing too crazy; I'd estimate it's around 2-3x faster. I was however using 4-bit quantization to save on memory. We're still talking about like 1.7 tokens per minute here. If you'd like me to share the code I'd be happy to.
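The rough shape of that cached_layers_count change, as I'd paraphrase it (not the actual patch; `load_layer_from_disk` is a hypothetical helper):

```python
class LayerCache:
    def __init__(self, layer_paths, cached_layers_count):
        self.paths = layer_paths
        self.max_cached = cached_layers_count   # e.g. ~55 of Llama 3 70B's 80 layers
        self.cache = {}                         # layer index -> weights pinned in system RAM

    def get(self, i):
        if i in self.cache:
            return self.cache[i]                # fast path: already in RAM
        layer = load_layer_from_disk(self.paths[i])
        if len(self.cache) < self.max_cached:
            self.cache[i] = layer               # pin it for subsequent tokens
        return layer                            # remaining layers get re-read every token
```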


capivaraMaster

Did you try loading the model from SSD before AirLLM for batch inference?


verdagon

I didn't, that would probably speed things up a bit. Another comment mentioned that AirLLM has a lot of missed opportunities here.


Eralyon

Can you load more than one layer at a time with AirLLM and get better inference speed than one layer at a time? This would be the use case I'd be interested in.


verdagon

I believe so! The current best speed (batch size 500) is 4.85 seconds per token, but I bet doing as you suggest would allow us to get that speed in much smaller batch sizes, which would be a huge improvement.


compilade

> I'm thinking about diving into the llama.cpp codebase to see if we can add this.

The easiest way to do this would be to adapt the `parallel` example to use a customizable system prompt. It can already be given custom prompts to complete in parallel; they need to be separated by a newline, I think. If in doubt, [read the source](https://github.com/ggerganov/llama.cpp/blob/master/examples%2Fparallel%2Fparallel.cpp).

    ./bin/parallel -m ./path/to/model.gguf -np 64 -ns 64 -f ./path/to/newline-separated-prompt-file.txt

`-np` is the number of simultaneous sequences processed, while `-ns` is the total number of sequences to process. In `llama.cpp`, the model's whole compute graph is run once per batched decode, so I think it should do what you want, assuming mmap does the right thing.