If you plan to deploy or fine-tune advanced LLMs like Llama 3, you're probably already thinking about the challenges that come with it, especially the hefty VRAM requirements. Handling extensive datasets and intricate algorithms requires adequate VRAM for smooth and efficient LLM training and inference. Without it, you risk slowdowns or the inability to run your model at all. In this post, we’ll explore why VRAM is imperative for working with LLMs and how to determine how much you need.
VRAM is a high-performance type of memory built into GPUs to handle parallel computations. While initially designed for graphics rendering, VRAM has become an essential requirement for running LLMs and their data-intensive computations.
Storing Model Parameters: LLMs contain millions to billions of parameters that must reside in VRAM during inference and training. Adequate VRAM ensures these parameters are accessible at high speed for smooth and efficient computation.
Managing Activations and Intermediate Data: During processing, LLMs generate substantial intermediate data (activations) that occupy VRAM. With a sufficient VRAM capacity, you can manage this data without causing bottlenecks.
Parallelising Computations: Processing multiple inputs simultaneously with batching improves computational efficiency. However, larger batch sizes demand more VRAM to hold the concurrent inputs and their activations, which directly influences throughput (see the sketch below).
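To make the scaling concrete, here is a minimal sketch (not a measured profile) of how activation memory grows with batch size. The hidden size and layer count are illustrative values loosely matching a 70B-class model, and the estimate only counts one hidden-state tensor per layer, so real usage will be higher:

```python
def activation_memory_gb(batch_size: int, seq_len: int, hidden_size: int,
                         num_layers: int, bytes_per_value: int = 2) -> float:
    """Rough lower bound: one 16-bit hidden-state tensor kept per layer."""
    values = batch_size * seq_len * hidden_size * num_layers
    return values * bytes_per_value / 1e9

# Illustrative 70B-class shape: hidden size 8192, 80 layers, 2048-token inputs
print(activation_memory_gb(batch_size=1, seq_len=2048, hidden_size=8192, num_layers=80))  # ~2.7 GB
print(activation_memory_gb(batch_size=2, seq_len=2048, hidden_size=8192, num_layers=80))  # ~5.4 GB
```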
You may also like to read our blog on Static vs. Continuous Batching for LLM Inference.
Several factors can influence VRAM usage for LLMs, including:
The number of parameters directly determines VRAM usage. A larger model requires more memory for weights, activations and gradients. Thus, balancing the model’s size with available resources is important to prevent bottlenecks.
The numerical precision used for computations impacts memory consumption: 32-bit (FP32) weights take 4 bytes per parameter, 16-bit (FP16/BF16) take 2 bytes, 8-bit takes 1 byte and 4-bit takes 0.5 bytes, so halving the precision roughly halves the memory needed for the weights.
Increasing the batch size improves processing efficiency but significantly increases VRAM usage, because activations are stored for every input in the batch. Doubling the batch size can nearly double the activation memory.
Techniques like gradient checkpointing and gradient accumulation can help reduce memory requirements during training: checkpointing recomputes activations during the backward pass instead of storing them all, while accumulation spreads a large effective batch over several smaller steps.
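As a sketch of what these techniques look like in practice with the Hugging Face Transformers Trainer (the model name is purely illustrative and assumes you have access to it):

```python
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Gradient checkpointing: recompute activations during the backward pass
# instead of keeping them all in VRAM (trades extra compute for memory).
model.gradient_checkpointing_enable()

# Gradient accumulation: take an optimiser step only every 16 micro-batches,
# giving an effective batch size of 16 at the VRAM cost of batch size 1.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
```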
The type of LLM workload you choose also impacts the VRAM requirements, for example:
Fine-tuning pre-trained LLMs on specific tasks tends to demand more VRAM than other workflows. Techniques like quantisation or reducing the batch size can help optimise VRAM usage for more efficient fine-tuning.
Parameter-Efficient Fine-Tuning (PEFT): PEFT methods, such as Low-Rank Adaptation (LoRA), enable efficient adaptation of large pre-trained models to new tasks by introducing a smaller number of trainable parameters. This approach significantly reduces VRAM usage compared to full fine-tuning. For instance, fine-tuning a 7 billion parameter model using LoRA can lower memory requirements by up to 5.6 times, enabling fine-tuning on systems with limited VRAM [See Source].
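For example, a minimal LoRA setup with the Hugging Face PEFT library might look like the sketch below; the rank and target modules are common choices, not prescriptions:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

# Freeze the base model and attach small low-rank adapter matrices
# to the attention projections; only the adapters are trained.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# Gradients and optimiser states now exist only for a small fraction of
# the parameters, which is where the VRAM savings come from.
model.print_trainable_parameters()
```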
Inference generally requires less VRAM than fine-tuning or training. However, larger models and more complex tasks will naturally lead to higher VRAM demands, particularly for high-performance or real-time inference. For shorter text inputs (less than 1024 tokens), the memory needed for inference is primarily determined by the memory required to load the model weights [See Source].
Check out how to find the estimated VRAM requirements for Inference and Fine-tuning:
The VRAM needed for inference primarily depends on the model's parameter count, the precision at which the weights are loaded and an additional overhead for activations and framework buffers.
To estimate the VRAM required for inference for an LLM like Llama 3 70B, you can use the following formula for an easy start:

M = (P × 4B) / (32 / Q) × 1.2
Here’s what each symbol means:
| Symbol | Meaning |
| --- | --- |
| M | The required GPU memory in gigabytes (GB). |
| P | The total number of parameters in the model, for example, 7 billion for a 7B model. |
| 4B | The size of each parameter, typically 4 bytes. |
| 32 | The number of bits in 4 bytes. |
| Q | The bit precision used for loading the model (such as 16 bits, 8 bits, or 4 bits). |
| 1.2 | A factor that accounts for a 20% additional memory overhead on the GPU beyond the parameters themselves. |
Source: Calculating GPU Memory for Serving LLMs
For a Llama 3 70B model loaded in 16-bit precision, the calculation for GPU memory would be:

M = (70 × 10⁹ × 4 bytes) / (32 / 16) × 1.2
This results in an estimated GPU memory requirement of 168 GB for loading the model.
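If you prefer to script the estimate, here is a small helper that mirrors the formula above (the 1.2 overhead factor is the same rough allowance, not a measured value):

```python
def estimate_inference_vram_gb(params_billions: float, precision_bits: int,
                               overhead: float = 1.2) -> float:
    """M = (P x 4 bytes) / (32 / Q) x overhead, returned in GB."""
    bytes_per_param = 4 / (32 / precision_bits)  # e.g. 2 bytes at 16-bit
    return params_billions * bytes_per_param * overhead

print(estimate_inference_vram_gb(70, 16))  # Llama 3 70B at 16-bit -> 168.0
print(estimate_inference_vram_gb(70, 4))   # the same model 4-bit quantised -> 42.0
```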
Fine-tuning typically requires more VRAM compared to inference due to additional memory overhead. The VRAM needed for fine-tuning primarily depends on:
Model: The base model needs to be loaded in memory.
Precision: As with inference, the chosen precision affects VRAM consumption.
Optimiser states: Optimisers like AdamW maintain additional state for each parameter (such as first and second moment estimates), which increases VRAM usage.
Gradients: During backpropagation, gradients are computed and stored for each parameter. This increases memory usage as these gradients are typically stored in the same precision as the weights.
Activations: Activations from the forward pass must be retained until backpropagation is complete. This can significantly increase VRAM requirements, especially for deep networks with many layers.
Batch Size: For a batch size of 1, VRAM can be estimated by factoring in the memory used for model weights, gradients and optimiser states.
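Putting those pieces together, here is a rough sketch for full fine-tuning at batch size 1. The per-parameter byte counts are assumptions (16-bit weights and gradients, two 32-bit AdamW states per parameter) and the overhead factor stands in for activations and buffers, so treat the output as a ballpark only:

```python
def estimate_full_finetune_vram_gb(
    params_billions: float,
    weight_bytes: float = 2.0,     # 16-bit model weights (assumption)
    grad_bytes: float = 2.0,       # gradients kept at the same precision
    optimiser_bytes: float = 8.0,  # AdamW: two 32-bit states per parameter (assumption)
    overhead: float = 1.2,         # rough allowance for activations and buffers
) -> float:
    bytes_per_param = weight_bytes + grad_bytes + optimiser_bytes
    return params_billions * bytes_per_param * overhead

print(estimate_full_finetune_vram_gb(7))  # roughly 100 GB for a 7B model
```

Parameter-efficient methods such as LoRA avoid most of the gradient and optimiser cost because only the adapter weights are trained, which is why their VRAM requirements are far lower.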
The memory needed for fine-tuning and inference can vary based on the method used and the precision of the model.
To help estimate your specific GPU memory requirements more accurately, we recommend using our GPU selector.
Apart from the formula above, you can use a GPU selector tool. Selecting the right GPU is imperative for efficiently training and deploying LLMs. The ideal GPU depends on factors such as model size, precision requirements and workload type.
To get started, you must consider the following:
Choosing high-performance cloud GPUs for LLMs like the NVIDIA A100 or NVIDIA H100 offers substantial VRAM and computational power, suitable for large-scale LLM tasks. Not sure which configuration is best for your model and requirements? Get started with Hyperstack GPU Selector Tool to find the ideal LLM GPU for fine-tuning and inference, customised for your project's needs.
Training typically requires more VRAM than inference. For example, fine-tuning a 70B parameter model at 16-bit precision may need 2 to 4 NVIDIA A100 GPUs with 80GB of memory each, while inference can be managed with fewer resources. For LLM inference with a smaller, older model such as Meta-Llama-2-7B at float16 precision, a single NVIDIA RTX A6000 with 48GB of VRAM can handle the task efficiently.
See our Guide to Choosing the Right GPU for LLM to learn more!
Using lower precision can reduce your VRAM requirements. For instance, 8-bit quantisation can halve memory usage compared to 16-bit precision. With simple parameter changes, you can load Hugging Face models at half precision, or use the bitsandbytes library to load models in 8-bit or 4-bit precision.
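As a sketch using the Hugging Face Transformers and bitsandbytes integrations (the model name is only an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # example model id

# Half precision: roughly 2 bytes per parameter instead of 4.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 4-bit quantisation via bitsandbytes: roughly 0.5 bytes per parameter.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```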
VRAM is critical for storing model parameters, activations, and intermediate data during training and inference. It ensures that large language models can run efficiently without frequent data transfers to system memory, minimising bottlenecks. Adequate VRAM leads to faster and smoother computations, enhancing model performance.
If your GPU lacks sufficient VRAM, the model will struggle with frequent data swaps between GPU and system memory. This can severely slow down training and inference or even make it impossible to load and process the model effectively. In the worst case, your GPU may run out of memory, leading to crashes.
For a Llama 70B model loaded in 16-bit precision, you'll need about 168 GB of VRAM. This requirement is based on the formula mentioned in our blog above including parameters, precision, and an additional 20% overhead for activation and other data. You'll likely need multiple high-memory GPUs to manage this workload efficiently.
When selecting a GPU for fine-tuning LLMs, consider factors like model size, precision requirements, and task type, such as training or inference. Powerful GPUs like the NVIDIA A100 and the NVIDIA H100 with 80 GB of VRAM are ideal for large-scale models like Llama 70B, while smaller models may require less powerful GPUs.
The Hyperstack GPU Selector Tool for LLM is designed to help you match your LLM’s specific requirements with the optimal GPU configuration. It takes into account model size, precision, batch size, and workload type, ensuring you find the GPU that meets your needs for both fine-tuning and inference. Check out our GPU Selector for LLM here!