Published on 16 Jan 2025

How Much VRAM Do You Need for LLMs?

If you plan to deploy or fine-tune advanced LLMs like Llama 3, you're probably already thinking about the challenges that come with it, especially the hefty VRAM requirements. Handling extensive datasets and intricate algorithms requires adequate VRAM for smooth and efficient LLM training and inference. Without it, you risk slowdowns or being unable to run your model at all. In this post, we'll explore why VRAM is imperative for working with LLMs and how to determine how much you need.

What is VRAM and Why Does It Matter?

VRAM (video random access memory) is high-performance memory built into GPUs to handle parallel computations. While originally designed for graphics rendering, VRAM has become an essential requirement for running LLMs, which rely on data-intensive computations. Here's why it matters:

  1. Storing Model Parameters: LLMs contain millions to billions of parameters that must reside in VRAM during inference and training. Adequate VRAM ensures these parameters are accessible at high speed for smooth and efficient computation.

  2. Managing Activations and Intermediate Data: During processing, LLMs generate substantial intermediate data (activations) that occupy VRAM. With sufficient VRAM capacity, you can manage this data without causing bottlenecks.

  3. Parallelising Computations: Processing multiple inputs simultaneously with batching improves computational efficiency. However, larger batch sizes demand more VRAM to hold the concurrent data, which directly influences throughput.

You may also like to read our blog on Static vs. Continuous Batching for LLM Inference.

Factors That Influence VRAM Usage

Several factors can influence VRAM usage for LLMs, including:

Model size

The number of parameters directly determines VRAM usage. A larger model requires more memory for weights, activations and gradients. Thus, balancing the model’s size with available resources is important to prevent bottlenecks.

Precision

The numerical precision used for computations impacts memory consumption:

  1. FP32 (32-bit floating point): Highest precision, and therefore the highest VRAM usage.
  2. FP16 (16-bit floating point): Roughly halves VRAM needs while largely maintaining accuracy.
  3. Int8 (8-bit integer): Significantly reduces memory requirements; often used for inference.
  4. Int4 (4-bit integer): Further reduces memory consumption, suitable for highly optimised inference scenarios where reducing memory footprint is essential.
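
As a minimal Python sketch, here's how bytes per parameter translate into weight-only memory for a 7B-parameter model at each precision level (this ignores activations, KV cache and other overhead):

```python
# Rough illustration: memory needed just to hold the weights of a
# 7-billion-parameter model at different precisions (weights only).
BYTES_PER_PARAM = {
    "FP32": 4.0,   # 32-bit float
    "FP16": 2.0,   # 16-bit float
    "Int8": 1.0,   # 8-bit integer
    "Int4": 0.5,   # 4-bit integer
}

params = 7e9  # 7B parameters

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = params * nbytes / 1e9  # bytes -> gigabytes
    print(f"{precision}: ~{gb:.1f} GB for weights alone")
```

Running this prints roughly 28 GB for FP32, 14 GB for FP16, 7 GB for Int8 and 3.5 GB for Int4, which is why lower precision is such an effective lever for fitting models into limited VRAM.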

Batch size

Increasing batch size improves processing efficiency but significantly increases VRAM usage. Doubling the batch size can nearly double VRAM consumption.

Checkpointing and Gradient Accumulation

Techniques like gradient checkpointing and gradient accumulation can help reduce memory requirements during training:

  • Gradient checkpointing: Gradient checkpointing reduces memory usage by saving only a subset of activations during the forward pass. During backpropagation, the model recomputes the missing activations as needed. This approach decreases memory consumption but may increase computation time due to the recomputation overhead. In practice, gradient checkpointing can slow down training by about 20%.
  • Gradient accumulation: Gradient accumulation allows you to simulate larger batch sizes without increasing memory usage. Instead of updating the model weights after each batch, gradients are accumulated over multiple smaller batches. The optimiser updates the weights only after a predefined number of accumulation steps. This technique reduces VRAM requirements but results in longer training times. 
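
To make gradient accumulation concrete, here is a minimal, self-contained PyTorch sketch using a toy model and dummy data (the model, data and hyperparameters are placeholders for illustration, not a recipe):

```python
import torch
from torch import nn

# Toy model and dummy data standing in for a real LLM and dataloader
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 8  # effective batch = micro-batch size x 8
micro_batches = [(torch.randn(2, 16), torch.randint(0, 4, (2,)))
                 for _ in range(32)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):
    loss = loss_fn(model(inputs), targets)
    # Scale the loss so the accumulated gradients match one large batch
    (loss / accumulation_steps).backward()
    # Only step the optimiser every `accumulation_steps` micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

For gradient checkpointing, Hugging Face Transformers models typically expose a gradient_checkpointing_enable() method, so it can usually be switched on with a single call rather than manual recomputation logic.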

Type of Workload

The type of LLM workload you run also impacts VRAM requirements. For example:

Fine-Tuning

Fine-tuning pre-trained LLMs on specific tasks tends to demand more VRAM compared to other workflows. Techniques like quantisation or reducing batch sizes can help optimise VRAM usage for more efficient fine-tuning.

Parameter-Efficient Fine-Tuning (PEFT): PEFT methods, such as Low-Rank Adaptation (LoRA), enable efficient adaptation of large pre-trained models to new tasks by introducing a smaller number of trainable parameters. This approach significantly reduces VRAM usage compared to full fine-tuning. For instance, fine-tuning a 7 billion parameter model using LoRA can lower memory requirements by up to 5.6 times, enabling fine-tuning on systems with limited VRAM [See Source].
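
With Hugging Face's peft library, a LoRA setup typically looks along these lines (the model ID and hyperparameters below are illustrative examples, not recommendations):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative only: model ID and LoRA hyperparameters are example values
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are actually trained
```

Because only the small LoRA matrices receive gradients and optimiser states, the bulk of the fine-tuning memory overhead disappears, which is where the large VRAM savings come from.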

Inference

Inference generally requires less VRAM than fine-tuning or training. However, larger models and more complex tasks will naturally lead to higher VRAM demands, particularly for high-performance or real-time inference. For shorter text inputs (less than 1024 tokens), the memory needed for inference is primarily determined by the memory required to load the model weights [See Source].


Estimating VRAM Requirements

Here's how to estimate VRAM requirements for inference and fine-tuning:

VRAM Requirements for Inference

The VRAM needed for inference primarily depends on:

  • Model Size: Larger models with more parameters require more VRAM to load the weights into memory.
  • Precision: The bit precision of the model’s weights (e.g., FP32, FP16, or Int8) significantly affects memory consumption. Higher precision (FP32) requires more VRAM, while lower precision (e.g., FP16 or Int8) uses less memory.
  • Batch Size: For inference with batch size 1, you can roughly estimate VRAM by multiplying the model's parameters by the precision factor (bytes per parameter) and a small overhead factor for activations and temporary data structures.

To estimate the VRAM required for inference with an LLM like Llama 3 70B, you can use the following formula as an easy starting point:

M = (P × 4B) / (32 / Q) × 1.2

Here's what each symbol means:

  • M: The required GPU memory in gigabytes (GB).
  • P: The total number of parameters in the model, for example a model with 7 billion parameters (7B).
  • 4B: The size of each parameter, typically 4 bytes.
  • 32: The number of bits in 4 bytes.
  • Q: The bit precision used for loading the model (such as 16, 8 or 4 bits).
  • 1.2: A factor that accounts for a 20% additional memory overhead on the GPU beyond the parameters themselves.

Source: Calculating GPU Memory for Serving LLMs 

For a Llama 3 70B model loaded in 16-bit precision, the calculation for GPU memory would be:

M = (70 × 4) / (32 / 16) × 1.2 = 168 GB

This results in an estimated GPU memory requirement of 168 GB for loading the model.
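
The same arithmetic is easy to script. Here's a minimal Python sketch of the formula above:

```python
def estimate_inference_vram_gb(params_billion: float, bits: int,
                               overhead: float = 1.2) -> float:
    """Estimate GPU memory in GB needed to load a model for inference,
    using the formula above: M = (P x 4B) / (32 / Q) x 1.2."""
    return (params_billion * 4) / (32 / bits) * overhead

# Llama 3 70B loaded in 16-bit precision
print(estimate_inference_vram_gb(70, 16))  # -> 168.0 GB
# The same model quantised to 4-bit
print(estimate_inference_vram_gb(70, 4))   # -> 42.0 GB
```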


VRAM Requirements for Fine-Tuning

Fine-tuning typically requires more VRAM compared to inference due to additional memory overhead. The VRAM needed for fine-tuning primarily depends on:

  • Model: The base model needs to be loaded in memory.

  • Precision: As with inference, the chosen precision affects VRAM consumption.

  • Optimiser states: Optimisers like AdamW and others maintain additional state information for each parameter, which increases VRAM usage.

  • Gradients: During backpropagation, gradients are computed and stored for each parameter. This increases memory usage as these gradients are typically stored in the same precision as the weights.

  • Activations: Activations from the forward pass must be retained until backpropagation is complete. This can significantly increase VRAM requirements, especially for deep networks with many layers.

  • Batch Size: For a batch size of 1, VRAM can be estimated by factoring in the memory used for model weights, gradients and optimiser states.

Total Memory Required:

The memory needed for fine-tuning and inference can vary based on the method used and the precision of the model.

  • Normal fine-tuning typically demands 3 to 4 times more memory than inference at the same precision.
  • Parameter-efficient fine-tuning significantly reduces memory usage, making it only slightly greater than inference.
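
As a rough sketch, this rule of thumb can be layered on top of the inference estimate from the previous section (it is only a guide; real usage also depends on the optimiser, batch size and sequence length):

```python
def estimate_inference_vram_gb(params_billion: float, bits: int) -> float:
    # Same formula as the inference section: M = (P x 4B) / (32 / Q) x 1.2
    return (params_billion * 4) / (32 / bits) * 1.2

def estimate_finetune_vram_gb(params_billion: float, bits: int,
                              multiplier: float = 4.0) -> float:
    # Rule of thumb from above: full fine-tuning needs roughly 3-4x
    # the inference memory at the same precision.
    return estimate_inference_vram_gb(params_billion, bits) * multiplier

# Llama 3 70B in 16-bit precision
print(estimate_inference_vram_gb(70, 16))      # ~168 GB for inference
print(estimate_finetune_vram_gb(70, 16, 3.0))  # ~504 GB, lower-end estimate
print(estimate_finetune_vram_gb(70, 16, 4.0))  # ~672 GB, upper-end estimate
```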

To help estimate your specific GPU memory requirements more accurately, we recommend using our GPU selector.

Choosing the Right GPU for LLM

Apart from the formula above, you can also use a GPU selector tool. Selecting the right GPU is imperative for efficiently training and deploying LLMs, and the ideal choice depends on factors such as model size, precision requirements and workload type.

To get started, you must consider the following:

GPU Type

Choosing high-performance cloud GPUs for LLMs like the NVIDIA A100 or NVIDIA H100 offers substantial VRAM and computational power, suitable for large-scale LLM tasks. Not sure which configuration is best for your model and requirements? Get started with Hyperstack GPU Selector Tool to find the ideal LLM GPU for fine-tuning and inference, customised for your project's needs. 

Fine-tuning vs Inference

Training typically requires more VRAM than inference. For example, fine-tuning a 70B parameter model at 16-bit precision may need 2 to 4 NVIDIA A100 GPUs with 80GB of memory each, while inference can be managed with fewer resources. For inference with a smaller model such as Meta-Llama-2-7B in float16 precision, a single NVIDIA RTX A6000 with 48GB of VRAM can handle the task efficiently.

See our Guide to Choosing the Right GPU for LLM to learn more!

Precision Levels

Using lower precision can reduce your VRAM requirements. For instance, 8-bit quantisation can halve memory usage compared to 16-bit precision. With simple parameter changes, you can load Hugging Face models at half precision, or use the bitsandbytes library to load models in 8-bit or 4-bit precision.
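
For example, with Hugging Face Transformers and the bitsandbytes integration, loading a model in half precision or 8-bit typically looks like this (the model ID is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative example

# Half precision (FP16): roughly halves weight memory vs FP32
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 8-bit quantisation via bitsandbytes: roughly halves memory again vs FP16
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```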

Deploy Your LLM on Hyperstack Now. Access Powerful GPUs for LLM in just minutes!

FAQs

How does VRAM affect LLM performance?

VRAM is critical for storing model parameters, activations, and intermediate data during training and inference. It ensures that large language models can run efficiently without frequent data transfers to system memory, minimising bottlenecks. Adequate VRAM leads to faster and smoother computations, enhancing model performance.

What happens if my GPU doesn't have enough VRAM for LLMs?

If your GPU lacks sufficient VRAM, the model will struggle with frequent data swaps between GPU and system memory. This can severely slow down training and inference or even make it impossible to load and process the model effectively. In the worst case, your GPU may run out of memory, leading to crashes.

How much VRAM do I need for fine-tuning a 70B Llama model?

Loading a Llama 70B model in 16-bit precision requires about 168 GB of VRAM, based on the formula in this blog (parameters, precision and a 20% overhead for activations and other data). Full fine-tuning typically demands 3 to 4 times more memory than that at the same precision, so you'll need multiple high-memory GPUs to manage the workload efficiently.

How do I choose the right GPU for fine-tuning an LLM?

When selecting a GPU for fine-tuning LLMs, consider factors like model size, precision requirements, and task type, such as training or inference. Powerful GPUs like the NVIDIA A100 and the NVIDIA H100 with 80 GB of VRAM are ideal for large-scale models like Llama 70B, while smaller models may require less powerful GPUs.

How can the GPU Selector Tool help me choose the best GPU for my LLM?

The Hyperstack GPU Selector Tool for LLM is designed to help you match your LLM’s specific requirements with the optimal GPU configuration. It takes into account model size, precision, batch size, and workload type, ensuring you find the GPU that meets your needs for both fine-tuning and inference. Check out our GPU Selector for LLM here!
