Updated: 18 Nov 2024
Choosing the right GPU for your large language model (LLM) workloads is critical for achieving high performance and efficiency. This practical guide will walk you through evaluating and selecting the best GPUs for LLMs. Whether you're looking for a GPU for LLM fine-tuning or deploying an LLM for inference tasks, we’ve got you covered.
Recommended GPUs for LLM Inference
For a detailed overview of suggested GPU configurations for running inference on LLMs with various model sizes and precision levels, refer to the table below. It shows the suggested LLM inference GPU requirements for the latest Llama-3-70B model and the older Llama-2-7B model.
Recommended GPUs for LLM Fine-tuning
For a detailed overview of suggested GPU configurations for fine-tuning LLMs with various model sizes, precisions and fine-tuning techniques, refer to the table below. It shows the suggested GPUs for the latest Llama-3-70B model and the older Llama-2-7B model.
Fine-tuning an LLM
If you're fine-tuning an existing open-source LLM, several design decisions will impact the GPU requirements:
Fine-tuning Technique
The first decision is whether to fine-tune the complete model (full supervised fine-tuning) or employ a more efficient method that adds or updates smaller parts of the model. In general, we recommend starting with parameter-efficient fine-tuning (PEFT) techniques such as LoRA (Low-Rank Adaptation) or QLoRA (Quantised Low-Rank Adaptation). These methods significantly reduce the GPU memory requirements compared to full supervised fine-tuning. HuggingFace provides a simple Python package, peft, to get started.
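To see why LoRA cuts memory so dramatically, it helps to count trainable parameters. The sketch below uses illustrative, 7B-class figures (32 layers, hidden size 4096, rank-8 adapters on two attention projections per layer), not exact numbers for any specific model:

```python
# Rough illustration of why LoRA shrinks the trainable footprint: instead of
# updating a full d_out x d_in weight matrix, LoRA trains two low-rank
# factors of shapes (d_out x r) and (r x d_in), with r much smaller than d.

def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for one LoRA adapter pair (the A and B matrices)."""
    return d_out * rank + rank * d_in

# Hypothetical 7B-class model: 32 layers, hidden size 4096,
# LoRA (rank 8) applied to two projections (e.g. q and v) per layer.
layers, hidden, rank = 32, 4096, 8
adapters_per_layer = 2
trainable = layers * adapters_per_layer * lora_trainable_params(hidden, hidden, rank)

full_model = 7_000_000_000  # full fine-tuning updates every parameter
print(f"LoRA trainable params: {trainable:,}")                    # 4,194,304
print(f"Fraction of full model: {trainable / full_model:.4%}")    # well under 0.1%
```

Only the adapter weights need gradients and optimiser state, which is where most of the memory savings come from; the frozen base weights can even be held in quantised form, as QLoRA does.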
Model Size
The size of the model you're fine-tuning, often expressed in billions of parameters, is crucial. Most open-source models have a smaller version (e.g. Meta’s Llama-3 8B) and a bigger version (e.g. Meta’s Llama-3 70B). Larger AI models require more GPU memory. Generally, start with smaller models tailored to your use case and scale up as needed.
Precision
Precision refers to the numerical representation of the model's parameters. Models can be trained and executed at different precision levels, such as full float32 precision, float16, bfloat16, int8 (8-bit quantisation), or int4 (4-bit quantisation). Lower precision (quantisation) reduces the GPU memory requirements but may introduce some accuracy trade-offs. Generally, it's advisable to use mixed precision (float16/bfloat16) or quantised precision (int8/int4) for large models to optimise GPU memory usage.
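A quick back-of-the-envelope calculation shows how precision drives memory. The sketch below counts weight storage only (it ignores activations, the KV cache and framework overhead) and uses decimal gigabytes:

```python
# Approximate GPU memory needed just to hold the model weights,
# ignoring activations, KV cache and runtime overhead.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Weight memory in (decimal) GB for a given parameter count and precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

params_70b = 70_000_000_000
for prec in ("float32", "float16", "int8", "int4"):
    print(f"Llama-3-70B @ {prec}: ~{weight_memory_gb(params_70b, prec):.0f} GB")
# float32 ~280 GB, float16 ~140 GB, int8 ~70 GB, int4 ~35 GB
```

These figures explain the recommendations later in this guide: a 70B model in float16 (~140 GB of weights) needs at least two 80 GB A100s, while int4 (~35 GB) can fit on a single 48 GB RTX A6000.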
Batch Size
The batch size determines how many training samples are processed by the model in each training batch. While a larger batch size can improve throughput, it also increases the GPU memory requirements. Finding the optimal batch size is crucial for maximising GPU utilisation while staying within memory constraints.
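Putting the fine-tuning factors above together, a commonly cited rule of thumb (an approximation, not an exact figure) is that full fine-tuning in mixed precision with the Adam optimiser needs roughly 16 bytes per parameter before activations are counted:

```python
# Common rule of thumb for full fine-tuning with Adam in mixed precision:
# per parameter you hold ~2 bytes (fp16 weights) + 2 bytes (fp16 gradients)
# + 4 bytes (fp32 master weights) + 8 bytes (fp32 Adam moments) ≈ 16 bytes.
# Activation memory comes on top and grows with batch size and sequence length.

BYTES_PER_PARAM_ADAM_MIXED = 16

def full_finetune_state_gb(num_params):
    """Approximate weights + gradients + optimiser state, in decimal GB."""
    return num_params * BYTES_PER_PARAM_ADAM_MIXED / 1e9

print(f"Llama-2-7B:  ~{full_finetune_state_gb(7_000_000_000):.0f} GB")   # ~112 GB
print(f"Llama-3-70B: ~{full_finetune_state_gb(70_000_000_000):.0f} GB")  # ~1120 GB
```

This is why full fine-tuning of even a 7B model already needs multiple GPUs, and why PEFT methods, which keep optimiser state only for the small adapter weights, are usually the pragmatic starting point.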
Inferencing an LLM
When deploying an LLM for text generation or other AI inference tasks, the following factors should be considered:
Inference Engine
Using an optimised inference engine like vLLM, TGI or Triton can significantly improve inference speeds and efficiency. These engines leverage techniques like kernel fusion, operator optimisation and quantisation to accelerate inference on GPUs.
Model Size
Similar to fine-tuning, the size of the LLM (in billions of parameters) directly impacts the GPU memory requirements for inference tasks. Larger models require more GPU memory for efficient inference.
Model Architecture
Certain LLM architectures, such as Mixture of Experts (MoE), optimise inference speeds compared to traditional dense architectures. MoE models leverage sparse and conditional computation, allowing for efficient inference on large models while optimising resource usage. For example, Mistral’s Mixtral 8x22B model has 141B parameters, but during inference it activates only 39B of them (i.e. its inference speed is similar to that of a 39B parameter model). However, the full model still needs to be held in memory, so you still need to fit all 141B parameters into GPU memory.
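The MoE trade-off can be made concrete with the Mixtral figures quoted above (total vs active parameter counts from the article; the memory estimate is a weights-only approximation):

```python
# Mixture of Experts: memory is driven by the *total* parameter count
# (all experts must be resident), while per-token compute is driven by the
# *active* parameter count. Figures are Mixtral 8x22B as quoted above.

total_params = 141_000_000_000   # must be held in GPU memory
active_params = 39_000_000_000   # used per token during inference

fp16_memory_gb = total_params * 2 / 1e9  # 2 bytes per param in fp16
print(f"Weights in fp16: ~{fp16_memory_gb:.0f} GB")            # ~282 GB
print(f"Active fraction per token: {active_params / total_params:.0%}")
```

So an MoE model buys you roughly dense-39B latency at the memory cost of a 141B model: a good fit when you have memory to spare but want faster token generation.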
Batching
Batching multiple inference requests together can improve throughput and GPU utilisation. However, larger batch sizes also increase memory requirements, so finding the optimal balance between batch size and memory usage is essential for efficient inference.
Input and Output Lengths
The expected lengths of input texts and generated outputs can affect GPU memory requirements and inference latency. Longer input texts generally require more GPU memory, while longer output texts increase the time spent on text generation, potentially impacting inference performance.
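The interaction of batching and sequence length shows up most clearly in the KV cache, which every transformer decoder keeps during generation. The sketch below assumes Llama-2-7B-like shapes (32 layers, 32 KV heads, head dimension 128, fp16) purely for illustration:

```python
# KV-cache memory grows linearly with BOTH batch size and sequence length
# (input plus generated tokens). Per token, each layer stores one key and
# one value vector: 2 * layers * kv_heads * head_dim * bytes_per_value.
# Shapes below are illustrative Llama-2-7B-like values in fp16.

def kv_cache_gb(batch, seq_len, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """Approximate KV-cache size in decimal GB for a batch of sequences."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token_bytes / 1e9

print(f"batch=1, 4096 tokens: ~{kv_cache_gb(1, 4096):.1f} GB")   # ~2.1 GB
print(f"batch=8, 4096 tokens: ~{kv_cache_gb(8, 4096):.1f} GB")   # ~17.2 GB
```

This is why long-context, high-batch serving can run out of memory even when the weights fit comfortably, and why inference engines invest heavily in KV-cache management (e.g. vLLM's paged attention).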
The recommendations here are just rough estimates to get you started. You'll want to do your own thorough testing and benchmarking to evaluate an LLM's performance and resource needs across different GPU setups. Want to know more about GPU selection and LLMs? Check out our presentation at NVIDIA GTC 2024, one of the world’s largest AI conferences.
Our GPU Selector Tool is Live!
Not sure about which configuration is best for your model and requirements?
Get started with our GPU Selector Tool to find the ideal GPU for fine-tuning and inference, tailored to your project's needs.
FAQs
How to choose the right GPU for LLM?
There is no one-size-fits-all solution when it comes to GPU selection for LLMs. It's essential to evaluate your unique needs, constraints and long-term goals to ensure that your GPU infrastructure can support the demanding workloads of these large models.
Which is the best GPU for fine-tuning LLM?
For a detailed overview of suggested GPU configurations for fine-tuning LLMs with various model sizes, precisions and fine-tuning techniques, refer to the bullets below.
- For full fine-tuning with float32 precision on Meta-Llama-3-70B model, the suggested GPU is 8x NVIDIA A100 (x2).
- For full fine-tuning with float16/bfloat16 precision on Meta-Llama-3-70B model, the suggested GPU is 4x NVIDIA A100.
- For full fine-tuning with float32 precision on the smaller Meta-Llama-2-7B model, the suggested GPU is 2x NVIDIA A100.
- For full fine-tuning with float16/bfloat16 precision on Meta-Llama-2-7B, the recommended GPU is 1x NVIDIA RTX-A6000.
Which is the best GPU for inferencing LLM?
For the larger and more recent Meta-Llama-3-70B model:
- For float32 precision, the recommended GPU is 4xA100
- For float16/bfloat16 precision, the recommended GPU is 2xA100
- For int8 precision, the recommended GPU is 2xRTX-A6000
- For int4 precision, the recommended GPU is 1xRTX-A6000
For the smaller and older Meta-Llama-2-7B model:
- For float32 precision, the recommended GPU is 1xRTX-A6000
- For float16/bfloat16 precision, the recommended GPU is 1xRTX-A6000