Choosing the right GPU for your large language model (LLM) workloads is critical for achieving high performance and efficiency. This practical guide will walk you through evaluating and selecting the best GPUs for LLMs. Whether you're looking for a GPU for LLM fine-tuning or deploying an LLM for inference tasks, we’ve got you covered.
For a detailed overview of suggested GPU configurations for LLM inference with various model sizes and precision levels, refer to the table below. This shows the suggested LLM inference GPU requirements for the latest Llama-3-70B model and the older Llama-2-7B model.
For a detailed overview of suggested GPU configurations for fine-tuning LLMs with various model sizes, precisions and fine-tuning techniques, refer to the table below. This shows the suggested GPUs for the latest Llama-3-70B model and the older Llama-2-7B model.
If you're fine-tuning an existing open-source LLM, several design decisions will impact the GPU requirements:
The first decision is whether to fine-tune the complete model (full supervised fine-tuning) or employ a more efficient method that adds or updates a small subset of the model's parameters. In general, we recommend starting with parameter-efficient fine-tuning (PEFT) techniques such as LoRA (Low-Rank Adaptation) or QLoRA (Quantised Low-Rank Adaptation). These methods significantly reduce GPU memory requirements compared to full supervised fine-tuning. HuggingFace provides a simple Python package, `peft`, to get started with these techniques.
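To see why PEFT methods are so much lighter, it helps to compare trainable parameter counts. The sketch below is a rough back-of-the-envelope calculation (plain Python, with Llama-2-7B-like shapes and a common LoRA setup assumed for illustration), counting the parameters a LoRA adapter adds when applied to the attention query and value projections:

```python
# Rough sketch: trainable parameters for full fine-tuning vs LoRA.
# Shapes approximate Llama-2-7B (32 layers, hidden size 4096);
# targeting q_proj and v_proj with a small rank is a common LoRA setup.

def lora_trainable_params(n_layers, hidden_size, rank, n_target_matrices):
    # Each targeted (hidden x hidden) weight gets two low-rank factors:
    # A (hidden x rank) and B (rank x hidden).
    per_matrix = 2 * hidden_size * rank
    return n_layers * n_target_matrices * per_matrix

full_params = 7_000_000_000          # full fine-tuning trains everything
lora_params = lora_trainable_params(n_layers=32, hidden_size=4096,
                                    rank=16, n_target_matrices=2)

print(f"Full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r=16):      {lora_params:,} trainable parameters "
      f"({100 * lora_params / full_params:.3f}% of the model)")
```

Because gradients and optimiser state are only kept for trainable parameters, training under 0.2% of the weights is where most of the GPU memory saving comes from.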
The size of the model you're fine-tuning, often expressed in billions of parameters, is crucial. Most open-source models have a smaller version (e.g. Meta’s Llama-3 8B) and a bigger version (e.g. Meta’s Llama-3 70B). Larger AI models require more GPU memory. Generally, start with smaller models tailored to your use case and scale up as needed.
Precision refers to the numerical representation of the model's parameters. Models can be trained and executed at different precision levels, such as full float32 precision, float16, bfloat16, int8 (8-bit quantisation), or int4 (4-bit quantisation). Lower precision (quantisation) reduces the GPU memory requirements but may introduce some accuracy trade-offs. Generally, it's advisable to use mixed precision (float16/bfloat16) or quantised precision (int8/int4) for large models to optimise GPU memory usage.
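As a rule of thumb, the memory needed just to hold the weights is the parameter count multiplied by the bytes per parameter at the chosen precision; activations, optimiser state and KV cache come on top of that. A minimal sketch of this arithmetic:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, optimiser state and KV cache, which add overhead.

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,   # same for bfloat16
    "int8": 1.0,
    "int4": 0.5,
}

def weight_memory_gb(n_params, precision):
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"Llama-3-70B @ {precision}: "
          f"{weight_memory_gb(70e9, precision):.0f} GB of weights")
```

For a 70B model this gives roughly 280 GB at float32 versus about 35 GB at int4, which is why quantisation is so attractive for fitting large models onto fewer GPUs.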
The batch size determines how many training samples the model processes in each training step. While a larger batch size can improve throughput, it also increases the GPU memory requirements. Finding the optimal batch size is crucial for maximising GPU utilisation while staying within memory constraints.
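If the batch size you want doesn't fit in memory, gradient accumulation is a common way to trade throughput for memory: activation memory scales with the micro-batch size, while the effective batch size is the micro-batch size times the number of accumulation steps. A framework-agnostic sketch of the bookkeeping, with illustrative numbers:

```python
# Gradient accumulation sketch: run several small micro-batches and
# step the optimiser once, emulating a larger batch within a memory cap.

def accumulation_steps(target_batch_size, micro_batch_size):
    # Number of micro-batches whose gradients are summed before one
    # optimiser step; assumes the target is a multiple of the micro-batch.
    assert target_batch_size % micro_batch_size == 0
    return target_batch_size // micro_batch_size

steps = accumulation_steps(target_batch_size=64, micro_batch_size=8)
print(f"Accumulate over {steps} micro-batches per optimiser step")
# GPU memory only has to fit a micro-batch of 8, not the full 64.
```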
When deploying an LLM for text generation or other AI inference tasks, the following factors should be considered:
Using an optimised inference engine like vLLM, TGI or Triton can significantly improve inference speeds and efficiency. These engines leverage techniques like kernel fusion, operator optimisation and quantisation to accelerate inference on GPUs.
Similar to fine-tuning, the size of the LLM (in billions of parameters) directly impacts the GPU memory requirements for inference tasks. Larger models require more GPU memory for efficient inference.
Certain LLM architectures, such as Mixture of Experts (MoE), optimise inference speeds compared to traditional dense architectures. MoE models leverage sparse and conditional computation, allowing for efficient inference on large models while optimising resource usage. For example, Mistral’s Mixtral 8x22B model has 141B parameters, but during inference it activates only 39B of them (i.e. inference speed is similar to that of a 39B-parameter model). However, the full model still needs to be held in memory, so you still need to fit all 141B parameters into GPU memory.
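The Mixtral numbers above can be made concrete: memory is set by the total parameter count, while per-token compute scales with the active parameters. A rough float16 comparison:

```python
# MoE sketch: memory follows TOTAL parameters, speed follows ACTIVE ones.
# float16 assumed (2 bytes per parameter); runtime overheads ignored.

def weights_gb(n_params, bytes_per_param=2.0):
    return n_params * bytes_per_param / 1e9

total_params = 141e9   # Mixtral 8x22B: all experts must sit in memory
active_params = 39e9   # parameters actually used for each token

print(f"Memory to hold weights: {weights_gb(total_params):.0f} GB")
print(f"Per-token compute comparable to a dense "
      f"{active_params / 1e9:.0f}B model")
```

So an MoE model buys you the speed of a smaller dense model, but not its memory footprint.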
Batching multiple inference requests together can improve throughput and GPU utilisation. However, larger batch sizes also increase memory requirements, so finding the optimal balance between batch size and memory usage is essential for efficient inference.
The expected lengths of input texts and generated outputs can affect GPU memory requirements and inference latency. Longer input texts generally require more GPU memory, while longer output texts increase the time spent on text generation, potentially impacting inference performance.
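Much of the sequence-length cost at inference time is the KV cache, which grows linearly with both batch size and total sequence length (prompt plus generated tokens). A rough sketch of the arithmetic, assuming Llama-2-7B-like shapes (32 layers, 32 KV heads of dimension 128, float16) for illustration:

```python
# Rough KV-cache estimate: 2 (keys and values) x layers x kv_heads
# x head_dim x bytes, per token per sequence.
# Llama-2-7B-like shapes assumed for illustration.

def kv_cache_gb(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                head_dim=128, bytes_per_value=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token / 1e9

# 16 concurrent requests at 4096 tokens each (prompt + output):
print(f"KV cache: {kv_cache_gb(batch_size=16, seq_len=4096):.1f} GB")
```

Note that models using grouped-query attention (such as Llama-3) have fewer KV heads than attention heads, which shrinks this cache considerably.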
The recommendations here are just rough estimates to get you started. You'll want to do your own thorough testing and benchmarking to evaluate an LLM's performance and resource needs across different GPU setups. Want to know more about GPU selection and LLMs? Check out our presentation at NVIDIA GTC 2024, one of the world’s largest AI conferences.
Our GPU Selector Tool is Live!
Not sure about which configuration is best for your model and requirements?
Get started with our GPU Selector Tool to find the ideal GPU for fine-tuning and inference, tailored to your project's needs.
There is no one-size-fits-all solution when it comes to GPU selection for LLMs. It's essential to evaluate your unique needs, constraints and long-term goals to ensure that your GPU infrastructure can support the demanding workloads of these large models.
For a detailed overview of suggested GPU configurations for fine-tuning LLMs with various model sizes, precisions and fine-tuning techniques, refer to the bullets below.
For the largest, most recent Meta-Llama-3-70B model:
For the smaller and older Meta-Llama-2-7B model: