<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">

Access NVIDIA H100s from just $2.06/hour. Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

|

Published on 16 Jul 2024

How to Choose the Best GPU for LLM: A Practical Guide

Updated: 18 Nov 2024

Choosing the right GPU for your large language model (LLM) workloads is critical for achieving high performance and efficiency. This practical guide will walk you through evaluating and selecting the best GPUs for LLMs. Whether you're looking for a GPU for LLM fine-tuning or deploying an LLM for inference tasks, we've got you covered.

GPU Recommended for Inferencing LLM

For a detailed overview of suggested GPU configurations for inferencing LLMs with various model sizes and precision levels, refer to the table below. This shows the suggested LLM inference GPU requirements for the latest Llama-3-70B model and the older Llama-2-7B model.

[Table: Recommended GPU configurations for LLM inference]

GPU Recommended for Fine-tuning LLM

For a detailed overview of suggested GPU configurations for fine-tuning LLMs with various model sizes, precisions and fine-tuning techniques, refer to the table below. This shows the suggested GPU for the latest Llama-3-70B model and the older Llama-2-7B model.

[Table: Recommended GPU configurations for LLM fine-tuning]

Fine-tuning an LLM

If you're fine-tuning an existing open-source LLM, several design decisions will impact the GPU requirements:

Fine-tuning Technique

The first decision is whether to fine-tune the complete model (full supervised fine-tuning) or employ a more efficient method that adds or updates smaller parts of the model. In general, we recommend starting with parameter-efficient fine-tuning (PEFT) techniques such as LoRA (Low-Rank Adaptation) or QLoRA (Quantised Low-Rank Adaptation). These methods significantly reduce GPU memory requirements compared to full fine-tuning. Hugging Face provides a simple Python package, peft, to get started.
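As a minimal sketch of attaching a LoRA adapter with the Hugging Face peft package (the model name and hyperparameters below are illustrative assumptions, not recommendations):

```python
# Minimal LoRA fine-tuning sketch using Hugging Face transformers + peft.
# Model name and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```

The wrapped model then trains only the small adapter matrices, which is what keeps the GPU memory footprint low compared to full fine-tuning.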

Model Size

The size of the model you're fine-tuning, often expressed in billions of parameters, is crucial. Most open-source models have a smaller version (e.g. Meta's Llama-3 8B) and a bigger version (e.g. Meta's Llama-3 70B). Larger AI models require more GPU memory. Generally, start with smaller models tailored to your use case and scale up as needed.
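As a quick sanity check on how model size maps to memory, weight memory is roughly the parameter count multiplied by the bytes per parameter. This ignores activations, optimiser state and the KV cache, so treat it as a lower bound:

```python
# Back-of-the-envelope estimate of GPU memory needed just to hold the model weights.
# Real usage is higher: activations, KV cache, optimiser states (for training), etc.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(weight_memory_gb(70, 2))  # Llama-3-70B in float16/bfloat16 -> ~130 GB
print(weight_memory_gb(8, 2))   # Llama-3-8B in float16/bfloat16  -> ~15 GB
```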

Precision

Precision refers to the numerical representation of the model's parameters. Models can be trained and executed at different precision levels, such as full float32 precision, float16, bfloat16, int8 (8-bit quantisation), or int4 (4-bit quantisation). Lower precision (quantisation) reduces the GPU memory requirements but may introduce some accuracy trade-offs. Generally, it's advisable to use mixed precision (float16/bfloat16) or quantised precision (int8/int4) for large models to optimise GPU memory usage.
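The sketch below shows how the same model can be loaded at different precisions with the Hugging Face transformers library; the 4-bit path assumes the bitsandbytes integration is installed, and the model name is illustrative:

```python
# Loading the same model at different precisions (illustrative model name).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"

# Half precision (bfloat16): roughly halves weight memory versus float32.
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# 4-bit quantisation (as used by QLoRA): shrinks weight memory further,
# with a possible accuracy trade-off.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```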

Batch Size

The batch size determines how many training samples are processed by the model in each training batch. While a larger batch size can improve throughput, it also increases the GPU memory requirements. Finding the optimal batch size is crucial for maximising GPU utilisation while staying within memory constraints.
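If the batch size you want does not fit in GPU memory, gradient accumulation is a common workaround: a smaller per-device batch is processed several times before each optimiser step, preserving the effective batch size at lower peak memory. A minimal sketch, assuming the Hugging Face Trainer and illustrative values:

```python
# Illustrative Trainer settings: trade per-device batch size against
# gradient accumulation to stay within GPU memory.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llm-finetune",
    per_device_train_batch_size=2,   # samples processed per GPU per step
    gradient_accumulation_steps=8,   # effective batch size = 2 x 8 = 16 per GPU
    bf16=True,                       # mixed precision to reduce memory
)
```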

Inferencing an LLM

When deploying an LLM for text generation or other AI inference tasks, the following factors should be considered:

Inference Engine

Using an optimised inference engine like vLLM, TGI or Triton can significantly improve inference speeds and efficiency. These engines leverage techniques like kernel fusion, operator optimisation and quantisation to accelerate inference on GPUs.
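As a minimal sketch of what serving with one of these engines looks like, the example below assumes vLLM is installed; the model name and sampling parameters are illustrative:

```python
# Minimal vLLM example: the engine batches concurrent requests and manages
# KV-cache memory to improve GPU throughput.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Explain what a GPU is in one sentence."], params)
print(outputs[0].outputs[0].text)
```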

Model Size

Similar to fine-tuning, the size of the LLM (in billions of parameters) directly impacts the GPU memory requirements for inference tasks. Larger models require more GPU memory for efficient inference.

Model Architecture

Certain LLM architectures, such as Mixture of Experts (MoE), offer faster inference than traditional dense architectures. MoE models leverage sparse, conditional computation, allowing for efficient inference on large models while optimising resource usage. For example, Mistral's Mixtral 8x22B model has 141B parameters, but during inference it activates only 39B of them, so its inference speed is similar to that of a 39B-parameter dense model. However, the full model still needs to be held in memory, so you must still fit all 141B parameters into GPU memory.

Batching

Batching multiple inference requests together can improve throughput and GPU utilisation. However, larger batch sizes also increase memory requirements, so finding the optimal balance between batch size and memory usage is essential for efficient inference.

Input and Output Lengths

The expected lengths of input texts and generated outputs can affect GPU memory requirements and inference latency. Longer input texts generally require more GPU memory, while longer output texts increase the time spent on text generation, potentially impacting inference performance.
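One reason longer contexts matter is the KV cache, which grows linearly with sequence length and batch size. A rough per-sequence estimate, assuming the published Llama-3-70B dimensions (80 layers, 8 KV heads, head dimension 128) and float16 storage, ignoring framework overhead:

```python
# Rough KV-cache size estimate per sequence: 2 (keys and values) x layers
# x KV heads x head dimension x sequence length x bytes per element.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# 8K-token context for a Llama-3-70B-class model in float16:
print(kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=8192))  # ~2.5 GB per sequence
```

Multiply by the number of concurrent sequences to see how quickly long contexts eat into GPU memory on top of the model weights.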

The recommendations here are just rough estimates to get you started. You'll want to do your own thorough testing and benchmarking to measure an LLM's performance and resource needs across different GPU setups. Want to know more about GPU selection and LLMs? Check out our presentation at NVIDIA GTC 2024, one of the world's largest AI conferences.

Our GPU Selector Tool is Live!

Not sure which configuration is best for your model and requirements?

Get started with our GPU Selector Tool to find the ideal GPU for fine-tuning and inference, tailored to your project's needs. 

FAQs

How to choose the right GPU for LLM?

There is no one-size-fits-all solution when it comes to GPU selection for LLMs. It's essential to evaluate your unique needs, constraints and long-term goals to ensure that your GPU infrastructure can support the demanding workloads of these large models.

Which is the best GPU for fine-tuning LLM?

For a detailed overview of suggested GPU configurations for fine-tuning LLMs with various model sizes, precisions and fine-tuning techniques, refer to the bullets below. 

  • For full fine-tuning with float32 precision on the Meta-Llama-3-70B model, the suggested GPU is 8x NVIDIA A100 (x2).
  • For full fine-tuning with float16/bfloat16 precision on the Meta-Llama-3-70B model, the suggested GPU is 4x NVIDIA A100.
  • For full fine-tuning with float32 precision on the smaller Meta-Llama-2-7B model, the suggested GPU is 2x NVIDIA A100.
  • For full fine-tuning with float16/bfloat16 precision on Meta-Llama-2-7B, the recommended GPU is 1x NVIDIA RTX-A6000.

Which is the best GPU for inferencing LLM?

For the larger and more recent Meta-Llama-3-70B model:

  • For float32 precision, the recommended GPU is 4xA100
  • For float16/bfloat16 precision, the recommended GPU is 2xA100
  • For int8 precision, the recommended GPU is 2xRTX-A6000
  • For int4 precision, the recommended GPU is 1xRTX-A6000

For the smaller and older Meta-Llama-2-7B model:

  • For float32 precision, the recommended GPU is 1xRTX-A6000
  • For float16/bfloat16 precision, the recommended GPU is 1xRTX-A6000
