Updated: 26 Feb 2025
Deploying an LLM and looking to speed up inference? You’re not alone. One of the biggest headaches with LLMs is the slow, compute-heavy inference process, especially when every millisecond counts. But here’s the good news: with the right optimisation techniques, you can reduce latency, boost performance and make your LLM run efficiently at scale. Continue reading below to explore five LLM inference optimisation techniques that reduce latency and improve model performance.
What is LLM Inference?
LLM inference involves generating outputs from a trained model based on given inputs. Most popular decoder-only LLMs, such as Llama 3.3, are pre-trained on the causal language modelling objective and function as next-word predictors. They process a series of input tokens and generate subsequent tokens autoregressively until a stopping criterion is met, such as reaching a specified token limit or encountering a special end-of-sequence token.
This process comprises two main phases: the prefill phase and the decode phase.
Prefill Phase
In the prefill phase, the model processes the input tokens to compute intermediate states, specifically keys and values, essential for generating the first new token. Since the full extent of the input is known during this phase, the operations are highly parallelisable, effectively utilising GPU resources.
Decode Phase
The decode phase involves generating output tokens one at a time, with each new token depending on all previously generated tokens. This sequential nature makes the process less parallelisable, leading to the underutilisation of GPU compute capabilities. The latency in this phase is often dominated by the speed at which data such as model weights, keys, values and activations are transferred to the GPU from memory, making it a memory-bound operation.
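To make the two phases concrete, here is a minimal sketch of autoregressive generation with Hugging Face transformers. The model name is only an assumption (any decoder-only causal LM you have access to works the same way), and the example assumes a CUDA GPU with PyTorch and transformers installed.

```python
# Minimal sketch of the prefill and decode phases with a decoder-only LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumption: swap in any causal LM you have access to
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("Explain LLM inference in one sentence:", return_tensors="pt").to("cuda")

with torch.no_grad():
    # Prefill: all prompt tokens are processed in parallel; their keys/values
    # come back in `past_key_values`.
    out = model(**inputs, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(32):
        # Decode: one token at a time, reusing the cached keys/values,
        # so each step is small and memory-bandwidth bound.
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True))
```

The single prefill call processes the whole prompt in parallel, while each decode step feeds in just one new token plus the cached keys and values, which is why decode latency is dominated by memory traffic rather than raw compute.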
Challenges in LLM Inference
Several challenges arise during LLM inference, primarily due to the models' size and complexity:
- Computational Costs: LLMs have billions of parameters, requiring significant computational power for real-time inference.
- Memory Requirements: Storing and processing large models demand substantial memory resources, which can limit deployment.
- Latency and Scalability: The autoregressive nature of token generation in the decode phase introduces latency, and scaling the model to handle multiple requests simultaneously can be challenging.
5 Optimisation Techniques for LLM Inference
To address these challenges, various optimisation strategies can be employed:
1. Model Compression
Reducing the model's size without significantly impacting performance is a primary approach to optimisation. Techniques include:
- Quantisation: Converting model weights and activations to lower precision, for example from 32-bit floating point to 8-bit integers, reduces memory usage and accelerates computation. The aim is to lower the precision of the weights while keeping the model’s inference results as close as possible to those of the full-precision model (see the loading sketch after this list).
- Pruning: Removing redundant or less significant neurons and connections in the model decreases complexity and size.
- Knowledge Distillation: Training a smaller student model to replicate the behaviour of a larger teacher model can maintain performance while reducing size.
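As a hedged example of weight quantisation in practice, the snippet below loads a model with 8-bit weights via the bitsandbytes integration in Hugging Face transformers. The model name is an assumption and the `bitsandbytes` package must be installed; other toolchains (GPTQ, AWQ, GGUF) follow a similar load-and-generate pattern.

```python
# Sketch: loading a causal LM with 8-bit weight quantisation via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.2-1B"  # assumption: any decoder-only LM

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # INT8 weights, higher-precision activations

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",  # place the quantised weights on the available GPU(s)
)

# The quantised model is used exactly like the full-precision one.
inputs = tokenizer("Quantisation trades a little accuracy for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```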
2. Efficient Attention Mechanisms
The attention mechanism is a significant contributor to computational load, especially when handling long input sequences. Optimisations include:
- Flash Attention: An algorithm that computes attention more efficiently by reducing memory bandwidth requirements and improving throughput (illustrated in the sketch after this list).
- Sparse Attention: Limiting attention calculations to a subset of tokens reduces the number of operations, enhancing efficiency.
- Multi-head attention: Uses multiple attention heads to capture different aspects of input sequences, improving representation learning and model performance.
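As a concrete illustration, PyTorch's built-in `scaled_dot_product_attention` can dispatch to a fused, FlashAttention-style kernel on supported GPUs; in transformers you can request a similar path with `attn_implementation="flash_attention_2"` when the flash-attn package is installed. The shapes below are illustrative only.

```python
# Memory-efficient attention via PyTorch's fused scaled_dot_product_attention.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 8, 4096, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# is_causal=True applies the causal mask used by decoder-only LLMs without
# materialising the full seq_len x seq_len attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```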
3. Batching Strategies
Processing multiple input sequences simultaneously can improve GPU utilisation:
- Static Batching: Combining requests with similar input lengths into a single batch. However, this can be suboptimal if requests have varying lengths, leading to inefficiencies.
- Dynamic Batching: Grouping requests in real-time based on their arrival, allowing for more flexible and efficient processing. This approach can adapt to varying input sizes and reduce idle times (a toy scheduling sketch follows this list).
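The toy sketch below shows the core idea of dynamic batching: queue incoming requests and flush them as one batch when either the batch is full or a short wait window closes. The function names, batch size and timings are hypothetical placeholders; production servers such as vLLM or TGI implement far more sophisticated (continuous) schedulers.

```python
# Toy dynamic batching loop: collect requests, then run one batched model call.
import time
from collections import deque

MAX_BATCH_SIZE = 8
MAX_WAIT_SECONDS = 0.01

queue = deque()

def run_model(prompts):
    # Placeholder for a single batched forward/generate call on the GPU.
    return [f"output for: {p}" for p in prompts]

def submit(prompt):
    queue.append(prompt)

def batching_loop_once():
    if not queue:
        return []
    deadline = time.monotonic() + MAX_WAIT_SECONDS
    batch = []
    # Collect requests until the batch is full or the wait window closes.
    while len(batch) < MAX_BATCH_SIZE and time.monotonic() < deadline:
        if queue:
            batch.append(queue.popleft())
        else:
            time.sleep(0.001)
    return run_model(batch)

for p in ["short prompt", "a much longer prompt about LLM inference", "hi"]:
    submit(p)
print(batching_loop_once())
```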
You May Also Like to Read: Static vs. Continuous Batching for LLM Inference
4. Key-Value Caching
Storing the computed keys and values from previous tokens prevents redundant calculations (see the toy example after this list):
- Static KV Caching: Caching keys and values for the input sequence to reuse during the decode phase, reducing computation for each new token.
- Dynamic KV Caching: Updating the cache as new tokens are generated, which is beneficial for long sequences or streaming applications.
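The toy example below shows the mechanics with plain PyTorch tensors: the prompt's keys and values are computed once during prefill, and each decode step only appends one new row to the cache instead of recomputing keys and values for the whole sequence. The dimensions and random weights are illustrative only.

```python
# Toy KV cache: append one key/value row per generated token.
import torch
import torch.nn.functional as F

head_dim = 64
w_q = torch.randn(head_dim, head_dim)
w_k = torch.randn(head_dim, head_dim)
w_v = torch.randn(head_dim, head_dim)

def attend(q, k_cache, v_cache):
    scores = q @ k_cache.transpose(-2, -1) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v_cache

# Prefill: compute K/V for the whole prompt once and store them.
prompt = torch.randn(1, 10, head_dim)          # hidden states of 10 prompt tokens
k_cache, v_cache = prompt @ w_k, prompt @ w_v

# Decode: each new token only adds one row to the cache.
for _ in range(5):
    new_token = torch.randn(1, 1, head_dim)    # hidden state of the latest token
    k_cache = torch.cat([k_cache, new_token @ w_k], dim=1)
    v_cache = torch.cat([v_cache, new_token @ w_v], dim=1)
    out = attend(new_token @ w_q, k_cache, v_cache)

print(k_cache.shape)  # torch.Size([1, 15, 64]) -> 10 prompt + 5 generated tokens
```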
5. Distributed Computing and Parallelisation
Distributing the model and its computations across multiple devices can address memory and computational constraints (a short loading sketch follows this list):
- Model Parallelism: Splitting the model across multiple GPUs allows each device to handle a portion of the model, enabling the handling of larger models than a single device's memory would permit.
- Pipeline Parallelism: Dividing the model into stages, with each stage processed by a different device, allows for concurrent processing of different parts of the input sequence.
- Tensor Parallelism: Splitting individual tensors across multiple GPUs enables parallel computation of matrix operations, improving efficiency for large-scale models.
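As a hedged starting point, the snippet below shards a model's layers across all visible GPUs with `device_map="auto"` from the transformers/accelerate stack, which is a simple form of model parallelism. The model name is an assumption; dedicated serving frameworks such as vLLM or TensorRT-LLM are typically used for tensor and pipeline parallelism in production.

```python
# Simple model parallelism: shard a large model's layers across available GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B-Instruct"  # assumption: a model too large for one GPU

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",   # split layers across the available GPUs (and CPU, if needed)
)

# Inspect where each block of the model was placed.
print(model.hf_device_map)

inputs = tokenizer("Model parallelism lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```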
Get Started with LLMs on Hyperstack
Looking for the ideal platform to support your LLM needs? Get started with our cloud platform today. We provide on-demand access to powerful GPUs like the NVIDIA H100 SXM and the NVIDIA H100 PCIe, ideal for handling the computational demands of LLMs. With our LLM Inference Toolkit, you can streamline deployment, simplify LLM management and optimise model performance in real-time. It supports open-source LLMs with flexible deployment options for local or cloud setups. Check out our guide below!
Get Started with the Hyperstack LLM Inference Toolkit
FAQs
What is LLM inference?
LLM inference is the process of generating outputs from a trained language model based on given inputs, typically following an autoregressive token-by-token generation process.
How does quantisation help in LLM inference?
Quantisation reduces model precision (e.g., FP32 to INT8), lowering memory usage and speeding up computations with minimal accuracy loss.
What is key-value caching in LLM inference?
Key-value caching stores computed keys and values to avoid redundant calculations, significantly reducing computation time in the decode phase.
What is Flash Attention, and how does it improve performance?
Flash Attention optimises attention computation by reducing memory bandwidth requirements and improving throughput and efficiency.
How does dynamic batching differ from static batching?
Dynamic batching groups incoming requests in real-time for better GPU utilisation, whereas static batching processes fixed-size batches, which may be inefficient.
What is the Hyperstack LLM Inference Toolkit?
The Hyperstack LLM Inference Toolkit is a suite of tools designed to streamline LLM deployment, optimise performance, and simplify model management for local and cloud-based setups.