
Published on 4 Jun 2024

Static vs. Continuous Batching for Large Language Model Inference


Updated: 18 Jun 2024

Large language models (LLMs) like Meta's Llama 3, Mistral's Mixtral and Cohere's Command-R+ offer powerful text generation capabilities, but serving inference requests for them requires careful consideration of batching strategies. In this blog post, we'll explore the difference between static and continuous batching for LLM inference and discuss their respective performance characteristics.

Also Read: Optimising AI Inference for Performance and Efficiency

Understanding Batching

Batching is the practice of bundling inference requests together for better GPU utilisation. By processing multiple prompts simultaneously, batching can significantly improve tokens-per-second (tps) throughput for LLM generation. However, there is a trade-off with latency: the more prompts we bundle together, the higher the throughput in tokens per second, but also the higher the latency of each response.
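
To make the trade-off concrete, here is a toy calculation with hypothetical numbers (not measurements from this post): a larger batch raises aggregate throughput, but every caller waits for the whole batch to finish.

```python
# Toy illustration of the throughput/latency trade-off (hypothetical numbers).
tokens_per_request = 512

# Batch size 1: assume a single request completes in 10 s.
single_latency_s = 10.0
single_tps = tokens_per_request / single_latency_s             # ~51 tps

# Batch size 16: assume the whole batch completes in 30 s.
batch_size, batch_latency_s = 16, 30.0
batch_tps = batch_size * tokens_per_request / batch_latency_s  # ~273 tps

print(f"batch=1:  {single_tps:.0f} tps, latency {single_latency_s:.0f} s")
print(f"batch=16: {batch_tps:.0f} tps, latency {batch_latency_s:.0f} s")
```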

There are two main types of inference workloads to consider:

  1. Offline inference: Text generation in non-interactive applications, such as nightly batch jobs that produce insights ready the next morning. In this case, latency is not an issue, and throughput (tokens per second) is the primary metric to optimise.
  2. Online inference: Text generation for interactive sessions like chatbots. Here, latency matters because it directly affects user experience, so latency and throughput must be balanced.

Static vs. Continuous Batching

Our analysis focuses on the difference between static and continuous batching:

Static Batching

Prompts are collected in batches and sent as a single multi-dimensional input to the LLM for inference. The LLM is instantiated as a Python object, and the batch is passed directly to the generate function of the inference engine (in this case, vLLM). vLLM is a fast, easy-to-use library for LLM inference and serving, designed for high-performance, efficient inference across a range of workloads and deployment scenarios. This approach is suitable for offline workloads only.
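
A minimal sketch of this pattern with vLLM's offline API is shown below; the model name, prompts and sampling settings are illustrative, not necessarily those used in the experiments.

```python
# Static batching sketch with vLLM's offline API (illustrative values).
from vllm import LLM, SamplingParams

prompts = [
    "Explain tensor parallelism in one paragraph.",
    "Summarise the benefits of batching for LLM inference.",
    # ... the rest of the batch goes here
]

sampling_params = SamplingParams(temperature=0.8, max_tokens=1024)

# The whole batch is handed to generate() in a single call, so the
# engine processes all prompts together on the GPU.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```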

Continuous Batching

Prompts are sent individually, in parallel, to an endpoint exposed by the vLLM container. This approach interacts with the server via the OpenAI-compatible API, which has become an industry standard. Continuous batching is a technique that schedules and preempts inference requests in real time to respond to dynamic changes in the inference server load. It is suitable for both offline and online workloads.
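
As a rough sketch, requests can be fired in parallel at the vLLM server with the standard OpenAI client; the base URL, API key and model name below are placeholders for whatever your deployment exposes.

```python
# Continuous batching sketch: independent requests sent in parallel to a
# vLLM container exposing an OpenAI-compatible endpoint (placeholder values).
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [
    "Explain tensor parallelism in one paragraph.",
    "Summarise the benefits of batching for LLM inference.",
]

def generate(prompt: str) -> str:
    # Each request is independent; the server's continuous batching
    # scheduler decides how to group requests on the GPU.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    results = list(pool.map(generate, prompts))
```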

Experimental Setup

In our experiment, we used the Llama 3 70B Instruct model (meta-llama/Meta-Llama-3-70B-Instruct) as the LLM, evaluated on a 2x NVIDIA A100 compute node. The model was sharded across the two GPUs using tensor parallelism. The experiments focused on open-ended text generation with a limit of 1024 tokens, meaning each response could range from a few tokens to 1024 tokens. This setup is representative of real-world use cases, where model responses typically vary in length.
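
For reference, this setup could be expressed in vLLM roughly as follows; the exact engine arguments used in the experiments are not given in the post, so treat this as an assumption.

```python
# Assumed configuration matching the described setup: Llama 3 70B Instruct
# sharded across two GPUs, with responses capped at 1024 tokens.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,  # shard the model across the 2x A100 GPUs
)
sampling_params = SamplingParams(max_tokens=1024)  # open-ended generation, up to 1024 tokens
```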

Results and Findings

Figure: Static vs. continuous batching benchmark results.

The key findings from our experiments are as follows:

  1. Increasing batch size improves throughput: Throughput (tokens per second) improved noticeably as the batch size increased from 1 (no batching) to 64.
  2. System saturation beyond batch size 64: Beyond 64 simultaneous requests, the system became saturated and tokens per second diminished. This effect was more pronounced for continuous batching, suggesting that the system could not keep up with the load. The logs showed messages about poor KV cache utilisation due to memory pressure, and these increased with larger batch sizes (e.g., from 100 to 200 simultaneous requests).
  3. Static batching outperforms continuous batching in some cases: In the regime where scaling to more tokens per second was successful, static batching was on par with or faster than continuous batching, achieving the highest throughput at a batch size of 64. This may be due to the overhead of the scheduling system in continuous batching, which struggled to balance the uneven generation times when a few very long responses dominated the overall generation time.

Conclusion

The choice between static and continuous batching depends on the specific use case and requirements. Static batching can provide higher throughput but is suitable only for offline workloads, while continuous batching is more flexible and can handle both offline and online workloads.

Sign up today to experience the power of high-end NVIDIA GPUs for LLMs!

