Large language models (LLMs) such as Meta's Llama 3, Mistral's Mixtral, and Cohere's Command-R+ offer powerful text generation capabilities, but serving inference requests for these models requires careful consideration of batching strategies. In this blog post, we'll explore the difference between static and continuous batching for LLM inference and discuss their respective performance characteristics.
Batching is the practice of bundling inference requests together for better GPU utilisation. By processing multiple prompts simultaneously, batching can significantly improve the tokens-per-second (tps) throughput of LLM generation. However, there is a trade-off with latency: the more prompts we bundle together, the higher the overall throughput, but the longer each individual request has to wait for its response.
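As a rough illustration of this trade-off, consider the following back-of-the-envelope sketch; all numbers are hypothetical, not measured:

```python
# Hypothetical example: a batch of 8 prompts, each generating 256 tokens,
# completed in 10 seconds of wall-clock time.
batch_size = 8
tokens_per_request = 256
wall_clock_seconds = 10.0

# Throughput improves with batch size because the GPU works on all prompts at once.
throughput_tps = batch_size * tokens_per_request / wall_clock_seconds  # ~205 tps

# With naive batching, each response only becomes available once the whole
# batch finishes, so per-request latency is the full wall-clock time.
per_request_latency_s = wall_clock_seconds

print(f"throughput: {throughput_tps:.0f} tps, latency: {per_request_latency_s:.1f} s")
```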
There are two main types of inference workloads to consider: offline workloads, where all prompts are available up front and overall throughput is what matters most, and online workloads, where requests arrive in real time and per-request latency is critical.
Our analysis focuses on the difference between static and continuous batching:
Static batching: prompts are collected in batches and sent as a single multi-dimensional input to the LLM for inference. The LLM is instantiated as a Python object, and the input is passed directly as a parameter to the generate function of the inference engine (in this case, vLLM). vLLM is a fast and easy-to-use library for LLM inference and serving, designed for high-performance, efficient inference across a variety of workloads and deployment scenarios. This approach is suitable for offline workloads only.
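To make this concrete, here is a minimal sketch of the static batching path using vLLM's offline generate API; the prompts and sampling settings are illustrative rather than the exact ones used in our benchmark:

```python
from vllm import LLM, SamplingParams

# Instantiate the model once; tensor_parallel_size=2 shards the weights
# across two GPUs, mirroring our 2x A100 setup.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,
)

# A static batch: all prompts are collected up front and submitted together.
prompts = [
    "Explain static batching in one paragraph.",
    "Write a haiku about GPUs.",
    "Summarise the benefits of tensor parallelism.",
]
sampling_params = SamplingParams(max_tokens=1024, temperature=0.8)

# generate() runs inference over the whole batch and returns all completions.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```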
Continuous batching: prompts are sent individually, in parallel, to an endpoint exposed by the vLLM container. This approach interacts with the model via the OpenAI API, which has become an industry standard. Continuous batching is a technique that schedules and preempts inference requests in real time to respond to dynamic changes in the inference server load. It is suitable for both offline and online workloads.
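A minimal client-side sketch of this path might look as follows; the endpoint URL, API key and prompts are placeholders for an assumed local vLLM deployment serving the OpenAI-compatible API:

```python
# Assumes a vLLM server is already running with the OpenAI-compatible API,
# e.g. started with:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def generate(prompt: str) -> str:
    # Each request is sent independently; the server's continuous batching
    # scheduler decides how to group it with other in-flight requests.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return response.choices[0].message.content

prompts = ["Explain continuous batching.", "Write a haiku about GPUs."]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    completions = list(pool.map(generate, prompts))
```

Because each prompt arrives as an independent request, new prompts can join the server's running batch as soon as capacity frees up, which is what makes this approach suitable for online traffic.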
In our experiment, we used the Llama 3 70B Instruct model (meta-llama/Meta-Llama-3-70B-Instruct) as the LLM, evaluated on a compute node with 2x NVIDIA A100 GPUs. The model was sharded across the two GPUs using tensor parallelism. The experiments focused on open-ended text generation with a limit of 1024 output tokens, meaning each response could range from a few tokens to 1024 tokens. This setup is representative of real-world use cases, where model responses typically vary in length.
The key finding from our experiments is that the choice between static and continuous batching depends on the specific use case and requirements: static batching can deliver higher throughput but is suitable only for offline workloads, whereas continuous batching is more flexible and can handle both offline and online workloads.
Sign up today to experience the power of high-end NVIDIA GPUs for LLMs!