What is vLLM: A Guide to Quick Inference

Written by Damanpreet Kaur Vohra | Dec 26, 2024 9:45:00 AM

From chatbots and virtual assistants to complex data analysis and content generation, almost every industry relies on LLMs to power its AI systems. However, traditional inference techniques often struggle with scalability, high memory consumption and latency, leading to slower response times and inefficient resource use. To meet the demands of real-time AI applications, vLLM provides a powerful solution. Continue reading as we explore vLLM in our latest blog.

What is vLLM?

vLLM is an open-source library designed to enhance the efficiency of large language model (LLM) inference and serving. It introduces PagedAttention, an innovative attention algorithm that optimises memory management by allowing non-contiguous storage of attention keys and values, significantly reducing memory waste. This optimisation leads to state-of-the-art serving throughput, achieving up to 24 times higher throughput than traditional methods. vLLM supports continuous batching of incoming requests and integrates seamlessly with popular Hugging Face models, making it a flexible and high-performance solution for deploying LLMs in various applications.
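To give a feel for the library, here is a minimal offline-inference sketch using vLLM's Python API; the model ID and sampling settings below are placeholders you would swap for your own.

```python
# Minimal offline inference with vLLM's Python API.
# The model ID is a placeholder; any supported Hugging Face model works.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```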

Benefits of vLLM

vLLM offers several benefits for LLM applications including:

Open-source

vLLM is a freely available LLM inference and serving engine that allows developers to access, modify and contribute to its codebase.

High Performance

vLLM is one of the fastest inference servers, achieving up to 24 times higher throughput than traditional methods. So you can significantly reduce response times for your AI applications.

Broad Model Support

vLLM integrates seamlessly with a wide range of open-source models, such as the generative Transformer models in Hugging Face Transformers, including Llama 3.1, Llama 3, Mistral, Mixtral-8x7B, Qwen2 and more.

Easy Deployment

vLLM is a user-friendly tool. Its architecture minimises setup complexity, allowing you to get your models up and running quickly without requiring deep expertise in hardware optimisation or memory management. Check out our tutorials on getting started with vLLM.
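As an example of how light the setup can be, recent vLLM releases ship an OpenAI-compatible server that you can start with a single command (for instance `python -m vllm.entrypoints.openai.api_server --model <model-id>`, or `vllm serve <model-id>` in newer versions) and then query with the standard openai client. The endpoint and model ID below are placeholders; treat this as a sketch rather than a definitive deployment recipe.

```python
# Query a running vLLM OpenAI-compatible server with the standard openai client.
# The base URL and model ID are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key unless one is configured

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Give me one tip for faster LLM inference."}],
)
print(response.choices[0].message.content)
```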

How Does vLLM Achieve Fast Inference?

vLLM is designed to improve the efficiency and speed of LLM workloads. It achieves faster inference with several techniques including Paged Attention, Continuous Batching and Optimised CUDA Kernels. 

Let's explore how each of these improvements contributes to vLLM's performance.

Paged Attention for Memory-Efficient LLM Inference

One of the biggest challenges in LLM inference is the high memory consumption required to process large sequences of tokens. Traditional inference mechanisms struggle with scaling efficiently, often leading to excessive GPU memory usage and reduced throughput.

vLLM solves this problem with Paged Attention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Instead of reserving a contiguous slab of GPU memory for a sequence's maximum length upfront, Paged Attention allocates the KV cache in small, fixed-size blocks on demand. This approach:

  • Reduces GPU Memory Overhead: By using only the memory needed at any given moment, Paged Attention avoids unnecessary allocations. 
  • Enables Larger Context Windows: Developers can work with longer sequences without worrying about memory constraints.
  • Improves Scalability: Multiple models or larger batch sizes can run simultaneously on the same hardware.
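To make the idea concrete, here is a toy Python sketch of block-based KV-cache allocation. It is purely illustrative and not vLLM's actual implementation; the block size, pool size and class names are invented for the example.

```python
# Toy sketch of paged KV-cache allocation (illustrative only, not vLLM internals).
# Memory is handed out in fixed-size blocks from a shared pool as a sequence grows,
# so each request only occupies memory proportional to the tokens it has produced.

BLOCK_SIZE = 16  # tokens per block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id: str, num_tokens: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new block only when the previous one is full.
        if not table or num_tokens % BLOCK_SIZE == 1:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())

    def release(self, seq_id: str) -> None:
        # Finished sequences return their blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for t in range(1, 40):              # simulate one request generating 39 tokens
    cache.append_token("req-1", t)
print(len(cache.block_tables["req-1"]))  # 3 blocks, instead of a full-length reservation
```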

Continuous Batching to Maximise Throughput

Traditional batching methods in LLM inference often fail to fully utilise GPU resources. Static batching requires waiting until a batch is filled before processing, leading to underutilisation during periods of low activity.

vLLM introduces Continuous Batching, an innovative approach that dynamically merges incoming requests into ongoing batches. This system offers:

  • Higher Throughput: By continuously feeding the GPU with data, vLLM minimises idle time and maximises utilisation.
  • Reduced Latency: Real-time applications benefit from faster response times, as requests no longer have to wait for a full batch.
  • Support for Diverse Workloads: Continuous Batching adapts seamlessly to varying request sizes and arrival patterns, making it ideal for multi-tenant environments.
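As a rough illustration of the scheduling idea, the Python sketch below simulates a continuous-batching loop: requests join the running batch as soon as a slot is free and leave as soon as they finish. It is a conceptual toy, not vLLM's scheduler, and names such as MAX_BATCH are invented for the example.

```python
# Conceptual sketch of continuous batching (not vLLM's actual scheduler).
# Each step, every running request decodes one token; finished requests leave
# immediately and waiting requests are admitted as soon as a slot frees up.
from collections import deque

MAX_BATCH = 4  # illustrative batch-slot limit

def continuous_batching(requests):
    """requests: list of (arrival_step, tokens_to_generate) tuples."""
    waiting = deque(sorted(requests))  # ordered by arrival step
    running = []                       # remaining tokens per running request
    step = 0
    while waiting or running:
        # Admit newly arrived requests whenever there is a free slot.
        while waiting and waiting[0][0] <= step and len(running) < MAX_BATCH:
            running.append(waiting.popleft()[1])
        # One decoding step: every running request produces one token.
        running = [r - 1 for r in running if r > 1]
        step += 1
    return step  # total decoding steps needed

# Example: a mix of short and long requests arriving at different times.
requests = [(0, 8), (0, 32), (1, 4), (2, 4), (3, 16)]
print("decoding steps:", continuous_batching(requests))
```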


Optimised CUDA Kernels

CUDA is NVIDIA's parallel computing platform for accelerating workloads on GPUs, and optimised CUDA kernels perform low-level GPU operations with maximum efficiency. vLLM takes this a step further by fine-tuning its kernels specifically for LLM inference, including integration with FlashAttention and FlashInfer, resulting in faster end-to-end inference. These optimised kernels are designed to exploit the full potential of NVIDIA GPUs such as the NVIDIA A100 and NVIDIA H100, ensuring top-tier performance across hardware generations.
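If you want to experiment with these kernel backends yourself, vLLM exposes an environment variable for choosing the attention implementation. The exact variable name and accepted values can change between releases, so treat the snippet below as an assumption to verify against your installed vLLM version.

```python
# Select the attention backend before vLLM initialises (version-dependent; verify the
# variable name and values against your vLLM release -- this is an assumption, not a
# guaranteed interface). The model ID is a placeholder.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # e.g. "FLASH_ATTN" or "FLASHINFER"

from vllm import LLM

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
print(llm.generate(["Hello, GPU!"])[0].outputs[0].text)
```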

Performance Comparison of vLLM with Other Alternatives

The official benchmark results published by the vLLM team compare vLLM v0.6.0 against several alternative serving engines, including TensorRT-LLM r24.07, SGLang v0.3.0 and LMDeploy v0.6.0a0. These benchmarks were conducted across various models and datasets to determine how well each engine performs under different conditions.

Benchmark Setup

The following configurations were used for benchmarking vLLM:

1. Models: vLLM's performance was tested on two popular models, Llama 3 8B and Llama 3 70B.

2. Hardware: The official vLLM benchmarks were carried out using high-end GPUs like the NVIDIA A100 and NVIDIA H100.

3. Datasets: The tests were performed using three datasets:

  • ShareGPT: Comprising 500 prompts randomly sampled from the ShareGPT dataset.
  • Prefill-heavy Dataset: A synthetic dataset using the sonnet dataset with an average of 462 input tokens and 16 output tokens.
  • Decode-heavy Dataset: Another synthetic dataset generated from the sonnet dataset, containing similar input tokens but with a significantly higher average of 256 output tokens.

4. Metrics: The benchmarks evaluated the engines on throughput, measured as the number of requests served per second (QPS) under simultaneous request loads.

See vLLM's latest benchmarks as reported by the vLLM team.
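If you want a rough, do-it-yourself sense of throughput for your own deployment (rather than the official benchmark harness), a simple sketch along the following lines can measure requests per second against a vLLM OpenAI-compatible endpoint. The URL, model ID and request count are placeholders.

```python
# Rough requests-per-second measurement against an OpenAI-compatible vLLM endpoint.
# Simplified sketch, not vLLM's official benchmark script; URL, model ID and request
# count are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> None:
    await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": f"Write sentence number {i} about GPUs."}],
        max_tokens=64,
    )

async def main(n: int = 100) -> None:
    start = time.perf_counter()
    await asyncio.gather(*(one_request(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    print(f"{n} requests in {elapsed:.1f}s -> {n / elapsed:.2f} requests/s")

asyncio.run(main())
```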

vLLM Performance Results

For throughput, vLLM showed the highest performance on NVIDIA H100 GPUs for both the Llama 3 8B and Llama 3 70B models compared to the other serving engines, outperforming all alternatives on the ShareGPT and Decode-heavy datasets. While TensorRT-LLM is a strong player in this space, especially with its hardware-optimised inference pipeline, vLLM showed higher throughput and shorter token-generation times. SGLang and LMDeploy also performed well in some instances, but vLLM surpassed them in both throughput and processing efficiency for more complex queries.

Source: vLLM Throughput Improvement 

Conclusion

vLLM significantly enhances memory management, reduces latency and boosts throughput, making it ideal for real-time AI applications. With broad model support and seamless integration, vLLM delivers unmatched performance and efficiency on high-end NVIDIA GPUs. Its user-friendly design enables quick deployment without extensive expertise, helping developers to build and run their LLM applications efficiently.


FAQs

What is vLLM? 

vLLM is an open-source library designed to accelerate large language model inference by optimising memory usage and throughput through innovations like Paged Attention and Continuous Batching.

How does vLLM improve memory management? 

vLLM utilises Paged Attention, which reduces GPU memory overhead by allocating memory in small, required chunks, enabling the use of larger context windows and more efficient scaling.

Which models does vLLM support?

vLLM integrates smoothly with popular models from the Hugging Face Transformers library, including Llama 3.1, Llama 3, and other generative transformer models like Mistral and Qwen2.

What is the advantage of using vLLM for real-time applications? 

vLLM’s optimisations ensure faster response times by reducing latency and allowing real-time inference processing, which is especially beneficial for applications that demand immediate results, like chatbots or interactive AI systems.
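For chat-style interfaces, you can also stream tokens from vLLM's OpenAI-compatible server as they are generated, which keeps perceived latency low. A brief sketch follows; the endpoint and model ID are placeholders for your own deployment.

```python
# Stream tokens from a vLLM OpenAI-compatible server as they are generated.
# Endpoint and model ID are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a short haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```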

Does vLLM require specific expertise to set up? 

No, vLLM is designed with ease of use in mind, and its architecture minimises setup complexity. You don’t need deep hardware or memory management expertise to deploy and start using it effectively.