

Published on 2 Oct 2024

Improving LLM Fine-Tuning and Inference with High-Speed Networking


Updated: 16 Oct 2024


As advanced LLMs like Llama 3.1-70B and Qwen2-72B scale in size and complexity, network efficiency becomes a bottleneck. Hyperstack's recently released high-speed networking with SR-IOV (Single Root I/O Virtualisation) addresses these challenges. SR-IOV allows multiple virtual machines (VMs) to share the same physical NIC (network interface card) while maintaining high-speed, low-latency communication. Continue reading this blog as we explore how SR-IOV improves fine-tuning and inference for LLM workloads.

Challenges in Multi-Node LLM Fine-Tuning

When fine-tuning LLMs across multiple nodes, one of the primary challenges is the efficiency of data transfer between virtual machines. Traditional networking setups such as VirtIO, which delivers speeds of around 10 Gbps, are not sufficient for the high-volume, low-latency data exchanges that large language models require. In a distributed fine-tuning setup, model weights, gradients and datasets need to be transferred between nodes continuously. This often results in bottlenecks, with nodes waiting on data from others, which increases total training time. AI inference at scale suffers from the same limitations, especially when models are deployed across multiple VMs to handle large-scale traffic.
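
To put the bandwidth requirement in perspective, here is a rough back-of-the-envelope estimate, not a measured benchmark: assuming a 70B-parameter model with 16-bit gradients and a standard ring all-reduce across a handful of VMs, the per-step gradient exchange alone runs into terabits, so the link speed directly bounds how often nodes can synchronise.

```python
# Rough estimate of per-step gradient synchronisation time for a 70B model.
# Assumptions (illustrative, not measured figures): bf16 gradients
# (2 bytes/param), no gradient sharding or compression, and a ring
# all-reduce, which moves roughly 2 * (N - 1) / N times the gradient
# volume per node.

params = 70e9                # model parameters (e.g. a Llama 3.1-70B-class model)
bytes_per_grad = 2           # bf16 gradient
nodes = 4                    # number of VMs in the fine-tuning cluster

grad_bytes = params * bytes_per_grad
ring_factor = 2 * (nodes - 1) / nodes
traffic_bits = grad_bytes * ring_factor * 8   # bits moved per node per step

for label, gbps in [("VirtIO (~10 Gbps)", 10), ("SR-IOV (~350 Gbps)", 350)]:
    seconds = traffic_bits / (gbps * 1e9)
    print(f"{label:>20}: ~{seconds:.1f} s of pure network time per optimiser step")
```

In practice, gradient sharding, overlap with computation and compression reduce these numbers, but the ratio between the two link speeds carries over directly.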

How SR-IOV Enhances Inter-VM Network Speeds

SR-IOV resolves many of these networking challenges by enabling direct hardware access for VMs, bypassing the software layer that slows down traditional network virtualisation such as VirtIO. This direct access boosts network throughput, allowing data transfers between VMs at speeds of up to 350 Gbps, compared with roughly 10 Gbps for VirtIO, as seen in the iperf tests conducted within Hyperstack's environment below.

SR-IOV Benchmarks

According to the benchmarking figures:

  • VirtIO (VM with virtio-net vNIC): peaks at 10.5 Gbps in single-thread iperf tests.
  • SR-IOV (VM with SR-IOV VF-LAG NIC): starts at 37.1 Gbps with a single thread and ramps up to 349 Gbps with 24-thread tests.
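
If you want to repeat this comparison on your own VMs, a minimal sketch is shown below. The peer address and stream counts are placeholders, not Hyperstack-specific values; run `iperf3 -s` on the second VM first.

```python
# Minimal sketch for reproducing the single- vs multi-stream throughput
# comparison between two VMs with iperf3.
import json
import subprocess

PEER = "10.0.0.2"  # hypothetical private IP of the VM running `iperf3 -s`

for streams in (1, 8, 24):
    result = subprocess.run(
        ["iperf3", "-c", PEER, "-P", str(streams), "-t", "10", "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
    print(f"{streams:>2} parallel stream(s): {gbps:.1f} Gbps")
```

Single-stream results show the per-connection ceiling, while the multi-stream runs are what saturate the SR-IOV link, which is why the 24-thread figure is the one that matters for collective operations such as all-reduce.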

This increase in network speed makes SR-IOV a major advantage for multi-node LLM fine-tuning tasks, where VMs constantly exchange large amounts of data.

SR-IOV Benefits for Multi-Node Fine-Tuning of LLMs

The key benefits of SR-IOV for multi-node fine-tuning of LLMs are:

Faster Data Transfer Across VMs

The multi-thread performance of SR-IOV significantly reduces inter-node communication delays. During LLM fine-tuning, where each node in the cluster trains on a subset of data, the ability to share updates, gradients and model checkpoints faster between VMs cuts down training time and allows more efficient scaling across nodes.

For instance, in scenarios where hyperparameters are continuously adjusted or where model updates need to be synchronised across multiple GPUs, the quick data movement offered by SR-IOV reduces waiting periods between nodes, accelerating the overall fine-tuning process.
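In a standard data-parallel setup, this synchronisation happens in the gradient all-reduce that backends such as NCCL run over the inter-VM network. The sketch below is a minimal, generic PyTorch DDP example (the launch command, node count and toy model are illustrative assumptions, not a Hyperstack-specific recipe) showing where that traffic occurs.

```python
# Minimal multi-node DDP sketch (PyTorch). Launch with torchrun on each VM, e.g.
#   torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d \
#            --rdzv_endpoint=<head-node-ip>:29500 train.py
# The gradient all-reduce that NCCL performs during backward() travels over
# the inter-VM network, which is where SR-IOV bandwidth pays off.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # NCCL for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()     # stand-in for a real LLM
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                            # toy training loop
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).pow(2).mean()
        loss.backward()                            # gradients all-reduced across VMs here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```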

Reduced Bottlenecks in Multi-Node Training

By eliminating the bottlenecks that arise from slow network communication, SR-IOV enhances data flow between nodes, even in complex LLM architectures. Traditional VirtIO setups cause significant delays when training large models like GPT or Llama across multiple VMs, as data transmission between these nodes is a critical component of large AI model training efficiency.

Efficient Resource Utilisation for Large-Scale Models

One key advantage of SR-IOV is its ability to more efficiently allocate network and compute resources across VMs. SR-IOV allows multiple VMs to share the same physical NIC without compromising on bandwidth or latency. This ensures that network speeds remain optimal even when multiple models or datasets are being processed concurrently. If you are working with distributed LLMs, this results in better utilisation of GPU resources and faster model convergence.

Impact on Inference

While network speed is often associated with multi-node training, SR-IOV also holds potential benefits for LLM inference workloads, especially in distributed setups. Although inference latency is largely determined by the distance and connection between the data centre and the user, SR-IOV's ability to cut data transfer times between VMs can still indirectly improve response times by speeding up the computation behind each request.

In distributed inference, where model partitions or ensemble models run on different VMs, faster data sharing between VMs means the model can respond to a user query more quickly. While this doesn't directly reduce network latency to the user, it ensures that the request is processed faster within the data centre, which can lead to quicker responses.
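As one illustration of this pattern, the sketch below uses the open-source vLLM library, which partitions a model across GPUs with tensor parallelism; when the same engine is spread over multiple VMs (for example via vLLM's pipeline parallelism on a Ray cluster), the activations crossing VM boundaries travel over this network. The model name and parallelism degree are illustrative assumptions, not a Hyperstack-specific configuration.

```python
# Minimal sketch of sharded LLM inference with vLLM. Model name and
# parallelism settings are placeholders for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model
    tensor_parallel_size=8,                     # shard the model across 8 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain SR-IOV in one paragraph."], params)
print(outputs[0].outputs[0].text)
```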

Conclusion

The release of our on-demand high-speed networking with SR-IOV is a significant upgrade for our users, offering high-performance networking to handle even the most demanding workloads like fine-tuning LLMs. With speeds up to 350 Gbps, SR-IOV can accelerate multi-node training and inference for faster data transfer and reduced bottlenecks. As we continue to innovate, we remain committed to introducing new features that help optimise AI workloads and drive more efficient and scalable solutions for our users.


FAQs

How does SR-IOV improve LLM fine-tuning?

SR-IOV reduces data transfer times between nodes, accelerating multi-node fine-tuning processes.

Can SR-IOV reduce inference latency?

While it doesn't directly reduce user-to-data centre latency, it speeds up inference calculations by improving inter-VM data flow.

Which GPUs support SR-IOV on Hyperstack?

SR-IOV is available on Hyperstack GPUs like NVIDIA H100 PCIe, NVIDIA H100 PCIe with NVLink and NVIDIA A100 with NVLink.
