<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">

Access NVIDIA H100s from just $2.06/hour. Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

|

Published on 21 Jun 2024

SR-IOV Series Part 1: Introduction to High-Performance Ethernet Networks for N3 Flavours Based on SR-IOV

Updated: 3 Jul 2024

The growth of cloud computing has led to rising demand for robust, high-performance networking solutions. As more companies migrate their AI workloads to the cloud, moving vast amounts of data quickly between GPUs during large-scale AI model training has become imperative. Two leading fabric technologies address this need: RoCE and InfiniBand.

While InfiniBand has already been offered through our Supercloud, we are proud to introduce our latest innovation on Hyperstack: High-Performance Ethernet networks for virtual machines (VMs) powered by Single Root I/O Virtualisation (SR-IOV). This innovation aligns with our mission to democratise cloud services and to accelerate AI adoption through accessible, affordable, efficient, secure and sustainable solutions. Our SR-IOV-enabled networks deliver super-fast data transfer rates and optimal resource utilisation for your AI workloads.

What is SR-IOV?

Single Root I/O Virtualisation (SR-IOV) is a technology that enables a single physical network device, such as an Ethernet adapter, to be shared among multiple VMs while maintaining near-native performance. SR-IOV achieves this by allowing the physical device to appear as multiple separate devices known as Virtual Functions (VFs). Each VF can be assigned directly to a VM, bypassing the hypervisor's network stack. This reduces overhead and latency, so network-intensive tasks are far less likely to run into CPU bottlenecks.
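
To make this more concrete, below is a minimal sketch of how SR-IOV VFs are exposed on a typical Linux host. The interface name is a placeholder, and an SR-IOV-capable NIC, driver support and root privileges are assumed:

    # Minimal sketch: inspecting and enabling SR-IOV Virtual Functions (VFs)
    # via sysfs on a Linux host. "enp3s0f0" is a hypothetical physical
    # function (PF) interface name; root privileges are required to write.
    from pathlib import Path

    IFACE = "enp3s0f0"
    dev = Path(f"/sys/class/net/{IFACE}/device")

    total_vfs = int((dev / "sriov_totalvfs").read_text())   # VFs the NIC can expose
    current_vfs = int((dev / "sriov_numvfs").read_text())   # VFs currently enabled
    print(f"{IFACE}: {current_vfs}/{total_vfs} VFs enabled")

    # Enable 8 VFs; each VF can then be passed through to a separate VM.
    if current_vfs == 0:
        (dev / "sriov_numvfs").write_text("8")

Each enabled VF appears as its own PCI device that the hypervisor can hand directly to a VM via PCI passthrough.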

What is RDMA?

RDMA stands for Remote Direct Memory Access. It is a technology that allows one computer to directly access the memory of another computer over a high-speed network without involving the operating system or processor of the remote system. This direct memory access eliminates the overhead associated with traditional network communication, so you can achieve significantly lower latency and higher throughput.
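
As a small illustration, on Linux both InfiniBand adapters and RoCE-capable Ethernet NICs register with the same kernel RDMA subsystem, so a quick way to see whether RDMA is available on a node is to list the devices that subsystem exposes (a sketch, assuming the RDMA driver stack is loaded):

    # Minimal sketch: listing RDMA-capable devices on a Linux host.
    # InfiniBand HCAs and RoCE-capable NICs both appear under
    # /sys/class/infiniband once the kernel RDMA stack and NIC driver are loaded.
    from pathlib import Path

    rdma_root = Path("/sys/class/infiniband")
    if rdma_root.is_dir():
        for dev in sorted(rdma_root.iterdir()):
            node_type = (dev / "node_type").read_text().strip()  # e.g. "1: CA"
            print(f"{dev.name}: {node_type}")
    else:
        print("No RDMA devices found (is the driver stack loaded?)")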

What is RoCE?

Building upon the foundation of RDMA, RoCE stands for RDMA over Converged Ethernet. It is a network protocol that enables RDMA over an Ethernet network.  There are two versions of RoCE:

  1. RoCE v1: This initial version operates at Layer 2 of the Ethernet stack. It is not routable and can only be used within a single Ethernet broadcast domain, such as one data centre or subnet.
  2. RoCE v2: This version addresses the limitations of RoCE v1 by encapsulating RDMA traffic in UDP/IP packets, making it routable across Layer 3 networks and improving overall performance and flexibility. By using a Converged Ethernet infrastructure, RoCE v2 allows traditional Ethernet traffic and RDMA traffic to coexist on the same network. This convergence streamlines network management and eliminates the need for a separate RDMA fabric, making RoCE v2 more accessible and cost-effective. A quick way to check which version a device is using is shown below.
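
The kernel records which RoCE version is in use for each GID entry of an RDMA device, so a check looks roughly like the sketch below. The device name mlx5_0 and port 1 are placeholders for whatever your NIC exposes:

    # Minimal sketch: reporting the RoCE version ("IB/RoCE v1" or "RoCE v2")
    # recorded for each populated GID entry of a hypothetical device and port.
    from pathlib import Path

    port = Path("/sys/class/infiniband/mlx5_0/ports/1")
    for type_file in sorted((port / "gid_attrs" / "types").iterdir(),
                            key=lambda p: int(p.name)):
        try:
            roce_type = type_file.read_text().strip()
            gid = (port / "gids" / type_file.name).read_text().strip()
        except OSError:
            continue  # unpopulated GID entries cannot be read
        print(f"GID {type_file.name}: {gid} -> {roce_type}")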

Why is SR-IOV Relevant for Modern Applications?

Modern applications, particularly in the AI domain, are growing in complexity and scale, so high-performance, low-latency network solutions have become a necessity. While traditional virtual network interfaces do offer flexibility, they often introduce significant overhead that leads to reduced throughput and increased latency.

Although InfiniBand is a proven option, implementing it can be costly and involves vendor lock-in, which significantly complicates integration. For example, Meta (formerly Facebook), the founder of the Open Compute Project, decided to build two 24k-GPU clusters: one with RoCE and another with InfiniBand. They optimised the RoCE cluster for quick build time, while the InfiniBand cluster was designed for full-bisection bandwidth. Interestingly, both clusters were used to train Llama 3, with the RoCE cluster employed for training the largest model.

However, the challenges of large-scale AI model training extend far beyond connectivity. They also include hardware reliability, fast recovery from failures, efficient preservation of the training state and optimal connectivity between GPUs. These factors must all be addressed to realise the full potential of such applications.

Advantages of SR-IOV in Cloud Environments

In cloud computing where performance and scalability are paramount, SR-IOV offers numerous advantages that allow organisations to fully leverage their cloud environments:

  • Reduced Latency: SR-IOV significantly reduces network latency by bypassing the hypervisor's network stack, making it an ideal choice for latency-sensitive applications. The advantage is further amplified by RoCE, which brings similar low-latency benefits to Ethernet-based networks. By leveraging RDMA technology, RoCE minimises CPU involvement and reduces overhead, driving latency down substantially. Typical latency for RoCE v2 ranges from 2-5 microseconds, slightly higher than InfiniBand (1-2 microseconds) but still significantly lower than traditional Ethernet-based networking (10-20 microseconds).
  • Increased Throughput: Direct access to the network hardware, facilitated by SR-IOV, allows for higher data transfer rates, improving the overall performance of virtual machines. While traditional Ethernet without SR-IOV typically achieves throughput in the range of 8.5-17 Gbps, organisations using SR-IOV can reach throughput levels of up to 400 Gbps (a rough comparison follows this list).
  • Scalability: As cloud environments scale to meet the demands of modern applications, SR-IOV provides a reliable solution to maintain high network performance across a growing number of VMs. With the on-demand capabilities offered by Hyperstack, you can seamlessly accommodate bursts of batched workloads.
  • Cost-Efficiency: By maximising the performance of existing hardware, SR-IOV can reduce the need for additional network infrastructure, leading to significant cost savings. Hyperstack offering SR-IOV at no extra cost further adds to the cost-effectiveness of this technology, making it accessible to organisations of all sizes and budgets.
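
To put the throughput figures above into perspective, here is a rough, back-of-the-envelope calculation of what they mean for a common AI task such as moving a model checkpoint between nodes (real-world numbers will vary with protocol overhead, congestion and storage speed):

    # Back-of-the-envelope sketch using the figures quoted above: time to move
    # a 100 GB model checkpoint at different line rates. Ignores protocol
    # overhead, congestion and storage bottlenecks, so treat it as a rough bound.
    CHECKPOINT_GB = 100
    gigabits = CHECKPOINT_GB * 8

    for label, gbps in [("Traditional Ethernet (~10 Gbps)", 10),
                        ("SR-IOV with RoCE v2 (up to 400 Gbps)", 400)]:
        print(f"{label}: ~{gigabits / gbps:.0f} s")

At 10 Gbps the transfer takes roughly 80 seconds; at 400 Gbps it drops to around 2 seconds, which is the kind of difference SR-IOV-enabled networking makes at checkpoint and data-loading time.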

SR-IOV Solutions on Hyperstack 

At Hyperstack, we believe that SR-IOV, especially with RoCE v2, is a game-changer for cloud networking, which is why we are extending our offering beyond our existing Supercloud InfiniBand solution. The recent collaboration between Arista Networks and NVIDIA, along with the increasing demand for high-performance networking solutions, strengthens our stance. Our implementation of high-performance Ethernet networks for VMs leverages the power of SR-IOV to deliver unparalleled performance, reliability and efficiency. With the capabilities of RoCE v2, we enable direct access to the underlying hardware, bypassing the traditional software-based network stack. This direct path eliminates unnecessary overhead, resulting in significantly reduced latency and increased throughput, all of which are critical for latency-sensitive applications and data-intensive workloads. Whether you're running AI inference, AI fine-tuning or real-time data analytics, our High-Performance Ethernet Networks for N3 Flavours with SR-IOV-enabled networking are designed to meet your specific requirements.


Don't miss out!

Check out our next post "SR-IOV Series Part 2: Technical Details of RDMA, RoCE and Performance Benchmarks".

Sign up now to explore our platform and see how it can benefit you.


FAQs

What is SR-IOV?

Single Root I/O Virtualisation (SR-IOV) is a technology that enables a single physical network device to be shared among multiple VMs while maintaining near-native performance.

What is the advantage of RoCE over traditional Ethernet?

RoCE (RDMA over Converged Ethernet) leverages RDMA technology to minimise CPU involvement and reduce overhead, leading to significantly lower latency and higher throughput.

Why is SR-IOV relevant for modern applications?

SR-IOV is relevant for modern applications, especially in AI, due to the increasing demand for high-performance, low-latency network solutions to handle complex workloads and large-scale model training.

How does SR-IOV improve throughput in cloud environments?

By allowing direct access to the network hardware, SR-IOV can achieve throughput levels of up to 400 Gbps, compared to 8.5-17 Gbps with traditional Ethernet without SR-IOV.

What is the advantage of Hyperstack's SR-IOV solution?

Hyperstack's implementation of high-performance Ethernet networks with SR-IOV and RoCE v2 delivers unparalleled performance, reliability and efficiency while being cost-effective and accessible to organisations of all sizes.
