The growth of cloud computing has led to rising demand for robust, high-performance networking solutions. As more companies migrate their AI workloads to the cloud, transferring vast amounts of data quickly between GPUs during large-scale AI model training has become imperative. To meet this need, there are two leading choices: RoCE and InfiniBand fabrics.
While InfiniBand is already offered through our Supercloud, we are proud to introduce our latest innovation on Hyperstack: High-Performance Ethernet networks for virtual machines (VMs) powered by Single Root I/O Virtualisation (SR-IOV). This innovation aligns with our mission to democratise cloud services and our goal of accelerating AI adoption through accessible, affordable, efficient, secure and sustainable solutions. Our SR-IOV-enabled networking delivers super-fast data transfer rates and optimal resource utilisation for your AI workloads.
Single Root I/O Virtualisation (SR-IOV) is a technology that enables a single physical network device, such as an Ethernet adapter, to be shared among multiple VMs while maintaining near-native performance. SR-IOV achieves this by allowing the physical device to appear as multiple separate devices known as Virtual Functions (VFs). Each VF can be assigned directly to a VM, bypassing the hypervisor's network stack. This reduces overhead and latency, so network-bound tasks are far less likely to run into CPU bottlenecks.
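To make the VF concept concrete, here is a minimal sketch of how VFs can be inspected and created on a generic Linux host through the kernel's standard sysfs interface. The interface name "eth0" is a placeholder, and it assumes root access on an SR-IOV-capable NIC; it is not Hyperstack-specific.

```python
# Minimal sketch: inspecting and enabling SR-IOV Virtual Functions (VFs)
# on a Linux host via the kernel's sysfs interface. Assumes root access,
# an SR-IOV-capable NIC, and a placeholder interface name ("eth0").
from pathlib import Path

IFACE = "eth0"  # hypothetical interface name; replace with your own
DEV = Path(f"/sys/class/net/{IFACE}/device")

def total_vfs() -> int:
    """Maximum number of VFs the physical function supports."""
    return int((DEV / "sriov_totalvfs").read_text())

def current_vfs() -> int:
    """Number of VFs currently configured."""
    return int((DEV / "sriov_numvfs").read_text())

def enable_vfs(count: int) -> None:
    """Create `count` VFs; each can then be passed through to a VM."""
    if count > total_vfs():
        raise ValueError(f"{IFACE} supports at most {total_vfs()} VFs")
    if current_vfs() != 0:
        # The kernel requires resetting to 0 before setting a new VF count.
        (DEV / "sriov_numvfs").write_text("0")
    (DEV / "sriov_numvfs").write_text(str(count))

if __name__ == "__main__":
    print(f"{IFACE}: {current_vfs()} of {total_vfs()} VFs enabled")
```

Each VF created this way shows up as its own PCI device, which is what allows it to be handed directly to a VM without going through the hypervisor's network stack.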
RDMA stands for Remote Direct Memory Access. It is a technology that allows one computer to directly access the memory of another computer over a high-speed network without involving the operating system or CPU of the remote system. This direct memory access eliminates the overhead associated with traditional network communication, so you can achieve significantly lower latency and higher throughput.
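As a rough illustration of what "bypassing the operating system" looks like in practice, below is a minimal sketch using pyverbs, the Python bindings shipped with rdma-core. It assumes an RDMA-capable NIC is present; the buffer size is an arbitrary placeholder. It only opens a device and registers a memory region, which is the step that lets the NIC read and write that memory directly, keeping the CPU and kernel off the data path.

```python
# Minimal sketch, assuming the pyverbs bindings from rdma-core are installed
# and an RDMA-capable (InfiniBand or RoCE) NIC is present on the host.
from pyverbs.device import get_device_list, Context
from pyverbs.pd import PD
from pyverbs.mr import MR
import pyverbs.enums as e

# Pick the first RDMA device reported by the system (e.g. an mlx5 NIC).
devices = get_device_list()
if not devices:
    raise SystemExit("No RDMA-capable devices found")
ctx = Context(name=devices[0].name.decode())

# A protection domain groups RDMA resources that are allowed to work together.
pd = PD(ctx)

# Register a 4 KiB buffer so the NIC can access it directly; a remote peer
# uses the returned rkey to read or write this memory without involving
# the local CPU or operating system on the data path.
mr = MR(pd, 4096,
        e.IBV_ACCESS_LOCAL_WRITE |
        e.IBV_ACCESS_REMOTE_READ |
        e.IBV_ACCESS_REMOTE_WRITE)

print(f"Registered 4 KiB region, rkey={mr.rkey}, lkey={mr.lkey}")
```

A full transfer would additionally set up queue pairs and exchange the rkey and buffer address with the peer, but the registration above is the core idea: once memory is registered, the hardware moves the data.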
Building on the foundation of RDMA, RoCE (RDMA over Converged Ethernet) is a network protocol that enables RDMA over an Ethernet network. There are two versions of RoCE: RoCE v1, which runs directly on the Ethernet link layer and is therefore confined to a single layer-2 domain, and RoCE v2, which encapsulates RDMA traffic in UDP/IP so it can be routed across layer-3 networks.
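If you want to check which RoCE version your RDMA devices expose, one low-level option on Linux is to read the GID type entries the kernel publishes under sysfs. The sketch below is a generic example under that assumption; device names and the entries reported depend on your NIC and driver, not on Hyperstack.

```python
# Minimal sketch: list RDMA devices and the GID types (e.g. "RoCE v2",
# "IB/RoCE v1") the kernel reports for them via sysfs. Assumes a Linux
# host with RDMA hardware; output varies by NIC and driver.
from pathlib import Path

IB_ROOT = Path("/sys/class/infiniband")

for dev in (sorted(IB_ROOT.iterdir()) if IB_ROOT.exists() else []):
    for port in sorted((dev / "ports").iterdir()):
        types_dir = port / "gid_attrs" / "types"
        if not types_dir.exists():
            continue
        seen = set()
        for entry in types_dir.iterdir():
            try:
                seen.add(entry.read_text().strip())
            except OSError:
                # Unused GID table entries return an error; skip them.
                continue
        print(f"{dev.name} port {port.name}: {', '.join(sorted(seen)) or 'n/a'}")
```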
As modern applications, particularly in the AI domain, grow in complexity and scale, high-performance, low-latency network solutions have become a necessity. While traditional virtual network interfaces do offer flexibility, they often incur significant overhead that leads to reduced throughput and increased latency.
Although InfiniBand is a viable option, it can be costly to implement and involves vendor lock-in, which significantly complicates integration. For example, Meta (formerly Facebook), the founder of the Open Compute Project, decided to build two 24k-GPU clusters: one with RoCE and another with InfiniBand. They optimised the RoCE cluster for quick build time, while the InfiniBand cluster was designed for full-bisection bandwidth. Interestingly, both clusters were used to train Llama 3, with the RoCE cluster being employed for training the largest model.
However, the challenges of large-scale AI model training extend far beyond mere connectivity. They include hardware reliability, fast recovery from failures, efficient preservation of the training state and optimal connectivity between GPUs. These factors must be addressed to realise the full potential of such workloads.
In cloud computing, where performance and scalability are paramount, SR-IOV offers numerous advantages that allow organisations to fully leverage their cloud environments, including near-native network performance, reduced CPU overhead, lower latency, higher throughput and more efficient utilisation of the underlying hardware.
At Hyperstack, we believe that SR-IOV, especially with RoCE v2, is a game-changer for cloud networking, which is why we are extending our offering beyond our existing Supercloud InfiniBand solution. The latest collaboration between Arista Networks and NVIDIA, along with the increasing demand for high-performance networking solutions, strengthens our stance. Our implementation of high-performance Ethernet networks for VMs leverages the power of SR-IOV to deliver unparalleled performance, reliability and efficiency. With the capabilities of RoCE v2, we enable direct access to the underlying hardware, bypassing the traditional software-based network stack. This direct path eliminates unnecessary overhead, resulting in significantly reduced latency and increased throughput, all of which are critical for latency-sensitive applications and data-intensive workloads. Whether you're running AI inference, AI fine-tuning or real-time data analytics, our High-Performance Ethernet Networks for N3 Flavors with SR-IOV-enabled networks are designed to meet your specific requirements.
Check out our next post "SR-IOV Series Part 2: Technical Details of RDMA, RoCE and Performance Benchmarks".
Sign up now to explore our platform and see how it can benefit you.
Single Root I/O Virtualisation (SR-IOV) is a technology that enables a single physical network device to be shared among multiple VMs while maintaining near-native performance.
RoCE (RDMA over Converged Ethernet) leverages RDMA technology to minimise CPU involvement and reduce overhead, leading to significantly lower latency and higher throughput.
SR-IOV is relevant for modern applications, especially in AI, due to the increasing demand for high-performance, low-latency network solutions to handle complex workloads and large-scale model training.
By allowing direct access to the network hardware, SR-IOV can achieve throughput levels of up to 400 Gbps, compared to 8.5-17 Gbps with traditional Ethernet without SR-IOV.
Hyperstack's implementation of high-performance Ethernet networks with SR-IOV and RoCE v2 delivers unparalleled performance, reliability and efficiency while being cost-effective and accessible to organisations of all sizes.