The huge scale and complexity of modern AI models require equally massive computational resources. As AI models grow in size, they demand more processing power, faster data transfer rates and efficient resource utilisation. This is where the importance of infrastructure comes into play. The right infrastructure can mean the difference between a breakthrough AI model that transforms industries and one that remains perpetually in the training phase. It affects everything from the speed of model training and the accuracy of results to the cost-effectiveness of AI operations and the ability to scale research efforts.
For companies like Meta, which are leading AI research and development, having state-of-the-art infrastructure is not just an advantage; it's a prerequisite for success. It allows them to train larger AI models, process more data and iterate faster, which translates into more advanced AI and machine learning models that can be deployed across their range of products and services. Today, we'll explore Meta's case study on RDMA (Remote Direct Memory Access) over Ethernet for high-performance AI training clusters.
Meta's AI infrastructure demands exceptional scalability to handle large-scale AI models like LLaMA 2 and LLaMA 3, whose training involves thousands of GPUs. To achieve the necessary performance levels, Meta has implemented both InfiniBand and Ethernet-based RDMA solutions, with each cluster packing an impressive 24,000 NVIDIA H100 GPUs.
InfiniBand offers low latency and high throughput, which makes it a strong candidate for intensive AI tasks. However, it comes with higher costs and increased complexity. On the other hand, RoCE v2 can leverage existing Ethernet infrastructure while providing RDMA capabilities, which leads to cost savings and easier integration. The trade-off? RoCE faces challenges with congestion management and maintaining lossless communication, which can impact performance during high traffic loads.
The primary tasks handled by these clusters include training large language models such as LLaMA 2 and LLaMA 3 at scale.
To measure the performance and efficiency of these clusters, Meta monitors several key performance indicators (KPIs) spanning both training performance and network behaviour.
One of the most significant challenges with RoCE is ensuring lossless transport. This requires extensive buffering and complex congestion control mechanisms like Priority Flow Control (PFC) and Explicit Congestion Notification (ECN). While these mechanisms help mitigate packet loss and manage traffic efficiently, they can lead to issues such as head-of-line blocking and PFC storms.
To address these challenges, Meta has implemented several optimisations to how these mechanisms are configured and tuned across its clusters.
PFC is designed to create a lossless Ethernet environment by pausing packet transmission when a receiver's buffer fills up. However, it can cause problems of its own, such as head-of-line blocking, where unrelated traffic sharing the paused priority class stalls behind the congested flow, and PFC storms, where pause frames cascade across the network and throttle large parts of the fabric.
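To make the pause mechanism concrete, here is a minimal Python sketch of a PFC-style queue. Everything in it (the `PfcQueue` class, the XOFF/XON thresholds, the flow names) is an illustrative assumption rather than Meta's implementation; real PFC runs per priority class in switch and NIC hardware.

```python
# Minimal PFC-style pause/resume sketch (illustrative only, not Meta's setup).
from collections import deque

XOFF_THRESHOLD = 8  # buffer depth at which the receiver sends a PAUSE frame
XON_THRESHOLD = 4   # depth at which it tells the sender to resume

class PfcQueue:
    """A toy receiver buffer for one Ethernet priority class."""
    def __init__(self):
        self.buffer = deque()
        self.paused = False

    def enqueue(self, packet):
        # Accept a packet; pause the sender once the buffer is nearly full.
        self.buffer.append(packet)
        if len(self.buffer) >= XOFF_THRESHOLD:
            self.paused = True  # PAUSE frame sent for this priority class

    def drain(self, n=1):
        # Consume packets; resume the sender once the buffer has drained.
        for _ in range(min(n, len(self.buffer))):
            self.buffer.popleft()
        if self.paused and len(self.buffer) <= XON_THRESHOLD:
            self.paused = False  # sender may transmit again

queue = PfcQueue()
for i in range(10):
    if not queue.paused:
        queue.enqueue(f"flow-A pkt {i}")
    else:
        # Head-of-line blocking: flow B shares the paused priority class,
        # so it stalls even though it did not cause the congestion.
        print(f"class paused; flow-B pkt {i} is blocked too")
queue.drain(5)
print("still paused after draining:", queue.paused)
```

Running it shows flow B's packets stalling as soon as flow A fills the shared buffer, which is exactly the head-of-line blocking described above.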
ECN helps manage network congestion by marking packets instead of dropping them. When a packet is marked, the sender is notified to reduce its transmission rate, which maintains throughput without significant packet loss. Making ECN effective, however, requires careful tuning of marking thresholds and congestion-control parameters across the entire fabric.
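Below is a simplified Python model of the marking-and-reaction loop, loosely in the spirit of DCQCN-style congestion control. The marking threshold, rate constants and function names are hypothetical values chosen only to show the mechanism, not Meta's production settings.

```python
# Toy ECN marking and sender reaction (DCQCN-flavoured; constants are made up).
MARK_THRESHOLD_KB = 50            # queue depth above which the switch marks packets
MIN_RATE, MAX_RATE = 1.0, 100.0   # sender rate bounds in Gbps

def switch_marks(queue_depth_kb):
    # The switch marks packets instead of dropping them once the queue is deep.
    return queue_depth_kb > MARK_THRESHOLD_KB

def sender_react(rate_gbps, ecn_marked):
    # Multiplicative decrease on a mark, gentle additive increase otherwise.
    if ecn_marked:
        return max(MIN_RATE, rate_gbps * 0.5)
    return min(MAX_RATE, rate_gbps + 5.0)

rate = 100.0
for depth in [10, 30, 60, 80, 40, 20]:  # synthetic queue-depth samples in KB
    rate = sender_react(rate, switch_marks(depth))
    print(f"queue={depth:>3} KB -> sender rate {rate:.1f} Gbps")
```

The key property is visible in the output: the sender backs off sharply when the queue is deep and recovers gradually once it drains, keeping the link busy without forcing drops.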
Meta is also exploring new protocols such as Ultra Ethernet Transport (UET), which promises improvements like multipath packet spraying and out-of-order packet delivery. These features aim to overcome current RDMA limitations and provide more robust solutions for large-scale AI workloads.
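As a rough intuition for these two features, the following Python toy sprays the packets of one message across several unequal-latency paths and lets the receiver reassemble them by sequence number. It is an illustrative sketch, not the UET protocol itself.

```python
# Toy multipath packet spraying with out-of-order reassembly (not real UET).
import random

random.seed(0)
NUM_PATHS = 4
SEQS = list(range(12))  # sequence numbers of a single message

# Spray: packets go round-robin across paths, each path with its own delay.
path_delay = [random.uniform(1.0, 3.0) for _ in range(NUM_PATHS)]
arrival_time = {seq: path_delay[seq % NUM_PATHS] + 0.01 * seq for seq in SEQS}

# The receiver sees packets in arrival order, potentially out of sequence...
arrival_order = sorted(SEQS, key=lambda seq: arrival_time[seq])
print("arrival order:", arrival_order)

# ...and reassembles by sequence number, so no single slow path stalls the rest.
reassembled = sorted(arrival_order)
assert reassembled == SEQS
print("reassembled:  ", reassembled)
```

Because the receiver tolerates out-of-order arrival, a slow path delays only its own packets instead of stalling the whole message, which is the intuition behind packet spraying.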
In conclusion, the race to build more powerful and efficient AI infrastructure continues, with networking technologies playing a central role. As we've seen from Meta's case study, the choice between InfiniBand and Ethernet-based RDMA is not always straightforward but rather a careful balance of performance, cost and scalability considerations. Meta's AI clusters show how the cutting edge of AI infrastructure is designed to support the most demanding AI workloads. With a combination of advanced hardware, sophisticated networking and optimised software, these clusters deliver high performance, scalability and efficiency for training and deploying large AI models.
Interestingly, Meta sees benefits in both InfiniBand and RoCE technologies – InfiniBand for its reliability and RoCE for its efficiency. This aligns with our own approach at Hyperstack where we offer InfiniBand for our Supercloud and SR-IOV for other Hyperstack products. We understand that implementing advanced networking technologies like SR-IOV (Single Root I/O Virtualisation) can be complex. That's why we'll be providing configurations for SR-IOV enabled setups to optimise your AI workloads. You can also create a support ticket requesting SR-IOV configuration assistance.
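For the curious, the snippet below sketches the generic Linux mechanism behind SR-IOV: exposing virtual functions (VFs) through the NIC's sysfs entries. The interface name and VF count are placeholder assumptions, it needs root privileges, and it is not Hyperstack's official configuration; for that, raise a support ticket as mentioned above.

```python
# Sketch: enabling SR-IOV VFs via Linux sysfs (placeholder names; run as root).
from pathlib import Path

IFACE = "eth0"      # assumption: your SR-IOV-capable NIC's interface name
REQUESTED_VFS = 4   # assumption: how many virtual functions you want

device = Path(f"/sys/class/net/{IFACE}/device")
total_vfs = int((device / "sriov_totalvfs").read_text())
print(f"{IFACE} supports up to {total_vfs} virtual functions")

# The kernel requires resetting to 0 before changing a non-zero VF count.
numvfs = device / "sriov_numvfs"
if int(numvfs.read_text()) != 0:
    numvfs.write_text("0")
numvfs.write_text(str(min(REQUESTED_VFS, total_vfs)))
print(f"enabled {numvfs.read_text().strip()} VFs on {IFACE}")
```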
Get started with Hyperstack to explore the full potential of your AI projects. Sign up now!