The huge scale and complexity of modern AI models require equally massive computational resources. As AI models grow in size, they demand more processing power, faster data transfer rates and efficient resource utilisation. This is where the importance of infrastructure comes into play. The right infrastructure can mean the difference between a breakthrough AI model that transforms industries and one that remains perpetually in the training phase. It affects everything from the speed of model training and the accuracy of results to the cost-effectiveness of AI operations and the ability to scale research efforts.
For companies like Meta, which are leading AI research and development, having state-of-the-art infrastructure is not just an advantage; it's a prerequisite for success. It allows them to train larger AI models, process more data and iterate faster, which translates into more advanced AI and machine learning models that can be deployed across their range of products and services. Today, we'll explore Meta's case study on RDMA (Remote Direct Memory Access) over Ethernet for high-performance AI training clusters.
Meta's AI infrastructure demands exceptional scalability to handle large-scale AI models like LLaMA 2 and LLaMA 3, whose training involves thousands of GPUs. To achieve the necessary performance levels, Meta has implemented both InfiniBand and Ethernet-based RDMA solutions, with each cluster packing an impressive 24,000 NVIDIA H100 GPUs.
InfiniBand offers low latency and high throughput, which makes it a strong candidate for intensive AI tasks. However, it comes with higher costs and increased complexity. On the other hand, RoCE v2 can leverage existing Ethernet infrastructure while providing RDMA capabilities, which leads to cost savings and easier integration. The trade-off? RoCE faces challenges with congestion management and maintaining lossless communication, which can impact performance during high traffic loads.
The primary tasks handled by these clusters include training large language models such as LLaMA 2 and LLaMA 3 at scale.
To measure the performance and efficiency of these clusters, Meta monitors several key performance indicators (KPIs) spanning both training performance and network behaviour.
One of the most significant challenges with RoCE is ensuring lossless transport. This requires extensive buffering and complex congestion control mechanisms like Priority Flow Control (PFC) and Explicit Congestion Notification (ECN). While these mechanisms help mitigate packet loss and manage traffic efficiently, they can lead to issues such as head-of-line blocking and PFC storms.
To address these challenges, Meta has implemented several optimisations to how these mechanisms are configured and tuned across its clusters.
PFC is designed to create a lossless Ethernet environment by pausing packet transmission when a receiver's buffer fills up. However, it can cause problems of its own, such as head-of-line blocking, where unrelated traffic sharing the paused priority class stalls behind the congested flow, and PFC storms, where pause frames cascade across the network and throttle large parts of the fabric.
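To make the pause mechanism concrete, here is a minimal Python sketch of a PFC-style queue. Everything in it (the `PfcQueue` class, the XOFF/XON thresholds, the flow names) is an illustrative assumption rather than Meta's implementation; real PFC runs per priority class in switch and NIC hardware.

```python
# Minimal PFC-style pause/resume sketch (illustrative only, not Meta's setup).
from collections import deque

XOFF_THRESHOLD = 8  # buffer depth at which the receiver sends a PAUSE frame
XON_THRESHOLD = 4   # depth at which it tells the sender to resume

class PfcQueue:
    """A toy receiver buffer for one Ethernet priority class."""
    def __init__(self):
        self.buffer = deque()
        self.paused = False

    def enqueue(self, packet):
        # Accept a packet; pause the sender once the buffer is nearly full.
        self.buffer.append(packet)
        if len(self.buffer) >= XOFF_THRESHOLD:
            self.paused = True  # PAUSE frame sent for this priority class

    def drain(self, n=1):
        # Consume packets; resume the sender once the buffer has drained.
        for _ in range(min(n, len(self.buffer))):
            self.buffer.popleft()
        if self.paused and len(self.buffer) <= XON_THRESHOLD:
            self.paused = False  # sender may transmit again

queue = PfcQueue()
for i in range(10):
    if not queue.paused:
        queue.enqueue(f"flow-A pkt {i}")
    else:
        # Head-of-line blocking: flow B shares the paused priority class,
        # so it stalls even though it did not cause the congestion.
        print(f"class paused; flow-B pkt {i} is blocked too")
queue.drain(5)
print("still paused after draining:", queue.paused)
```

Running it shows flow B's packets stalling as soon as flow A fills the shared buffer, which is exactly the head-of-line blocking described above.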
ECN helps manage network congestion by marking packets instead of dropping them. When a packet is marked, the sender is notified to reduce its transmission rate, which maintains throughput without significant packet loss. Making ECN effective, however, requires careful tuning of marking thresholds and congestion-control parameters across the entire fabric.
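Below is a simplified Python model of the marking-and-reaction loop, loosely in the spirit of DCQCN-style congestion control. The marking threshold, rate constants and function names are hypothetical values chosen only to show the mechanism, not Meta's production settings.

```python
# Toy ECN marking and sender reaction (DCQCN-flavoured; constants are made up).
MARK_THRESHOLD_KB = 50            # queue depth above which the switch marks packets
MIN_RATE, MAX_RATE = 1.0, 100.0   # sender rate bounds in Gbps

def switch_marks(queue_depth_kb):
    # The switch marks packets instead of dropping them once the queue is deep.
    return queue_depth_kb > MARK_THRESHOLD_KB

def sender_react(rate_gbps, ecn_marked):
    # Multiplicative decrease on a mark, gentle additive increase otherwise.
    if ecn_marked:
        return max(MIN_RATE, rate_gbps * 0.5)
    return min(MAX_RATE, rate_gbps + 5.0)

rate = 100.0
for depth in [10, 30, 60, 80, 40, 20]:  # synthetic queue-depth samples in KB
    rate = sender_react(rate, switch_marks(depth))
    print(f"queue={depth:>3} KB -> sender rate {rate:.1f} Gbps")
```

The key property is visible in the output: the sender backs off sharply when the queue is deep and recovers gradually once it drains, keeping the link busy without forcing drops.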
Meta is also exploring new protocols such as Ultra Ethernet Transport (UET), which promises improvements like multipath packet spraying and out-of-order packet delivery. These features aim to overcome current RDMA limitations and provide more robust solutions for large-scale AI workloads.
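As a rough intuition for these two features, the following Python toy sprays the packets of one message across several unequal-latency paths and lets the receiver reassemble them by sequence number. It is an illustrative sketch, not the UET protocol itself.

```python
# Toy multipath packet spraying with out-of-order reassembly (not real UET).
import random

random.seed(0)
NUM_PATHS = 4
SEQS = list(range(12))  # sequence numbers of a single message

# Spray: packets go round-robin across paths, each path with its own delay.
path_delay = [random.uniform(1.0, 3.0) for _ in range(NUM_PATHS)]
arrival_time = {seq: path_delay[seq % NUM_PATHS] + 0.01 * seq for seq in SEQS}

# The receiver sees packets in arrival order, potentially out of sequence...
arrival_order = sorted(SEQS, key=lambda seq: arrival_time[seq])
print("arrival order:", arrival_order)

# ...and reassembles by sequence number, so no single slow path stalls the rest.
reassembled = sorted(arrival_order)
assert reassembled == SEQS
print("reassembled:  ", reassembled)
```

Because the receiver tolerates out-of-order arrival, a slow path delays only its own packets instead of stalling the whole message, which is the intuition behind packet spraying.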
In conclusion, the race to build more powerful and efficient AI infrastructure continues, with networking technologies playing a central role. As we've seen from Meta's case study, the choice between InfiniBand and Ethernet-based RDMA is not always straightforward but rather a careful balance of performance, cost and scalability considerations. Meta's AI clusters show how the cutting edge of AI infrastructure is designed to support the most demanding AI workloads. With a combination of advanced hardware, sophisticated networking and optimised software, these clusters deliver high performance, scalability and efficiency for training and deploying large AI models.
Interestingly, Meta sees benefits in both InfiniBand and RoCE technologies – InfiniBand for its reliability and RoCE for its efficiency. This aligns with our own approach at Hyperstack where we offer InfiniBand for our Supercloud and SR-IOV for other Hyperstack products. We understand that implementing advanced networking technologies like SR-IOV (Single Root I/O Virtualisation) can be complex. That's why we'll be providing configurations for SR-IOV enabled setups to optimise your AI workloads. You can also create a support ticket requesting SR-IOV configuration assistance.
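For the curious, the snippet below sketches the generic Linux mechanism behind SR-IOV: exposing virtual functions (VFs) through the NIC's sysfs entries. The interface name and VF count are placeholder assumptions, it needs root privileges, and it is not Hyperstack's official configuration; for that, raise a support ticket as mentioned above.

```python
# Sketch: enabling SR-IOV VFs via Linux sysfs (placeholder names; run as root).
from pathlib import Path

IFACE = "eth0"      # assumption: your SR-IOV-capable NIC's interface name
REQUESTED_VFS = 4   # assumption: how many virtual functions you want

device = Path(f"/sys/class/net/{IFACE}/device")
total_vfs = int((device / "sriov_totalvfs").read_text())
print(f"{IFACE} supports up to {total_vfs} virtual functions")

# The kernel requires resetting to 0 before changing a non-zero VF count.
numvfs = device / "sriov_numvfs"
if int(numvfs.read_text()) != 0:
    numvfs.write_text("0")
numvfs.write_text(str(min(REQUESTED_VFS, total_vfs)))
print(f"enabled {numvfs.read_text().strip()} VFs on {IFACE}")
```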
Get started with Hyperstack to explore the full potential of your AI projects. Sign up now!