Welcome back to our SR-IOV series! In our previous post, we promised to dig into the technical aspects of this technology. Today, we offer a comprehensive look at Remote Direct Memory Access (RDMA) and its implementations, along with some benchmark results. Let's get started!
To recap, RDMA is a technology that allows networked computers to transfer data directly between their main memories, bypassing the processor, cache and operating system, which improves throughput and performance. It is similar in spirit to Direct Memory Access (DMA), but applied across networked systems. RDMA is particularly beneficial for networking and storage applications because it offers faster data transfer rates and lower latency, both crucial factors for demanding workloads such as AI.
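To make this a bit more concrete, here is a minimal sketch of how an application discovers RDMA-capable devices through the verbs API (libibverbs from rdma-core); the same interface is used whether the underlying fabric is InfiniBand or RoCE. This is illustrative only: it stops at device enumeration and omits queue pairs, memory registration and actual data transfer.

```c
/* Minimal sketch: enumerate RDMA-capable devices via libibverbs.
 * Build with: gcc list_rdma.c -o list_rdma -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;

    /* Ask the verbs library for all RDMA devices visible to this host. */
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (!devices) {
        perror("ibv_get_device_list");
        return 1;
    }

    printf("Found %d RDMA device(s)\n", num_devices);
    for (int i = 0; i < num_devices; i++) {
        /* The node GUID is reported in network byte order. */
        printf("  %-12s node GUID 0x%016llx\n",
               ibv_get_device_name(devices[i]),
               (unsigned long long)ibv_get_device_guid(devices[i]));
    }

    ibv_free_device_list(devices);
    return 0;
}
```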
RDMA can be implemented using two primary technologies: InfiniBand and RDMA over Converged Ethernet (RoCE).
The key features of RDMA over Converged Ethernet (RoCE) are kernel bypass and workload offloading to the network adapter, which reduce CPU usage and increase data transfer speeds. The table below compares InfiniBand, RoCE and plain TCP/IP across several dimensions:
| Feature | InfiniBand | RoCE | TCP/IP |
|-------------|------------|------|--------|
| Scalability | ☑️ | | |
| Performance | ☑️ | ☑️ | |
| Stability | ☑️ | | |
| Cost | | ☑️ | ☑️ |
| Management | | ☑️ | ☑️ |
The comparison table above uses checkmarks to indicate which technology has the edge for each feature. Based on this, let's break down the comparison:
Scalability: InfiniBand is designed for high scalability, particularly in large data center environments. It can support thousands of nodes with low latency and high bandwidth. RoCE, while scalable, may face some limitations in very large deployments compared to InfiniBand.
Performance: Both InfiniBand and RoCE use RDMA, which allows direct data transfer between the memory of different computers without involving the operating system. This results in lower latency and higher throughput than TCP/IP, which carries more overhead due to protocol processing.
Stability: InfiniBand is known for its high stability and reliability, particularly in demanding environments like high-performance computing clusters. It has built-in error correction and flow control mechanisms. TCP/IP is also stable but may not match InfiniBand in high-stress scenarios. RoCE, being a newer technology, may face some stability challenges in certain implementations.
Cost: InfiniBand typically requires specialised hardware, which can be more expensive than the standard Ethernet equipment used for RoCE and TCP/IP. RoCE leverages existing Ethernet infrastructure, making it more cost-effective. TCP/IP runs on standard networking equipment, usually making it the most affordable option.
Management: TCP/IP is the most widely used networking protocol, with extensive tools and expertise available for management. RoCE, being based on Ethernet, benefits from some familiar management tools but may require additional expertise for RDMA configuration. InfiniBand often requires specialised knowledge and tools for management, making it potentially more complex to administer.
Now let's compare the RoCE versions. RoCE v1 runs directly on the Ethernet link layer with its own EtherType, so its traffic cannot be routed beyond a single Layer 2 domain. RoCE v2 encapsulates RDMA traffic in UDP/IP packets, which makes it routable across subnets and is why it is the variant most commonly deployed today.
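If you want to see which RoCE version the GIDs on a port are configured for, one option is to read the per-GID type files in sysfs. The sketch below assumes a Linux kernel recent enough to expose gid_attrs in sysfs and uses "mlx5_0" as a placeholder device name; both are assumptions, not part of the original post.

```c
/* Hedged sketch: print the GID types ("IB/RoCE v1" or "RoCE v2") advertised
 * by port 1 of an RDMA device, by reading sysfs. The layout
 * /sys/class/infiniband/<dev>/ports/<port>/gid_attrs/types/<index> depends
 * on kernel version, and "mlx5_0" is a placeholder device name. */
#include <stdio.h>
#include <string.h>

#define DEV  "mlx5_0"
#define PORT 1

int main(void)
{
    char path[256], line[64];

    /* GID tables are small; scanning a generous fixed range is enough here. */
    for (int idx = 0; idx < 256; idx++) {
        snprintf(path, sizeof(path),
                 "/sys/class/infiniband/%s/ports/%d/gid_attrs/types/%d",
                 DEV, PORT, idx);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                /* index not populated or not present */
        if (fgets(line, sizeof(line), f)) {
            line[strcspn(line, "\n")] = '\0';
            printf("GID %d: %s\n", idx, line);
        }
        fclose(f);
    }
    return 0;
}
```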
To achieve optimal network performance, it is imperative to understand how various configurations affect throughput and latency. We conducted a series of tests using iPerf, a widely used network testing tool, to measure the bandwidth and performance of different network setups. Our focus was on comparing Baremetal with the OpenFabrics Enterprise Distribution (OFED) stack, Virtual Machines (VMs) with different Maximum Transmission Unit (MTU) settings, and VMs configured with SR-IOV (Single Root I/O Virtualisation), VF-LAG (Virtual Function Link Aggregation Group) and NUMA (Non-Uniform Memory Access). The settings compared are as follows:
1. Baremetal with OpenFabrics Enterprise Distribution (OFED) stack
Provides direct access to hardware, leading to minimal overhead and high performance.
2. Virtual Machines (VM) with MTU 1500 (default Ethernet frame size)
The standard Ethernet MTU of 1500 bytes covers the IP headers and payload of each frame.
3. Virtual Machines (VM) with MTU 9000 (Jumbo frames)
Jumbo frames increase the MTU to 9000 bytes, allowing larger packets to be sent, which can reduce per-packet overhead and improve throughput (see the MTU check sketch after this list).
4. Virtual Machines (VM) with SR-IOV VF-LAG and NUMA
SR-IOV gives the VM direct access to a virtual function of the NIC, VF-LAG aggregates the adapter's physical ports so virtual functions benefit from bonded bandwidth and redundancy, and NUMA-aware placement keeps the VM's vCPUs and memory on the same socket as the NIC, minimising virtualisation overhead.
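Configurations 2 and 3 differ only in MTU, so it is worth confirming what a guest interface is actually using. Here is a small illustrative sketch that queries an interface's current MTU via the SIOCGIFMTU ioctl on Linux; "eth0" is a placeholder interface name, not one from our setup.

```c
/* Hedged sketch: read the current MTU of a network interface to confirm
 * whether it is running the default 1500 bytes or 9000-byte jumbo frames.
 * "eth0" is a placeholder; replace it with the interface under test. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
        perror("SIOCGIFMTU");
        close(fd);
        return 1;
    }

    printf("%s MTU: %d\n", ifr.ifr_name, ifr.ifr_mtu);
    close(fd);
    return 0;
}
```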
We measured network performance with iPerf across each of these configurations, using thread counts of 1, 2, 4, 8, 10, 12, 16, 20 and 24 parallel streams. This comprehensive approach allowed us to gauge how the different setups respond to varying levels of concurrent network activity. A sketch of the sweep is shown below.
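For reference, a sweep along these lines could be driven as follows. This is a sketch only: it assumes iperf3 is installed, a server is already running (iperf3 -s), and 192.0.2.10 is a placeholder address; the actual invocations and durations used for the numbers below may differ.

```c
/* Hedged sketch: run an iperf3 client once per thread count used in the post.
 * Assumes an iperf3 server is listening at SERVER_ADDR (placeholder). */
#include <stdio.h>
#include <stdlib.h>

#define SERVER_ADDR "192.0.2.10"   /* placeholder documentation address */

int main(void)
{
    const int threads[] = {1, 2, 4, 8, 10, 12, 16, 20, 24};
    const size_t n = sizeof(threads) / sizeof(threads[0]);
    char cmd[256];

    for (size_t i = 0; i < n; i++) {
        /* -P: number of parallel client streams, -t: test duration in seconds */
        snprintf(cmd, sizeof(cmd),
                 "iperf3 -c %s -P %d -t 30", SERVER_ADDR, threads[i]);
        printf("Running: %s\n", cmd);
        if (system(cmd) != 0)
            fprintf(stderr, "iperf3 run with %d streams failed\n", threads[i]);
    }
    return 0;
}
```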
Take a look at our measurements below. All numbers are in Gbps, and the rows correspond to the four configurations listed above:
| Configuration | 1T | 2T | 4T | 8T | 10T | 12T | 16T | 20T | 24T |
|---|---|---|---|---|---|---|---|---|---|
| 1. Baremetal with OFED stack | 18.9 | 39.2 | 93.9 | 176 | 190 | 225 | 313 | 381 | 395 |
| 2. VM with MTU 1500 | 10.5 | 8.89 | 8.68 | 8.52 | 8.45 | 8.41 | 8.4 | 8.23 | 8.23 |
| 3. VM with MTU 9000 | 16.2 | 17 | 14 | 14.1 | 13.2 | 13.8 | 14 | 13.3 | 13.4 |
| 4. VM with SR-IOV VF-LAG and NUMA | 37.1 | 67.8 | 122 | 199 | 197 | 262 | 290 | 331 | 349 |
In the performance benchmark results, we observed the following: Baremetal with the OFED stack scaled steadily with thread count, reaching 395 Gbps at 24 threads. The VM with the default MTU of 1500 stayed at roughly 8-10 Gbps and did not benefit from additional threads, while jumbo frames (MTU 9000) lifted throughput only modestly, to around 13-17 Gbps. The VM with SR-IOV VF-LAG and NUMA came close to baremetal, peaking at 349 Gbps, and even outperformed baremetal in the 1-12 thread range.
Don't miss our next post in the “SR-IOV Series” for a real-world example of SR-IOV in action! See how Meta leverages this technology for optimal performance. Get started now to explore our platform for more.
RDMA allows direct memory access between networked computers, improving throughput and performance.
RoCE uses kernel bypass and workload offloading to reduce CPU usage and increase data transfer speeds.
Baremetal with OFED Stack showed the highest throughput, peaking at 395 Gbps with 24 threads.
SR-IOV closely follows baremetal performance, outperforming it in the 1-12 thread range.