💾 Archived View for dctrud.randomroad.net › gemlog › 20200209-infiniband-part1.gmi captured on 2022-04-29 at 12:25:21. Gemini links have been rewritten to link to archived content
Infiniband (IB) is a high-speed, low-latency networking fabric that sees heavy use in HPC/supercomputing. It's used in HPC because massively parallel programs running over large numbers of compute nodes depend heavily on being able to send and receive messages quickly (low latency), as well as requiring high bandwidth. IB has a large latency advantage over Ethernet, and also offers high bandwidth. The current versions of the Mellanox ConnectX IB cards run at up to 200Gbps line speed, and cost mega-bucks. Infiniband switches are also pricey. It's possible, however, to set up a 2-node FDR Infiniband network at home for a little over $100, with some (significant) performance caveats. I come from a background in HPC and work on a project heavily used on some of the largest supercomputers, so I have a personal and work interest in having IB around in my home office/lab.
IB HBAs (Host Bus Adapter cards) can be surprisingly cheap on eBay. I picked up 2 ConnectX3 dual port cards for $80. ConnectX3 cards max out at FDR IB speeds (56Gbps) and use the 'mlx4' driver in Linux. This driver is supported upstream by Mellanox, who are still selling the slightly newer ConnectX3-Pro cards. This means it should be possible to use a ConnectX3 usefully for some time yet, with something close to the 'production' software stack that you'd see on a supercomputer.
When buying cards be careful, as the ConnectX range includes cards that support Ethernet only (40GbE for ConnectX3), as well as those which support both Infiniband and Ethernet. These dual-mode VPI cards, especially the dual-port ones, are great. You can set one port to IB and one port to 40GbE if you want to play with both forms of networking.
Cards are relatively cheap and easy to get hold of compared to useful cables. IB cables come in copper and fiber forms, with integrated QSFP connectors. Cables are rated for a specific speed, with newer and faster ones getting expensive quickly. It is not unusual to spend hundreds of dollars on a single new cable. Setting up a 2-node network doesn't require a switch, as a single cable can be used to directly connect two cards together. The closer your machines are together, the cheaper it'll be. I was able to pick up a 1m FDR10 cable for $22. 'FDR10' is the speed rating. It indicates this cable can't do the full 56Gbps speed of FDR IB. Instead, it'll be limited to 40Gbps. Each QSFP cable carries 4 physical lanes, of 10Gbps each in this case (FDR10), so we get 40Gbps max.
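The lanes-times-rate arithmetic can be sketched in a few lines of Python. This is only an illustration: the per-lane signalling rates and line encodings in the table are the commonly quoted figures for each IB generation, not numbers taken from my cards or cables, and the function name is made up for the example.

```python
# Commonly quoted per-lane signalling rates (Gbps) and line encodings
# (payload bits, total bits) for each IB generation. These are assumptions
# for illustration, not values read from real hardware.
RATES = {
    "DDR": (5.0, (8, 10)),
    "QDR": (10.0, (8, 10)),
    "FDR10": (10.3125, (64, 66)),
    "FDR": (14.0625, (64, 66)),
}

LANES = 4  # a QSFP cable carries four lanes


def data_rate_gbps(name, lanes=LANES):
    """Usable data rate: lanes x signalling rate x encoding efficiency."""
    signalling, (payload, total) = RATES[name]
    return lanes * signalling * payload / total


for name in RATES:
    print(f"{name}: {data_rate_gbps(name):.2f} Gbps")
```

With these figures FDR10 comes out at 40Gbps, matching the cable rating. The headline 56Gbps for FDR is closer to the raw signalling rate across the four lanes, so the usable data rate worked out this way is slightly lower.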
For my setup the fact that my cable can't work at the card's full speed doesn't matter, due to the limitations of my host systems. I could go cheaper and buy older QDR or DDR cables, but I wanted to be able to run at 40Gbps rates with FDR10 if I eventually get the cards into systems that support it (read on...).
Don't be tempted to go really really cheap and buy something just listed as "Infiniband" and not DDR/QDR/FDR. Original IB uses a different connector. You will need a QSFP cable for a ConnectX3 card.
My home lab IB network is between my main Ryzen 1600X desktop machine and a used Dell Optiplex 3020 Core i5-4590 desktop I bought from eBay for $90 a while ago.
In the Ryzen desktop I have a Radeon GPU in the main PCIe 3.0 x16 socket, which is a high speed link direct to the CPU. On the B350 chipset motherboard there's one additional socket suitable for the IB card. However, it's driven by the chipset at PCIe 2.0 speeds, and is x4 (4 lanes). Although the ConnectX3 card is fairly old now, it's a serious piece of kit that needs PCIe 3.0 x8 to drive it at full speed. 'dmesg' on the Ryzen desktop shows what I'll be limited to:
[ 16.103394] mlx4_core 0000:08:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x4 link at 0000:03:04.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
That 16.000 Gb/s PCIe bandwidth means I won't get close to the limit of my 40Gbps FDR10 cable, let alone the 56Gbps full-speed of the cards. It's still a lot faster than my normal 1GbE though!
The cheap old Optiplex 3020 is actually a bit better off, as I can plug the card into the 'primary' slot that would usually be used for a GPU, and stick with the onboard graphics.
[ 8.358562] mlx4_core 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 5 GT/s x8 link at 0000:00:01.0 (capable of 63.008 Gb/s with 8 GT/s x8 link)
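The figures in those two log lines can be reproduced from the link speed, the lane count and the encoding overhead of each PCIe generation. Here is a minimal Python sketch of that arithmetic. The function and table names are made up for the example, and the per-lane integer rounding is an assumption that happens to reproduce the 63.008 Gb/s figure in the log.

```python
# Per-lane raw rate in Mb/s and line encoding (payload bits, total bits),
# keyed by the transfer rate in GT/s reported in the log lines.
ENCODING = {
    2.5: (2500, 8, 10),     # PCIe 1.x, 8b/10b
    5.0: (5000, 8, 10),     # PCIe 2.0, 8b/10b
    8.0: (8000, 128, 130),  # PCIe 3.0, 128b/130b
}


def pcie_bandwidth_mbps(gt_per_s, lanes):
    """Usable PCIe bandwidth in Mb/s for a given link speed and width."""
    rate, payload, total = ENCODING[gt_per_s]
    # Integer division per lane: an assumption about how the reported
    # number is rounded, chosen because it matches the log output.
    return (rate * payload // total) * lanes


print(pcie_bandwidth_mbps(5.0, 4) / 1000)  # Ryzen chipset slot, 5 GT/s x4
print(pcie_bandwidth_mbps(5.0, 8) / 1000)  # Optiplex primary slot, 5 GT/s x8
print(pcie_bandwidth_mbps(8.0, 8) / 1000)  # what the card wants, 8 GT/s x8
```

The same function is handy when shopping for hardware: plug in the generation and lane count of a candidate slot to see whether it can keep up with the card.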
For my purposes, 16Gbps max is fine. However, if you want to mess around with Infiniband for its performance rather than to simply test compatibility / gain experience, you need to pay attention to the PCIe slots available to you, their generation and number of lanes. It would often work out cheaper to buy relatively old server-grade hardware that has an abundance of PCIe slots and lanes, than to piece together modern desktop hardware that would give you full IB line rate. Another advantage of server hardware is that it's more likely to support SR-IOV. This allows a single IB card to present multiple virtual PCIe devices that can be assigned to virtual machines. If you can use SR-IOV you can have many VMs talking over IB protocols without needing as many physical cards as you have VMs. SR-IOV is making an appearance on some desktop chipsets, but is still rare and it's not often documented as supported/unsupported.
Once the cards are installed the IB cables plug into them with a satisfying click. Be careful not to bend the cables too tightly. Copper IB cables are quite delicate, and if you kink them they are toast. A 1m FDR cable is fairly thin, but if you splash out on something newer like EDR (100Gbps), or buy longer copper cables (up to 7m), then you'll have a very thick, awkward-to-run cable. I can tell you from first-hand experience that running 32 IB cables neatly in a rack of HPC nodes is not fun. It takes a lot of care, as pulling one too tight means a lot of money down the drain even at used prices.
Next time I'll write a bit about setting up the basic software stack.