https://bugzilla.redhat.com/show_bug.cgi?id=1806981
This type of bug is not typical, in my experience.
In a post a month ago I wrote a bit about buying some FDR Infiniband network cards for my home lab setup. I use 2 machines networked with ConnectX3 FDR Infiniband cards to test that things work with IB in the HPC-focused projects I'm involved in, and as a fast general-use network between my workstation and a box acting as a server of sorts. Now let's take a look at the software stack.
If you've been around people using or administering HPC systems, where Infiniband is common and used with large MPI workloads, it might seem like it's very complicated to get an IB network set up correctly. Luckily it's now very straightforward in standard Linux distributions. IB networking requires kernel modules providing a driver for the card, and supporting native and IP protocols. There are also various libraries associated with using the native Infiniband 'verbs', RDMA etc. Many deployments use the OpenFabrics Enterprise Distribution (OFED), which bundles all of these things, packaging them up for enterprise Linux distributions such as Red Hat Enterprise Linux and SUSE Enterprise Server. Mellanox provides their own distribution (MLNX_OFED) which is further tested / optimized for their cards.
OFED and MLNX_OFED are only distributed for long-term supported 'enterprise' Linux distributions. You won't find an official OFED package for e.g. Arch Linux. Luckily all the component parts of OFED are open-source, and are packaged by distributions. If you're reading around on the net, note that Mellanox refers to upstream drivers, rather than those distributed as part of their MLNX_OFED packages, as 'inbox' drivers.
I'm running Fedora 31 on the machines I want to use the Infiniband cards with. Mellanox do distribute a Fedora version of their OFED package, but it lags behind new Fedora releases. The current version of MLNX_OFED supports Fedora 30. Instead, I can install the distribution's drivers and libraries easily with:
sudo yum install libibverbs ucx-ib ucx-rdmacm opensm infiniband-diags
This will bring in a bunch of other packages covering most IB needs. On CentOS or Red Hat Enterprise Linux it's easier still, as there's a group you can install:
sudo yum groupinstall "Infiniband Support"
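Either way, it's worth a quick sanity check that the kernel driver and verbs library can actually see the card. Something like the following should do it (I'm assuming the ibv_* tools come from the libibverbs-utils package on Fedora, which may need installing separately):
# ConnectX3 cards use the mlx4 driver family
lsmod | grep mlx4
# list the verbs devices and their details as seen by libibverbs
ibv_devices
ibv_devinfo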
Later on, when you start doing much with programs that use IB natively, you may get cryptic errors about not being able to allocate memory. Programs performing RDMA transfers etc. need to lock large regions of memory, and the default security limits on most Linux distributions only allow non-root users to lock a small amount:
08:47 PM $ ulimit -a
...
max locked memory       (kbytes, -l) 65536
...
You can increase this limit by setting a higher value in '/etc/security/limits.conf'. For simplicity I'm setting the hard and soft limits to unlimited, with entries like these (scoped to all users with '*'):
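# raise the locked-memory limit for every user
*    soft    memlock    unlimited
*    hard    memlock    unlimited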
Infiniband networks are co-ordinated by something called a subnet manager. This is a process that runs on a machine in the network, or on an IB switch. It discovers the topology of the network and manages the routes for traffic through the network etc. In my simple network consisting of 2 hosts connected with a single IB cable I still need a subnet manager running for the machines to be able to communicate. 'opensm' is the software subnet manager that we can run on a Linux system:
sudo systemctl enable opensm
sudo systemctl start opensm
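To confirm that a subnet manager is actually answering queries on the fabric, the sminfo tool from infiniband-diags can be used (a quick check, assuming the ports have come up):
# query the subnet manager; it should report the SM's lid, guid, priority and state
sudo sminfo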
Now that a subnet manager is running on one of the machines, the lights on the cards of both should be active, and we can check the connection state with the ibstat command:
dave@ythel~> ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.42.5000
        Hardware version: 1
        Node GUID: 0xf4521403007f2a10
        System image GUID: 0xf4521403007f2a13
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40 (FDR10)
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0259486a
                Port GUID: 0xf4521403007f2a11
                Link layer: InfiniBand
We can see my card is now ~Active~ and the link speed is 40Gbps (since I have a cheaper 40Gbps FDR10 cable... though the cards are capable of a 56Gbps FDR link).
The `Base lid: 1` is the local identifier of this host within the network. The `SM lid: 1` shows where the subnet manager is running - on the same machine in this case. The `GUID` values are global IDs that would be important if we had multiple Infiniband networks and routed between them, but are not interesting on a small network.
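Another handy view is iblinkinfo (also from infiniband-diags), which summarises the negotiated width and speed of every link in the fabric, and so gives a quick way to see whether a link came up at FDR10 or full FDR:
# one line per link, showing lids, port numbers, width (e.g. 4X) and speed
sudo iblinkinfo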
We can use the 'ibnetdiscover' command to get an overview of the network, which shows 2 hosts:
dave@ythel~> sudo ibnetdiscover
[sudo] password for dave:
#
# Topology file: generated on Sat Mar 7 21:25:30 2020
#
# Initiated from node f4521403007f2a10 port f4521403007f2a11

vendid=0x2c9
devid=0x1003
sysimgguid=0x2c90300a51963
caguid=0x2c90300a51960
Ca      2 "H-0002c90300a51960"          # "piran mlx4_0"
[1](2c90300a51961)      "H-f4521403007f2a10"[1] (f4521403007f2a11)      # lid 2 lmc 0 "ythel mlx4_0" lid 1 4xFDR10

vendid=0x2c9
devid=0x1003
sysimgguid=0xf4521403007f2a13
caguid=0xf4521403007f2a10
Ca      2 "H-f4521403007f2a10"          # "ythel mlx4_0"
[1](f4521403007f2a11)   "H-0002c90300a51960"[1] (2c90300a51961)         # lid 1 lmc 0 "piran mlx4_0" lid 2 4xFDR10
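If the full topology dump is more than you need, ibhosts prints a one-line summary per host, and ibqueryerrors is useful for spotting flaky links (both are part of infiniband-diags):
# compact list of the channel adapters on the fabric
sudo ibhosts
# per-port error counters; a noisy link usually points to a bad cable
sudo ibqueryerrors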
Infiniband is different from the Ethernet with TCP/IP networking that we are more familiar with. There's a nice overview of how communication works here:
https://blog.zhaw.ch/icclab/infiniband-an-introduction-simple-ib-verbs-program-with-rdma-write/
Luckily we can run an IP network over Infiniband using ~IPoIB~, which allows us to run any software that expects to talk to a host using an IP address and port, as well as programs written to exploit IB's native low-latency verbs and RDMA transfers.
These days an IPoIB interface can be set up easily using NetworkManager, e.g. via the ~nmtui~ command or in the GUI. I chose to network my 2 machines as ~10.1.1.215~ and ~10.1.1.216~, on the ~10.1.1.0/24~ network. Your IB network must use a different subnet to any existing ethernet interfaces, so that traffic is routed correctly. I used ~nmtui~ to add a connection, chose the Infiniband option and entered those IPv4 static IP details. The first port of the cards appears as the device ~ibp1s0~ on both of my machines. ~ib0~ would be the more traditional device name on systems that do not rename net devices by physical location in the system.
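If you prefer the command line, an equivalent connection can be created with nmcli. A sketch for one of my machines, assuming the device really is named ibp1s0 and that the default datagram transport mode is fine (swap the address for the other host):
# create a static IPoIB connection on the first port of the card
nmcli connection add type infiniband ifname ibp1s0 con-name ipoib0 \
    infiniband.transport-mode datagram \
    ipv4.method manual ipv4.addresses 10.1.1.215/24
nmcli connection up ipoib0
# the kernel exposes the current IPoIB transport mode here
cat /sys/class/net/ibp1s0/mode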
After bringing up the IPoIB networking with 'nmtui' I can 'ping 10.1.1.216' from the machine that was set up as '10.1.1.215', and the reverse should work too:
dave@ythel~> ping 10.1.1.216
PING 10.1.1.216 (10.1.1.216) 56(84) bytes of data.
64 bytes from 10.1.1.216: icmp_seq=1 ttl=64 time=0.215 ms
64 bytes from 10.1.1.216: icmp_seq=2 ttl=64 time=0.173 ms
64 bytes from 10.1.1.216: icmp_seq=3 ttl=64 time=0.201 ms
^C
--- 10.1.1.216 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2087ms
rtt min/avg/max/mdev = 0.173/0.196/0.215/0.017 ms
I can test the performance of the IPoIB networking with iperf3:
# Run a server on one machine
dave@ythel~> iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
...
# Client on another machine
dave@piran~> iperf3 -c 10.1.1.215
Connecting to host 10.1.1.215, port 5201
[  5] local 10.1.1.216 port 42710 connected to 10.1.1.215 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.44 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   1.00-2.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   2.00-3.00   sec  1.44 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   3.00-4.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   4.00-5.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   5.00-6.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   6.00-7.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   7.00-8.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   8.00-9.00   sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
[  5]   9.00-10.00  sec  1.43 GBytes  12.3 Gbits/sec    0   3.11 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  14.3 GBytes  12.3 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  14.3 GBytes  12.3 Gbits/sec                  receiver
This shows that I have 12.3Gbits/sec using IP between my 2 Infiniband hosts. We know from the earlier hardware blog post that the card in my workstation is limited by available PCIe lanes to 16Gbps max. Given that's the maximum at the PCIe layer, and there is overhead in the IB stack and the IP stack, 12.3Gbits/sec seems pretty fast.
I can now go ahead and use my fast IB network just as I do my normal 1Gbps ethernet, but get 10x the speed by using the IP addresses associated with the Infiniband interfaces. When I go on to try out software that is written specifically to use native Infiniband verbs and RDMA instead of the IP layer I should get closer to the 16Gbps PCIe limit.
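A quick way to get a feel for what the native verbs path can do, without the IP overhead, is the perftest suite (I'm assuming the package is simply called 'perftest' on Fedora). A rough sketch using RDMA writes between the two hosts:
# on the server machine
ib_write_bw -d mlx4_0
# on the client machine, pointing at the server's IPoIB address
ib_write_bw -d mlx4_0 10.1.1.215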