5Gbps Ethernet on the Raspberry Pi Compute Module 4

Author: geerlingguy

Score: 187

Comments: 76

Date: 2020-10-30 18:25:59

________________________________________________________________________________

geerlingguy wrote at 2020-10-30 18:29:23:

Sorry about the slightly-clickbaity title. I actually have at least a 10 GbE card (and switch) on the way to test those and see if I can get more out of it, but for _this_ test, I had a 4-interface Intel I340-T4, and I managed to get a maximum throughput of 3.06 Gbps when pumping bits through all 4 of those plus the built-in Gigabit interface on the Compute Module.

For some reason I couldn't break that barrier, even though all the interfaces can do ~940 Mbps on their own, and any three on the PCIe card can do ~2.8 Gbps. It seems like there's some sort of upper limit around 3 Gbps on the Pi CM4 (even when combining the internal interface) :-/

But maybe I'm missing something in the Pi OS / Debian/Linux kernel stack that is holding me back? Or is it a limitation on the SoC? I thought the Ethernet chip was separate from the PCIe lanes on it, but maybe there's something internal to the BCM2711 that's bottlenecking it.

Also... tons more detail here:

https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...

StillBored wrote at 2020-10-30 20:07:48:

It's a single-lane PCIe Gen 2 interface. The theoretical max is 500 MB/sec, so you can't ever touch 10G with it. In reality, getting 75% of theoretical tends to be a rough upper limit on most PCIe interfaces, so the 3 Gbit you're seeing is pretty close to what one would expect.

edit: Oh, it's 3 Gbit across 5 interfaces, one of which isn't PCIe, so the PCIe side is probably only running at about 50%. It might be interesting to see if the CPUs are pegged (or just one of them). Even so, PCIe on the RPi isn't coherent, so that is going to slow things down too.

geerlingguy wrote at 2020-10-30 20:39:46:

It looks like the problem is `ksoftirqd` gets pegged at 100% and the system just queues up packets, slowing everything down. See:

https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...

InvaderFizz wrote at 2020-10-30 23:41:35:

I would suggest you go ahead and try jumbo frames[0], as that will significantly decrease the CPU load and overhead.

I would also suggest using taskset[1] on each iperf server process to bind each one to a different CPU core.

Finally, I would suggest UDP on iperf and letting the sending Pis just completely saturate the link.

If you do all that, I think you have a good chance of hitting 3.5Gbps over just the Intel card (rough command sketch below).

0:

https://blah.cloud/hardware/test-jumbo-frames-working/

1:

https://linux.die.net/man/1/taskset
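
A rough sketch of all three suggestions combined, with made-up interface names, ports, cores, and addresses (adjust to whatever your setup actually uses):

  # On the CM4: jumbo frames (repeat for eth2-eth4, and on each link partner)
  ip link set eth1 mtu 9000
  # On the CM4: one iperf server per interface, each pinned to its own core
  taskset -c 0 iperf -s -p 5001 &
  taskset -c 1 iperf -s -p 5002 &
  taskset -c 2 iperf -s -p 5003 &
  taskset -c 3 iperf -s -p 5004 &
  # On each sending machine: saturate its link with UDP toward the matching port
  iperf -c 10.0.1.1 -u -b 1000M -p 5001 -t 30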

dualboot wrote at 2020-10-30 21:08:05:

This is common even on x86 systems.

You have to set the irq affinity to utilize the available CPU cores.

There is a script called "set_irq_affinity" included with the source you used to compile the drivers.

Example (sets IRQ affinity for all available cores):

[path-to-i40epackage]/scripts/set_irq_affinity -x all ethX

geerlingguy wrote at 2020-10-30 21:12:03:

So like

https://pastebin.com/2Z4UECPq

? — this didn't make a difference in the overall performance :(

dualboot wrote at 2020-10-30 21:20:35:

Looks like the script needs to be adjusted to function on the Pi.

I wish I had the cycles and the kit on hand to play with this!
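
For what it's worth, the core of what that script does can be reproduced by hand through /proc. A rough sketch, with placeholder IRQ numbers (read the real ones from /proc/interrupts):

  # See which MSI vectors the interfaces registered and how the counts are spread
  grep eth /proc/interrupts
  # Stop irqbalance first if it's running, or it will rewrite the masks
  systemctl stop irqbalance
  # Pin each vector to a different core; the masks are hex CPU bitmaps (1,2,4,8 = cores 0-3)
  echo 1 > /proc/irq/55/smp_affinity
  echo 2 > /proc/irq/56/smp_affinity
  echo 4 > /proc/irq/57/smp_affinity
  echo 8 > /proc/irq/58/smp_affinity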

StillBored wrote at 2020-10-30 20:45:41:

So, this is sorta indicative of an RSS problem, but on the RPi it could be caused by other things. Check /proc/interrupts to make sure you have balanced MSIs, although that itself could be a problem too.

edit: run `perf top` to see if that gives you a better idea.

geerlingguy wrote at 2020-10-30 21:01:46:

Results:

    15.96%  [kernel]  [k] _raw_spin_unlock_irqrestore
    12.81%  [kernel]  [k] mmiocpy
     6.26%  [kernel]  [k] __copy_to_user_memcpy
     6.02%  [kernel]  [k] __local_bh_enable_ip
     5.13%  [igb]     [k] igb_poll

When it hit full blast, I started getting "Events are being lost, check IO/CPU overload!"

SoapSeller wrote at 2020-10-30 21:09:12:

Another idea would be to increase interrupt coalescing via ethtool -c/-C.
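
Something along these lines should do it; which coalescing parameters are actually tunable depends on the driver, so treat it as a sketch:

  # Show the current coalescing settings for the interface
  ethtool -c eth1
  # Hold RX interrupts for up to ~200 microseconds so each one services more packets
  ethtool -C eth1 rx-usecs 200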

leptons wrote at 2020-10-30 21:45:45:

>It might be interesting to see if the CPUs are pegged (or just one of them).

This is very likely the answer. I see a lot of people who think of the Pi as some kind of workhorse and are trying to use it for things that it simply can't do. The Pi is a great little piece of hardware, but it's not really made for this kind of thing. I'd never think about using a Raspberry Pi if I had to think about "saturating a NIC".

geerlingguy wrote at 2020-10-30 22:21:07:

Well it can saturate up to two, and almost three, gigabit NICs now. So not too shabby.

But I like to know the limits so I can plan out a project and know whether I'm safe using a Pi, or whether I need a 3-5x more expensive board or a small PC :)

ksec wrote at 2020-10-30 19:42:10:

>Sorry about the slightly-clickbaity title.

Well, yes, because 5Gbps Ethernet is actually a thing (NBASE-T or 5GBASE-T). So 1Gbps x 5 would be more accurate.

Can't wait to see results on 10GbE though :)

P.S. I really wish 5Gbps Ethernet were more common.

ncrmro wrote at 2020-10-30 19:45:25:

My AT&T router, made by Nokia, has one 5GbE port, and the fiber plugs in directly with SFP!

geerlingguy wrote at 2020-10-30 20:07:52:

True true... though in my work trying to get a flexible 10 GbE network set up in my house, I've found that support for 2.5 and 5 GbE is iffy at best on many devices :(

voltagex_ wrote at 2020-10-31 02:41:51:

The "best" I've found so far (and gives you options to go 2.5GbE/5GbE

1:

https://www.amazon.com/UGREEN-Ethernet-Thunderbolt-Converter...

(USB-C)

2:

https://www.amazon.com/2-5GBase-T-Ethernet-Controller-Standa...

(Don't get the knock-off version of this, the brackets aren't the right sizes.) (PCIe)

The expensive ones I'm waiting to arrive:

3: Either a second hand Intel X520-DA1 card or the "refurb" from AliExpress

and

https://mikrotik.com/product/crs305_1g_4s_in

with RJ45 SFP+ modules. Then cry at how much you just spent.

soneil wrote at 2020-10-30 23:33:24:

I assume you saw the video with Plunkett that the RPF put out (

https://youtu.be/yiHgmNBOzkc

specifically interesting at 10:45) - he mentioned he was testing 10GbE fibre and reached 3.2 Gbit. Now he goes into absolutely zero detail on that, but I find it interesting that you've both hit the same ceiling.

(He also mentioned a 390MB/sec write speed to NVMe, which is suspiciously close to the same ceiling.)

geerlingguy wrote at 2020-10-31 00:34:27:

Yeah, I think the PCIe link hits a ceiling around there.

Note that combining the internal interface with the 4 NIC interfaces and overclocking to 2.147 GHz got it up to 3.4 Gbps total. So IRQ handling is the main bottleneck when it comes to total network packet throughput.

jlgaddis wrote at 2020-10-31 03:24:51:

Since you're also from the midwest, I'll put it in terms you'll understand: :-)

> _I think the PCIe link hits a ceiling around there._

You're trying to shove 10 gallons of shit into a 5-gallon bucket!

--

I'm not sure how high you can set the MTU on those Pis (the Intels should handle 9000), but I'd set them as high as they'll go if I were you. An MTU of 9000 basically means ~1/6th the interrupts.

mmastrac wrote at 2020-10-30 18:38:57:

Awesome work. Been watching your videos on these (the video card one was especially interesting).

At what point are you saturating the poor little ARM CPU (or its tiny PCIe interface)?

geerlingguy wrote at 2020-10-30 18:45:13:

Heh, I know that ~3 Gbps is the maximum you can get through the PCIe interface (x1, PCIe 2.0), so that is expected. But I was hoping the internal Ethernet interface was separate and could add 1 Gbps more... the CPU didn't seem to be maxed out and was also not overheating at the time (especially not with my 12" fan blasting on it).

dualboot wrote at 2020-10-30 19:07:34:

With some tuning you should be able to saturate the PCIe x1 slot.

Excellent reading on this is available here:

http://www.intel.com/content/dam/doc/application-note/82575-...

and here:

https://blog.cloudflare.com/how-to-achieve-low-latency/

_Edit : with the inbound 10Gb card referenced_

toast0 wrote at 2020-10-30 20:32:22:

Was all this TCP? You might try UDP as well, in case you're hitting a bottleneck in the TCP stack.

stratosmacker wrote at 2020-10-30 20:23:04:

Jeff,

First off, thank you for doing this kind of 'R&D'; it is really exciting to see what the Pi is capable of after less than a decade.

Would you be interested in someone testing a SAS PCI card? I'm going to pick up one of these as soon as they're not backordered...

wil421 wrote at 2020-10-30 19:07:02:

Do you think an SFP+ NIC would work? It would be cool to try out fiber.

baybal2 wrote at 2020-10-30 21:35:38:

There is no SFP option on 5Gbps NICs, as I understand it, per the standard.

monocasa wrote at 2020-10-30 19:38:08:

You might be hitting the limits of the RAM. I think LPDDR3 maxes out at ~4.2Gbps, and running other bus masters like the HDMI and OS itself would be cutting into that.

wmf wrote at 2020-10-30 19:54:09:

32-bit LPDDR4-3200 should give 12.8 Gbytes/s which is 102 Gbits/s.

monocasa wrote at 2020-10-30 20:02:50:

You can't just multiply width*frequency for DRAM these days, as much as I wish we still lived in the days of ubiquitous SRAM.

The chip in some of the 2GB RPI4s is rated for only 3.7Gbps.

https://www.samsung.com/semiconductor/dram/lpddr4/K4F6E304HB...

wmf wrote at 2020-10-30 20:15:52:

No, that chip is rated for 3.7 Gbps _per pin_ and it's 32 bits wide. Even at ~60% efficiency you're an order of magnitude off.

monocasa wrote at 2020-10-30 20:25:27:

Real world tests are seeing around 3 to 4 Gbps of memory bandwidth.

https://medium.com/@ghalfacree/benchmarking-the-raspberry-pi...

LPDDR cannot sustain anywhere near the max speed of the interface. It's more of a hope that you can burst something out and go to sleep rather than trying to maintain that speed. In a lot of ways DRAM hasn't gotten faster in decades when you look at how latency clocks nearly always increase at the same rate as interface speeds. And LPDDR is the niche where that shows up the most, because it doesn't have oodles of dies to interleave to hide that issue.

wmf wrote at 2020-10-30 20:38:56:

Innumeracy strikes again. It's actually 4-5 Gbytes/s [1] plus whatever bandwidth the video scanout is stealing (~400 Mbytes/s?). That's only ~40% efficient which is simultaneously terrible and pretty much what you'd expect from Broadcom. However 4 Gbytes/s is 32 Gbits/s which leaves plenty of headroom to do 5 Gbits/s of network I/O.

[1]

https://www.raspberrypi.org/forums/viewtopic.php?t=271121

hedgehog wrote at 2020-10-30 20:37:08:

Those numbers look way off, maybe they mixed up the units? Should be a few GBps at least.

mlyle wrote at 2020-10-30 20:51:33:

Bits aren't bytes.

monocasa wrote at 2020-10-30 20:59:59:

The y axis is labeled "megabits per second".

Dylan16807 wrote at 2020-10-31 01:56:50:

The y axis is wrong.

mmastrac wrote at 2020-10-30 20:05:18:

Is there a way to see if you are hitting memory bandwidth issues in Linux?

monocasa wrote at 2020-10-30 20:07:36:

Not in a holistic way AFAIK, and for sure not rigged up to the Raspbian kernel (since all of that lives on the videocore side), but I bet Broadcom or the RPi foundation has access to some undocumented perf counters on the DRAM controller that could illuminate this if they were the ones debugging it.

CyberDildonics wrote at 2020-10-30 20:58:46:

Instead of lying and then apologizing once you get what you want, it would be better to just not lie in the first place.

geerlingguy wrote at 2020-10-30 21:05:24:

Technically it's not a lie—there are 5 x 1 Gbps interfaces here. But I wanted to acknowledge that I used a technicality to get the title how I wanted it, because if I didn't do that, a lot of people wouldn't read it, and then we wouldn't get to have this enlightening discussion ;)

CyberDildonics wrote at 2020-10-30 23:00:34:

You could hook up a 100 Gbps card, but that wouldn't make it 100 Gbps Ethernet on a Raspberry Pi.

IntelMiner wrote at 2020-10-31 03:58:11:

It would, but it wouldn't be able to push 100 Gbps. It's not lying.

drewg123 wrote at 2020-10-30 20:08:58:

_So theoretically, 5 Gbps was possible_

No, it is not. That NIC is a PCIe Gen2 NIC. By using only a single lane, you're limiting the bandwidth to ~500MB/sec theoretical. That's 4Gb/s theoretical, and getting 3Gb/s is ~75% of the theoretical bandwidth, which is pretty decent.

geerlingguy wrote at 2020-10-30 20:34:03:

I'll take pretty decent, then :)

I mean, before this the most I had tested successfully was a little over 2 Gbps with three NICs on a Pi 4 B.

drewg123 wrote at 2020-10-30 20:52:42:

Can you run an lspci -vvv on the Intel NIC? I just re-read things, and it seems like 1 of those Gb/s is coming from the on-board NIC. I'm curious if maybe PCIe is running at Gen1

geerlingguy wrote at 2020-10-30 21:06:45:

Here you go!

https://pastebin.com/A8gsGz3t

drewg123 wrote at 2020-10-30 21:51:26:

So it's running Gen2 x1, which is good. I was afraid that it might have downshifted to Gen1. Other threads point to your CPU being pegged, and I would tend to agree with that.

What direction are you running the streams in? In general, sending is much more efficient than receiving ("it's better to give than to receive"). From your statement that ksoftirqd is pegged, I'm guessing you're receiving.

I'd first see what bandwidth you can send at with iperf when you run the test in reverse, so this Pi is sending. Then, to eliminate memory bandwidth as a potential bottleneck, you could use sendfile. I don't think iperf ever supported sendfile (but it's been years since I've used it). I'd suggest installing netperf on this Pi, running netserver on its link partners, and running "netperf -tTCP_SENDFILE -H othermachine" to all 5 peers to see what happens.
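
For anyone following along, that last test against all five link partners might look roughly like this (the peer addresses are made up, and netserver must already be running on each of them):

  for peer in 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2; do
    netperf -t TCP_SENDFILE -H "$peer" -l 30 &
  done
  wait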

stkdump wrote at 2020-10-30 20:35:20:

Well, when a LAN is 1Gb/s they are actually not talking about real bits. It actually is 100MB/s max, not 125MB/s as one might expect. Back in the old days they used to call it baud.

wmf wrote at 2020-10-30 21:07:25:

This is wrong; 1 Gbps Ethernet is 125 MB/s (including headers/trailer and inter-packet gap so you only get ~117 in practice). Infiniband, SATA, and Fibre Channel cheat but Ethernet doesn't.

geerlingguy wrote at 2020-10-30 20:32:24:

I think I've found the bottleneck now that I have the setup up and running again today—ksoftirqd quickly hits 100% CPU and stays that way until the benchmark run completes.

See:

https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...

iscfrc wrote at 2020-10-30 21:52:34:

You might want to try enabling jumbo frames by setting the MTU to something >1500 bytes. Doing so should reduce the number of IRQs per unit of time since each frame will be carrying more data and therefore there will be fewer of them.

According to the Intel 82580EB datasheet[1] it supports an MTU of "9.5KB." It's unclear if that means 9500 or 9728 bytes.

I looked briefly for a datasheet that includes the Ethernet specs of the Broadcom BCM2711 but didn't immediately find anything.

Recent versions of iproute2 can output the maximum MTU of an interface via:

  # Look for "maxmtu" in the output
  ip -d link list

Barring that, you can try incrementally upping the MTU until you run into errors (a quick probe loop is sketched below).

The MTU of an interface can be set via:

  ip link set $interface mtu $mtu

Note that for symmetrical testing via direct crossover you'll want to have the MTU be the same on each interface pair.

[1]

https://www.intel.com/content/www/us/en/embedded/products/ne...

(pg. 25, "Size of jumbo frames supported")
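
A quick way to check how large an MTU actually survives a given link is ping with fragmentation prohibited; a rough sketch, assuming a peer at 10.0.0.2 (payload = MTU minus 28 bytes of IP/ICMP headers):

  for mtu in 1500 4000 9000; do
    if ping -M do -c 1 -s $((mtu - 28)) 10.0.0.2 > /dev/null; then
      echo "MTU $mtu passes"
    else
      echo "MTU $mtu is too big for this path"
    fi
  done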

geerlingguy wrote at 2020-10-30 22:22:27:

I set the MTU to its max (just over 9000 on the Intel, heh), but that didn't make a difference. The one thing that did move the needle was overclocking the CPU to 2.147 GHz (from the base 1.5 GHz clock), and that got me to 3.4 Gbps. So it seems to be a CPU constraint at this point.

zamadatix wrote at 2020-10-30 23:11:05:

Did you change the MTU on the other sides as well? If not, TCP will negotiate an MSS that makes the larger MTU go unused.

geerlingguy wrote at 2020-10-31 00:35:31:

Oh shoot, completely forgot to do that, as I was testing a few things one after the other and it slipped my mind. I'll have to try and see if I can get a little more out.

neurostimulant wrote at 2020-10-30 22:56:17:

I wonder if using a user-space TCP stack (or anything that could bypass the kernel) could push the number higher.

syoc wrote at 2020-10-30 20:49:30:

I would have a look at sending data with either DPDK (

https://doc.dpdk.org/burst-replay/introduction.html

) or AF_PACKET and mmap (

https://sites.google.com/site/packetmmap/

)

You can also use ethtool -C on the NICs on both ends of the connection to rate-limit the IRQ handling, allowing you to optimize for throughput instead of latency.

q3k wrote at 2020-10-30 20:48:09:

Seems to be in the same ballpark as when I got ~3.09Gbps on the Pi4's PCIe, but on a single 10G link:

https://twitter.com/q3k/status/1225588859716632576

geerlingguy wrote at 2020-10-30 21:07:43:

Oh, nice! How did I not find your tweets in all my searching around?

q3k wrote at 2020-10-30 21:24:53:

Shitposting on Twitter makes for bad SEO :).

voltagex_ wrote at 2020-10-31 02:32:00:

I can get about 1.2-1.7 gigabit on the Pi 4 using a 2.5GbE USB NIC (Realtek). Some other testing shows the vendor driver to be faster, but when I tested it on a much faster ARM board, I could get the full 2.5GbE with the in-tree driver.

baybal2 wrote at 2020-10-30 19:10:54:

A much easier option:

Get a USB 3.0 2.5G or 5G card. With fully functional DMA on the USB controller, it can get quite close to the PCIe option.

A setback for all Linux users at the moment:

The only chipmaker making USB NICs doing 2.5G+ is RealTek, and RealTek chose to use the USB NCM API for their latest chips.

And as we know, Linux support for NCM is currently super slow and buggy.

I barely got 120 megs out of it. I'd welcome any kernel hacker taking on the problem.

vetinari wrote at 2020-10-30 20:13:48:

> The only chipmaker making USB NICs doing 2.5G+ is RealTek, and RealTek chose to use USB NCM API for their latest chips.

QNAP QNA-UC5G1T uses Marvell AQtion AQC111U. Might be worth a try.

unilynx wrote at 2020-10-30 19:29:32:

"I need four computers, and they all need gigabit network interfaces... where could I find four computers to do this?"

Why not loop the ports back to themselves? IIRC, 1 Gbit ports should auto-detect when they're cross-connected, so it wouldn't even need special cables.

adrian_b wrote at 2020-10-30 21:43:15:

When you loop back Ethernet links in the same computer, you need to take care with the configuration, because normally the operating system will not route the packets through the external wires but will process them as if they were destined for localhost, so you will see a very high speed that has no relationship to the Ethernet speed.

How to force the packets through the external wires depends on the operating system. On Linux you must use network namespaces: assign the two Ethernet interfaces that are looped to each other to two distinct namespaces, then set appropriate routes.
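
A minimal sketch of that setup, assuming eth1 is cabled directly to eth2 and using made-up addresses:

  # Give each looped interface its own namespace so traffic is forced onto the wire
  ip netns add ns1
  ip netns add ns2
  ip link set eth1 netns ns1
  ip link set eth2 netns ns2
  ip netns exec ns1 ip addr add 10.99.0.1/24 dev eth1
  ip netns exec ns2 ip addr add 10.99.0.2/24 dev eth2
  ip netns exec ns1 ip link set eth1 up
  ip netns exec ns2 ip link set eth2 up
  # A test between the two addresses now really crosses the cable
  ip netns exec ns2 iperf -s &
  ip netns exec ns1 iperf -c 10.99.0.2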

geerlingguy wrote at 2020-10-30 19:35:00:

Would that truly be able to test send / receive of a full (up to) gigabit of data to/from the interface? If it's loopback, it could test either sending 500 + receiving 500, or... sending 500 + receiving 500. It's like sending data through localhost; it doesn't seem to reflect a more real-world scenario (but could be especially helpful just for testing).

nitrogen wrote at 2020-10-30 19:59:17:

I think maybe they meant linking Port 1 to Port 2, and Port 3 to Port 4? Also, I believe gigabit Ethernet can be full-duplex, so you should be able to send 1000 and receive 1000 on a single interface at the same time if it's in full-duplex mode.
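
If you want to check the full-duplex point, iperf's dual-test mode pushes both directions at once; a quick sketch, assuming the far end is 10.0.1.2 and already running iperf -s:

  # -d sends and receives simultaneously over the same interface
  iperf -c 10.0.1.2 -d -t 30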

jlgaddis wrote at 2020-10-31 03:43:35:

It's full-duplex, that's 1000 Mbps in each direction simultaneously.

escardin wrote at 2020-10-30 19:34:24:

It's probably outside the scope (and possibly cheating), but could a DPDK stack & supported NIC[1] push you past the PCIe limit?

[1]

https://core.dpdk.org/supported/

q3k wrote at 2020-10-30 22:18:51:

Does DPDK actually let you not have to DMA packet data over to the system memory and back?

escardin wrote at 2020-10-31 01:45:40:

No, you still have to send the data over the PCIe link, but DPDK should basically offload all the network work to the NIC, so that you are just streaming data to it. The kernel won't need to deal with IRQs or timing or anything like that.

I might be making things up, but I believe you can also run code on DPDK NICs? I.e. beyond straight networking offload. If that's the case, you could try compressing the data before you DMA it to the NIC. This would make no sense normally, but if your bottleneck is in fact the PCIe x1 link and you want to saturate the network, it would be something worth trying.

I mean, really, the whole thing is at most a fun exercise, as the NIC costs more than the Pi.

geerlingguy wrote at 2020-10-31 02:06:43:

Could be the first Pi to mine crypto on a NIC. 30 years later...

nojokes wrote at 2020-10-30 23:52:59:

Did you test without main Ethernet connection?

geerlingguy wrote at 2020-10-31 00:36:39:

Yes.

nojokes wrote at 2020-10-31 01:06:20:

Meaning using the on-board Ethernet will not increase or decrease the bandwidth?

geerlingguy wrote at 2020-10-31 01:25:29:

It seems like there are two limits: the PCIe bus at about 3.2 Gbps, and total network bandwidth at about 3 Gbps. So the total network bandwidth limits the 4-port card, and also limits any combination of card interfaces and the built-in interface (I tested many combos).

Overclocking can get the total net throughput to 3.4 Gbps.

ProAm wrote at 2020-10-30 19:28:25:

That was a fun read. Thanks.