Sorry about the slightly-clickbaity title. I actually have at least a 10 GbE card (and switch) on the way to test those and see if I can get more out of it, but for _this_ test, I had a 4-interface Intel I340-T4, and I managed to get a maximum throughput of 3.06 Gbps when pumping bits through all 4 of those plus the built-in Gigabit interface on the Compute Module.
For some reason I couldn't break that barrier, even though all the interfaces can do ~940 Mbps on their own, and any three on the PCIe card can do ~2.8 Gbps. It seems like there's some sort of upper limit around 3 Gbps on the Pi CM4 (even when combining the internal interface) :-/
But maybe I'm missing something in the Pi OS / Debian/Linux kernel stack that is holding me back? Or is it a limitation of the SoC? I thought the Ethernet chip was separate from the PCIe lanes, but maybe there's something internal to the BCM2711 that's bottlenecking it.
Also... tons more detail here:
https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...
It's a single-lane PCIe Gen 2 interface. The theoretical max is 500 MB/sec, so you can't ever touch 10G with it. In reality, getting 75% of theoretical tends to be a rough upper limit on most PCIe interfaces, so the 3 Gbit you're seeing is pretty close to what one would expect.
edit: Oh, it's 3 Gbit across 5 interfaces, one of which isn't PCIe, so the PCIe side is probably only running at about 50%. It might be interesting to see if the CPUs are pegged (or just one of them). Even so, PCIe on the RPi isn't coherent, so that is going to slow things down too.
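To spell out the back-of-the-envelope behind that estimate (rule-of-thumb numbers, not measurements):
# PCIe Gen 2 x1: 5 GT/s raw, 8b/10b encoding leaves ~80% for data,
# and ~75% of that is a typical practical ceiling:
echo "5 * 0.8 * 0.75" | bc -l   # ≈ 3.0 Gbit/s, right about where the CM4 tops out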
It looks like the problem is that `ksoftirqd` gets pegged at 100% and the system just queues up packets, slowing everything down. See:
https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...
I would suggest you go ahead and try jumbo frames[0] as that will significantly decrease the CPU load and overhead.
I would also suggest using taskset[1] on each iperf server process to bind them each to a different cpu core.
Finally, I would suggest UDP on iperf and let the sending Pis just completely saturate the link.
If you do all that, I think you have a good chance at achieving 3.5Gbps over just the Intel card.
0:
https://blah.cloud/hardware/test-jumbo-frames-working/
1:
https://linux.die.net/man/1/taskset
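Pulling those three suggestions together, a minimal sketch of the tuning (interface names eth1-eth4, core numbering, and iperf3 ports are assumptions to adapt to your setup):
# Jumbo frames on each Intel port (the link partners need the same MTU):
for i in 1 2 3 4; do sudo ip link set "eth$i" mtu 9000; done
# One iperf3 server per port, each pinned to its own core with taskset:
for i in 1 2 3 4; do taskset -c "$((i - 1))" iperf3 -s -p "$((5200 + i))" & done
# Then, from each sending Pi, saturate its link with UDP instead of TCP:
#   iperf3 -c <receiver-ip-for-that-port> -p 520N -u -b 0 -t 30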
This is common even on x86 systems.
You have to set the irq affinity to utilize the available CPU cores.
There is a script included with the source you used to compile drivers called "set_irq_affinity"
Ex (sets IRQ affinity for all available cores):
[path-to-i40epackage]/scripts/set_irq_affinity -x all ethX
So like …? — this didn't make a difference in the overall performance :(
Looks like the script needs to be adjusted to function on the Pi.
I wish I had the cycles and the kit on hand to play with this!
So, this is sorta indicative of an RSS problem, but on the RPi it could be caused by other things. Check /proc/interrupts to make sure you have balanced MSIs, although that itself could be a problem too.
edit: run `perf top` to see if that gives you a better idea.
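A rough sketch of what checking (and manually pinning) that could look like, assuming the Intel ports show up under the igb driver; the IRQ and CPU numbers below are placeholders:
# See whether the NIC's MSI-X vectors are spread across cores or piled on CPU0:
grep -iE 'eth|igb' /proc/interrupts
# Pin one queue's IRQ to a specific core (read the real IRQ number from the
# first column of /proc/interrupts; 55 and core 2 here are made up):
echo 2 | sudo tee /proc/irq/55/smp_affinity_list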
Results:
15.96%  [kernel]  [k] _raw_spin_unlock_irqrestore
12.81%  [kernel]  [k] mmiocpy
 6.26%  [kernel]  [k] __copy_to_user_memcpy
 6.02%  [kernel]  [k] __local_bh_enable_ip
 5.13%  [igb]     [k] igb_poll
When it hit full blast, I started getting "Events are being lost, check IO/CPU overload!"
Another idea would be to increase interrupt coalescing via ethtool -c/-C.
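For example (a sketch only; eth1 and the rx-usecs value are placeholders, and the link partner would want the same change):
ethtool -c eth1                      # show current coalescing settings
sudo ethtool -C eth1 rx-usecs 100    # batch more packets per interrupt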
>It might be interesting to see if the CPUs are pegged (or just one of them).
This is very likely the answer. I see a lot of people who think of the Pi as some kind of workhorse and are trying to use it for things that it simply can't do. The Pi is a great little piece of hardware, but it's not really made for this kind of thing. I'd never think about using a Raspberry Pi if I had to think about "saturating a NIC".
Well it can saturate up to two, and almost three, gigabit NICs now. So not too shabby.
But I like to know the limits so I can plan out a project and know whether I'm safe using a Pi, or whether I need a 3-5x more expensive board or a small PC :)
>Sorry about the slightly-clickbaity title.
Well yes, because 5 Gbps Ethernet is actually a thing (NBase-T or 5GBASE-T). So 1 Gbps x 5 would be more accurate.
Can't wait to see results on 10GbE though :)
P.S. I really wish 5 Gbps Ethernet were more common.
My AT&T router, made by Nokia, has one 5 GbE port, and the fiber plugs in directly with SFP!
True true... though in my work trying to get a flexible 10 GbE network set up in my house, I've found that support for 2.5 and 5 GbE is iffy at best on many devices :(
The "best" I've found so far (and gives you options to go 2.5GbE/5GbE
1:
https://www.amazon.com/UGREEN-Ethernet-Thunderbolt-Converter...
(USB-C)
2:
https://www.amazon.com/2-5GBase-T-Ethernet-Controller-Standa...
(Don't get the knock-off version of this, the brackets aren't the right sizes.) (PCIe)
The expensive ones I'm waiting to arrive:
3: Either a second hand Intel X520-DA1 card or the "refurb" from AliExpress
and
https://mikrotik.com/product/crs305_1g_4s_in
with RJ45 SFP+ modules. Then cry at how much you just spent.
I assume you saw the video with Plunkett that the RPF put out (specifically interesting at 10:45). He mentioned he was testing 10 GbE fibre and reached 3.2 Gbit. Now he goes into absolutely zero detail on that, but I find it interesting you've both hit the same ceiling.
(He also mentioned 390 MB/sec write speed to NVMe, which is suspiciously close to the same ceiling.)
Yeah, I think the PCIe link hits a ceiling around there.
Note that combining the internal interface with the 4 NIC interfaces, and overclocking to 2.147 GHz, got it up to 3.4 Gbps total. So IRQ handling is the main bottleneck when it comes to total network packet throughput.
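For reference, the overclock is just the usual Pi 4/CM4 knobs in /boot/config.txt; the over_voltage value below is only an example, not necessarily what was used:
# /boot/config.txt
over_voltage=6
arm_freq=2147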
Since you're also from the midwest, I'll put it in terms you'll understand: :-)
> _I think the PCIe link hits a ceiling around there._
You're trying to shove 10 gallons of shit into a 5-gallon bucket!
--
I'm not sure how high you can set the MTU on those Pis (the Intels should handle 9000), but I'd set them as high as they'll go if I were you. An MTU of 9000 basically means ~1/6th the interrupts.
Awesome work. Been watching your videos on these (the video card one was especially interesting).
At what point are you saturating the poor little ARM CPU (or its tiny PCIe interface)?
Heh, I know that ~3 Gbps is the maximum you can get through the PCIe interface (x1, PCIe 2.0), so that is expected. But I was hoping the internal Ethernet interface was separate and could add 1 Gbps more... the CPU didn't seem to be maxed out and was also not overheating at the time (especially not with my 12" fan blasting on it).
With some tuning you should be able to saturate the PCIe x1 slot.
Excellent reading on this available here :
http://www.intel.com/content/dam/doc/application-note/82575-...
and here :
https://blog.cloudflare.com/how-to-achieve-low-latency/
_Edit: with the inbound 10 Gb card referenced_
Was all this TCP? You might try UDP as well, in case you're hitting a bottleneck in the TCP stack.
Jeff,
First off, thank you for doing this kind of 'r&d', it is really exciting to see what the Pi is capable of after less than a decade.
Would you be interested in someone testing a SAS PCI card? I'm going to pick up one of these as soon as they're not backordered...
Do you think an SFP+ nic would work? It would be cool to try out fiber.
There is no SFP option on 5 Gbps NICs, as I understand it, per the standard.
You might be hitting the limits of the RAM. I think LPDDR3 maxes out at ~4.2Gbps, and running other bus masters like the HDMI and OS itself would be cutting into that.
32-bit LPDDR4-3200 should give 12.8 Gbytes/s which is 102 Gbits/s.
You can't just multiply width*frequency for DRAM these days, as much as I wish we still lived in the days of ubiquitous SRAM.
The chip in some of the 2GB RPI4s is rated for only 3.7Gbps.
https://www.samsung.com/semiconductor/dram/lpddr4/K4F6E304HB...
No, that chip is rated for 3.7 Gbps _per pin_ and it's 32 bits wide. Even at ~60% efficiency you're an order of magnitude off.
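Back-of-the-envelope with the quoted figures (just arithmetic on the numbers above):
# 3.7 Gbit/s per pin * 32 pins, divided by 8 bits/byte, at ~60% efficiency:
echo "3.7 * 32 / 8 * 0.6" | bc -l   # ≈ 8.9 GB/s, versus ~0.46 GB/s if 3.7 Gbps were the total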
Real world tests are seeing around 3 to 4 Gbps of memory bandwidth.
https://medium.com/@ghalfacree/benchmarking-the-raspberry-pi...
LPDDR cannot sustain anywhere near the max speed of the interface. It's designed more around bursting something out and going back to sleep than maintaining that speed. In a lot of ways DRAM hasn't gotten faster in decades: latency measured in clocks nearly always increases at the same rate that interface speed does. And LPDDR is the niche where that shows up the most, because it doesn't have oodles of dies to interleave to hide the issue.
Innumeracy strikes again. It's actually 4-5 Gbytes/s [1] plus whatever bandwidth the video scanout is stealing (~400 Mbytes/s?). That's only ~40% efficient which is simultaneously terrible and pretty much what you'd expect from Broadcom. However 4 Gbytes/s is 32 Gbits/s which leaves plenty of headroom to do 5 Gbits/s of network I/O.
[1]
https://www.raspberrypi.org/forums/viewtopic.php?t=271121
Those numbers look way off, maybe they mixed up the units? Should be a few GBps at least.
Bits aren't bytes.
The y axis is labeled "megabits per second".
The y axis is wrong.
Is there a way to see if you are hitting memory bandwidth issues in Linux?
Not in a holistic way AFAIK, and for sure not rigged up to the Raspbian kernel (since all of that lives on the videocore side), but I bet Broadcom or the RPi foundation has access to some undocumented perf counters on the DRAM controller that could illuminate this if they were the ones debugging it.
Instead of lying and then apologizing once you get what you want, it would be better to just not lie in the first place.
Technically it's not a lie—there are 5x1 Gbps of interfaces here. But I wanted to acknowledge that I used a technicality to get the title how I wanted it, because if I didn't do that, a lot of people wouldn't read it, and then we wouldn't get to have this enlightening discussion ;)
You could hook up a 100 Gbps card, but that wouldn't make it 100 Gbps Ethernet on a Raspberry Pi.
It would, but it wouldn't be able to push 100 Gbps. It's not lying.
_So theoretically, 5 Gbps was possible_
No, it is not. That NIC is a PCIe Gen 2 NIC. By using only a single lane, you're limiting the bandwidth to ~500 MB/sec theoretical. That's 4 Gb/s theoretical, and getting 3 Gb/s is ~75% of the theoretical bandwidth, which is pretty decent.
I'll take pretty decent, then :)
I mean, before this the most I had tested successfully was a little over 2 Gbps with three NICs on a Pi 4 B.
Can you run an `lspci -vvv` on the Intel NIC? I just re-read things, and it seems like 1 of those Gb/s is coming from the on-board NIC. I'm curious if maybe PCIe is running at Gen 1.
Here you go!
So it's running Gen 2 x1, which is good. I was afraid that it might have downshifted to Gen 1. Other threads point to your CPU being pegged, and I would tend to agree with that.
What direction are you running the streams in? In general, sending is much more efficient than receiving ("it's better to give than to receive"). From your statement that ksoftirqd is pegged, I'm guessing you're receiving.
I'd first see what bandwidth you can send at with iperf when you run the test in reverse, so this Pi is sending. Then, to eliminate memory bandwidth as a potential bottleneck, you could use sendfile. I don't think iperf ever supported sendfile (but it's been years since I've used it). I'd suggest installing netperf on this Pi, running netserver on its link partners, and running "netperf -t TCP_SENDFILE -H othermachine" to all 5 peers and seeing what happens.
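A rough sketch of that test, with placeholder peer addresses for the five link partners:
# On each link partner, start the daemon:  netserver
# Then from this Pi, one sending stream per peer, all in parallel:
for peer in 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2; do
  netperf -t TCP_SENDFILE -H "$peer" -l 30 &
done
wait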
Well, when a LAN is 1 Gb/s, they are actually not talking about real bits. It's actually 100 MB/s max, not 125 MB/s as one might expect. Back in the old days they used to call it baud.
This is wrong; 1 Gbps Ethernet is 125 MB/s (that figure includes headers/trailers and the inter-packet gap, so you only get ~117 in practice). InfiniBand, SATA, and Fibre Channel cheat, but Ethernet doesn't.
I think I've found the bottleneck now that I have the setup up and running again today—ksoftirqd quickly hits 100% CPU and stays that way until the benchmark run completes.
See:
https://github.com/geerlingguy/raspberry-pi-pcie-devices/iss...
You might want to try enabling jumbo frames by setting the MTU to something >1500 bytes. Doing so should reduce the number of IRQs per unit of time since each frame will be carrying more data and therefore there will be fewer of them.
According to the Intel 82580EB datasheet[1] it supports an MTU of "9.5KB." It's unclear if that means 9500 or 9728 bytes.
I looked briefly for a datasheet that includes the Ethernet specs of the BCM2711 but didn't immediately find anything.
Recent versions of iproute2 can output the maximum MTU of an interface via:
# Look for "maxmtu" in the output ip -d link list
Barring that, you can try incrementally upping the MTU until you run into errors (a rough sketch of that follows after the footnote below).
The MTU of an interface can be set via:
ip link set $interface mtu $mtu
Note that for symmetrical testing via direct crossover you'll want to have the MTU be the same on each interface pair.
[1]
https://www.intel.com/content/www/us/en/embedded/products/ne...
(pg. 25, "Size of jumbo frames supported")
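Following up on the "incrementally upping the MTU" suggestion above, a rough sketch (eth1 and the step values are placeholders):
for mtu in 4000 6000 8000 9000 9500 9700; do
  sudo ip link set eth1 mtu "$mtu" && echo "MTU $mtu OK" || { echo "MTU $mtu rejected"; break; }
done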
I set the MTU to its max (just over 9000 on the Intel, heh), but that didn't make a difference. The one thing that did move the needle was overclocking the CPU to 2.147 GHz (from the base 1.5 GHz clock), and that got me to 3.4 Gbps. So it seems to be a CPU constraint at this point.
Did you change the MTU on the other sides as well? If not, TCP will negotiate an MSS that makes the larger MTU go unused.
Oh shoot, completely forgot to do that, as I was testing a few things one after the other and it slipped my mind. I'll have to try and see if I can get a little more out.
I wonder if using a user-space TCP stack (or anything that could bypass the kernel) could push the number higher.
I would have a look at sending data with either DPDK (
https://doc.dpdk.org/burst-replay/introduction.html
) or AF_PACKET and mmap (
https://sites.google.com/site/packetmmap/
).
You can also use ethtool -C on the NICs on both ends of the connection to rate-limit the IRQ handling, allowing you to optimize for throughput instead of latency.
Seems to be in the same ballpark as when I got ~3.09Gbps on the Pi4's PCIe, but on a single 10G link:
https://twitter.com/q3k/status/1225588859716632576
Oh, nice! How did I not find your tweets in all my searching around?
Shitposting on Twitter makes for bad SEO :).
I can get about 1.2-1.7 gigabit on the Pi 4 using a 2.5 GbE USB NIC (Realtek). Some other testing shows the vendor driver to be faster, but when I tested it on a much faster ARM board, I could get the full 2.5 GbE with the in-tree driver.
A much easier option:
Get a USB 3.0 2.5G or 5G card. With fully functional DMA on the USB controller, it can get quite close to the PCIe option.
A setback for all Linux users at the moment:
The only chipmaker making USB NICs doing 2.5G+ is RealTek, and RealTek chose to use USB NCM API for their latest chips.
And as we know, Linux support for NCM is currently super slow and buggy.
I barely got 120 megs out of it. I'd welcome any kernel hacker taking on the problem.
> The only chipmaker making USB NICs doing 2.5G+ is RealTek, and RealTek chose to use USB NCM API for their latest chips.
QNAP QNA-UC5G1T uses Marvell AQtion AQC111U. Might be worth a try.
"I need four computers, and they all need gigabit network interfaces... where could I find four computers to do this?"
Why not loop the ports back to themselves? IIRC, 1 Gbit ports should autodetect when they're cross-connected, so it wouldn't even need special cables.
When you loop back Ethernet links in the same computer, you need to take care with the configuration, because normally the operating system will not route the Ethernet packets through the external wires but will process them as if they were destined for localhost, so you will see a very large speed with no relationship to the Ethernet speed.
How to force the packets through the external wires depends on the operating system. On Linux you must use namespaces and assign the two Ethernet interfaces that are looped on each other to two distinct namespaces, then set appropriate routes.
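A minimal sketch of the Linux side, moving one end into its own namespace (the other port could get its own namespace too, as described above); eth1/eth2 and the 10.99.0.x addresses are placeholders:
sudo ip netns add loop-test
sudo ip link set eth2 netns loop-test
sudo ip addr add 10.99.0.1/24 dev eth1
sudo ip link set eth1 up
sudo ip netns exec loop-test ip addr add 10.99.0.2/24 dev eth2
sudo ip netns exec loop-test ip link set eth2 up
# Traffic between the two addresses now has to cross the physical cable:
sudo ip netns exec loop-test iperf3 -s &
iperf3 -c 10.99.0.2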
Would that truly be able to test send / receive of a full (up to) gigabit of data to/from the interface? If it's loopback, it could test either sending 500 + receiving 500, or... sending 500 + receiving 500. It's like sending data through localhost, it doesn't seem to reflect a more real-world scenario (but could be especially helpful just for testing).
I think maybe they meant linking Port 1 to Port 2, and Port 3 to Port 4? Also I believe gigabit ethernet can be full duplex, so you should be able to send 1000 and receive 1000 on a single interface at the same time if it's in full duplex mode.
It's full-duplex, that's 1000 Mbps in each direction simultaneously.
It's probably outside the scope (and possibly cheating), but could a DPDK stack & supported NIC[1] push you past the PCIe limit?
[1]
https://core.dpdk.org/supported/
Does DPDK actually let you not have to DMA packet data over to the system memory and back?
No, you still have to send the data over the PCIe link, but DPDK basically takes the kernel networking stack out of the path, so you are just streaming data straight to the NIC. The kernel won't need to deal with IRQs or timing or anything like that.
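For what it's worth, a sketch of the usual DPDK smoke test on a supported NIC, assuming a recent DPDK release (the PCI address, core list, and hugepage size are placeholders):
sudo dpdk-hugepages.py --setup 256M                      # reserve hugepages for DPDK
sudo dpdk-devbind.py --bind=vfio-pci 0000:01:00.0        # detach the port from the kernel driver
sudo dpdk-testpmd -l 0-1 -n 2 -- --forward-mode=txonly   # generate and transmit packets from userspace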
I might be making things up, but I believe you can also run code on some DPDK NICs? i.e., beyond straight networking offload. If that's the case, you could try compressing the data before you DMA it to the NIC. This would make no sense normally, but if your bottleneck is in fact the PCIe x1 link and you want to saturate the network, it would be something worth trying.
I mean, really, the whole thing is at most a fun exercise, as the NIC costs more than the Pi.
Could be the first pi to mine crypto on a NIC. 30 years later...
Did you test without main Ethernet connection?
Yes.
Meaning that using the on-board Ethernet will not increase or decrease the bandwidth?
It seems like there are two limits: the PCIe bus at about 3.2 Gbps, and total network bandwidth at about 3 Gbps. So the total network bandwidth limits the 4x card, and also limits any combination of card interfaces and the built-in interface (I tested many combos).
Overclocking can get the total net throughput to 3.4 Gbps.
That was a fun read. Thanks.