I've maintained a QEMU fork with pinning support and even co-authored a research paper on Linux pinning performance, and the results have been... underwhelming; "sadly", the Linux kernel does a pretty good job at scheduling :)
I advise pinning users to carefully measure the supposed performance improvement, as there is a tangible risk of spending time on imaginary gains.
I found the most gains in terms of... latency consistency. I had a VM with a GPU passed through for gaming. With the cores appropriately pinned, especially away from host tasks, there were no more random DPC latency spikes.
With no pinning they'd randomly go into the milliseconds -- with pinning it would stay in the _micro_ second range!
The result of this is games (and likely audio) performing much more favorably.
How much of this is cache coherency/in-fighting, scheduling, or simply host usage, I couldn't tell you. I was just happy to have my VM 'feel' native.
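If anyone wants to script this outside of libvirt's cputune, here's a rough Python sketch of the idea, assuming a recent QEMU that names its vCPU threads "CPU n/KVM"; the PID and core list are placeholders for your own setup:

```python
# Pin each QEMU vCPU thread to its own host core (Linux only).
# QEMU_PID and HOST_CORES are assumptions for illustration.
import os
import re

QEMU_PID = 12345           # PID of the running QEMU process
HOST_CORES = [4, 5, 6, 7]  # cores reserved for the guest

for tid in os.listdir(f"/proc/{QEMU_PID}/task"):
    with open(f"/proc/{QEMU_PID}/task/{tid}/comm") as f:
        name = f.read().strip()
    m = re.match(r"CPU (\d+)/KVM", name)
    if m:
        vcpu = int(m.group(1))
        # one host core per vCPU; wraps if vCPUs outnumber reserved cores
        os.sched_setaffinity(int(tid), {HOST_CORES[vcpu % len(HOST_CORES)]})
```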
There will always be a benefit to pinning vCPUs on the same NUMA node as their devices (VFIO or even SR-IOV). This is becoming increasingly important on hypervisors.
In a setup with a high level of container collocation on large EC2 instances, we've seen the opposite behavior at Netflix: default CFS performing badly. We've A/B tested our flavor of custom pinning and measured substantial benefits:
https://netflixtechblog.com/predictive-cpu-isolation-of-cont...
PMC data at scale is pretty clear: very often, CFS won't do the right thing and will leave bad HT neighbors on the same core, leading to L1 thrashing, or will keep a high level of imbalance between NUMA sockets, leading to a degraded LLC hit rate.
Thanks, that's a very interesting case.
I'll correct my statement to "_did_ a good job", and I appreciate the rigorous testing.
Not sure how you maintaining QEMU makes you a credible source for evaluating a scheduler's performance. It's apparent to me that the performance of the scheduler is a function of the workload, so YMMV.
I worked on a project where we collected detailed production runtime characteristics and evaluated scheduler algorithms against it. Tiny improvements made for massive savings.
I definitely correct my "does a good job" to "did a good job". But ultimately, I've advised a good deal of caution, which I think is fair, particularly considering that only a small fraction of companies have a compute scale where tiny improvements make massive savings.
At my last job we initially saw performance loss due to pinning; I think multiple QEMU I/O threads got pinned to a single CPU. It's very easy to do it wrong.
I have looked around a bit: it's complicated to get right, and most people doing it for gaming report very slight performance gains.
YMMV. We've seen millions of dollars' worth of cloud savings at Netflix by doing pinning right. Knowing that the task scheduler is also heavily forked in Google's kernel, I'm ready to bet they've seen an order of magnitude higher savings in their own DCs as well.
Agreed, in my case it became very useful on large boxes (96 physical cores). The performance gain was about 10%.
Would you mind sharing the paper on pinning? I'd be interested
Hello! I'll write you via email.
Kubernetes makes CPU pinning rather simple. You just need to meet the conditions for the Guaranteed QoS class.
https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...
We are running lots of Erlang on k8s and CPU pinning improves performance of Erlang schedulers tremendously.
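For reference, pinning only kicks in when the kubelet runs with the static CPU manager policy and the pod is Guaranteed with integer CPU requests. A quick sanity check from inside the container (a minimal sketch, nothing k8s-specific):

```python
# With exclusive cores assigned, the affinity mask should be a small,
# fixed set rather than every CPU on the node.
import os

cpus = os.sched_getaffinity(0)  # affinity of the current process
print(f"pinned to {len(cpus)} CPUs: {sorted(cpus)}")
```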
Is your setup open source? I'd love to know more about upsides of erlang/otp on top of k8s. Do you use hot code reloads?
Interesting. I would be curious to hear why pinning improves performance here. Is this something specific to the BEAM VM? Does this come at a hit to K8S scheduler flexibility?
I don't have experience with k8s, but with BEAM on a traditional system, if BEAM is using the bulk of your CPU, you'll tend to get better results if each of the (main) BEAM scheduler threads is pinned to one CPU thread. Then all of the BEAM scheduler balancing can work properly. If both the OS and BEAM are trying to balance things, you can end up with a lot of extra task movement or extra lock contention when a BEAM thread gets descheduled by the OS to run a different BEAM thread that wants the same lock.
On most of the systems I ran, we didn't tend to have much of anything running on BEAM's dirty schedulers or in other OS processes. If you have more of a mix of things, leaving things unpinned may work better.
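For anyone trying this: the binding is opt-in on the BEAM side. A minimal sketch, assuming `erl` is on your PATH; `+sbt db` enables the default scheduler-to-core bind, and `scheduler_bindings` shows what the VM chose:

```python
# Launch BEAM with scheduler binding and print the resulting bindings.
import subprocess

subprocess.run([
    "erl", "+sbt", "db", "-noshell", "-eval",
    'io:format("~p~n", [erlang:system_info(scheduler_bindings)]), init:stop().',
])
```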
Tangential, but does anyone know of a Windows utility for automatically pinning processes?
I like to keep up with several cryptocurrency prices on Coinbase, but the Coinbase Pro pages consume a pretty significant amount of CPU time. I'd love to be able to just shove all of those processes to a single CPU thread to reduce the impact on overall system performance.
I suppose it wouldn't be too hard to write a Python script that does this automatically: scan window titles to look for "Coinbase Pro", find the owning PID, then call SetAffinity...
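Something like this rough sketch, assuming psutil and pywin32 are installed (`TARGET` and the CPU choice are placeholders; note that Firefox tab content runs in separate child processes, so you may want to pin those as well):

```python
# Pin every process owning a window whose title mentions TARGET
# to a single logical CPU (Windows only).
import psutil
import win32gui
import win32process

TARGET = "Coinbase Pro"  # substring of the window/tab title
PINNED_CPUS = [0]        # shove matching processes onto logical CPU 0

def collect(hwnd, pids):
    if TARGET in win32gui.GetWindowText(hwnd):
        _, pid = win32process.GetWindowThreadProcessId(hwnd)
        pids.add(pid)
    return True  # keep enumerating

pids = set()
win32gui.EnumWindows(collect, pids)
for pid in pids:
    try:
        psutil.Process(pid).cpu_affinity(PINNED_CPUS)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
```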
The Windows Task Manager has the ability to set process affinity.
Well, yeah, but I'm looking for a way to automate it. If I restart Firefox, all those affinities get reset.
Does anyone know how the methods mentioned by the author map to 'taskset'?
Or numactl; the latter is where this really starts to make a lot of sense. The perf improvements from keeping individual threads/processes pinned to a small core group (say, one sharing an L2 cache on Arm machines) tend to be fairly trivial in comparison to what happens when something gets migrated to a different NUMA node with a large latency to the memory/resident cache data.
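For what it's worth, taskset is essentially a front-end for sched_setaffinity(2) (and numactl adds set_mempolicy/mbind for the memory side), so whatever the article does should map onto the same syscall. From Python, for instance:

```python
# Equivalent of `taskset -pc 2,3 <pid>` for the current process.
import os

os.sched_setaffinity(0, {2, 3})   # 0 = the calling process/thread
print(os.sched_getaffinity(0))    # verify: {2, 3}
```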
CPU pinning is pretty useful for virtual machines. I've used it myself to improve performance on a VFIO setup, by limiting which cores QEMU runs on and thus improving cache locality.
https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#CP...
What are other real-world uses of CPU pinning?
Databases and other high-throughput data infrastructure software use CPU pinning, also HPC. The reasons are similar: higher cache locality, reduced latency, and more predictable scheduling. It is most useful when the process is taking over part or all of the resources of the machine anyway.
Memory and PCIe lanes in larger systems can be attached to particular CPUs, or to subsections of a single CPU (AMD Threadripper/Epyc in particular), where traversing the inter-CPU/CCX links can cause latency or bandwidth issues.
The software will be pinned to CPU cores close to the RAM or PCIe device it is using.
I've only really seen it be an issue in crazy large-scale systems, or where you have 4 CPUs, but I haven't spent a huge amount of time on microsecond-critical workloads.
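Linux exposes the topology needed for this in sysfs, so picking the right cores is scriptable. A small sketch (the PCI address is a placeholder):

```python
# Which NUMA node is this PCI device attached to, and which CPUs are local?
BDF = "0000:65:00.0"  # hypothetical PCI address of a NIC/GPU

with open(f"/sys/bus/pci/devices/{BDF}/numa_node") as f:
    node = int(f.read())
if node < 0:
    print("device reports no NUMA affinity")
else:
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        print(f"device on node {node}, local CPUs: {f.read().strip()}")
```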
Isn’t this particular issue partially solved with proper NUMA support in whatever kernel or scheduler is being used?
The Supernova audio server (https://github.com/supercollider/supercollider/tree/develop/...) pins each thread of its DSP thread pool to a dedicated core.
When implementing one-thread-per-core software architectures, explicit pinning is pretty much a requirement.
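The mechanics are simple on Linux; a minimal sketch of the pattern, where each worker pins itself before entering its loop:

```python
# One-thread-per-core skeleton: each worker binds itself to its own core.
import os
import threading

def worker(core: int):
    # pid 0 = the calling thread, so this pins only this worker
    os.sched_setaffinity(0, {core})
    # ... per-core processing loop would run here ...

threads = [threading.Thread(target=worker, args=(c,)) for c in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```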
Much cheaper than CPU cgroups if you want some coarse-grained isolation when stacking workloads.
CPU pinning can be particularly important if you're running virtual machines and/or hyperthreading-friendly workloads
Glad you mentioned hyperthreading. That can be easy to overlook. You reserved CPU 1 for a given workload? Did you remember CPU 49 as well?
The main point of HT is to reduce the cost of context switching by keeping twice the number of contexts close to the core. I would guess that parts of the process context like program counter, TLB, etc live inside the 'HT' and would have to be saved/restored every time the process moves between threads, even on the same core. Reserving both 'HT' on a core gets you cache locality, but isn't there a cost to moving the process back and forth, even if that data is in L1/L2?
(I'm looking at 'lstopo' from package 'hwloc', Linux on my Haswell Xeon: 10MB shared L3, 256KB L2, 32KB L1{d,i} per core)
Given my (educated) guess, I've told irqbalance to put interrupts only on 'thread 0' and then I schedule cpu-intensive tasks to 'thread 1' and schedule them very-not-nicely. Linux seems pretty good about keeping everything else on 'thread 0' when I have 'thread 1' busy so I don't do any further management.
I can have 4 cores 'thread 1' pegged at 100% with no impact on interactive or I/O performance.
In the context of the article, if you are trying to keep foreign processes "off my cores", then you can't neglect to keep them off the adjacent hyperthreads, because those share some of the resources. If you have 8 threads on 4 cores, then, at least the way Linux counts them, cores 0 and 4 share some caches and all backend execution resources. So if you have isolated core 0 but not core 4, you might as well have not done anything at all.
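The sibling layout isn't guessable in general (it varies by vendor and kernel), but sysfs will tell you exactly which logical CPUs share a core; a quick sketch:

```python
# Map each logical CPU to the hyperthread siblings it shares a core with.
import glob

for path in sorted(glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")):
    cpu = path.split("/")[5]                # e.g. "cpu0"
    with open(path) as f:
        print(cpu, "->", f.read().strip())  # e.g. "0,48" or "0-1"
```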
This makes sense in general, because the caches are the most precious resource.
However, in my case the working set is small enough and the processes are top-priority so they probably stay in the L2 if not the L1. Also ... I want to keep using my desktop so I don't mind the intrusion of my interactive processes.
Hmm. Is there a way to check how much L1/L2/L3 a process is occupying?
> _in my case the working set is small enough and the processes are top-priority so they probably stay in the L2 if not the L1._
Maybe! Maybe not. If it's top priority on core X but something else with a much better (or cache-unfriendly) dataset is on the hyperthread-sibling core then your high priority process can still have cache misses.
No, but it is possible on certain top-end Intel SKUs to partition the last-level caches such that they are effectively reserved to certain processes.
pqos?
Even RDT isn't going to give you insight into L1 occupancy.
True; however, CPU pinning is not the same as reserving/isolating the CPU. This is often not made clear in articles about CPU pinning.
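Right: sched_setaffinity only constrains your own tasks, and keeping everything else off those cores takes isolcpus, cset, or a cpuset partition. A sketch of the cgroup-v2 route, assuming cgroup2 is mounted at /sys/fs/cgroup, the cpuset controller is enabled for the subtree, and you're running as root:

```python
# Carve CPUs 2-3 into an exclusive cpuset partition and join it.
import os

cg = "/sys/fs/cgroup/pinned"       # hypothetical cgroup name
os.makedirs(cg, exist_ok=True)
with open(f"{cg}/cpuset.cpus", "w") as f:
    f.write("2-3")
with open(f"{cg}/cpuset.cpus.partition", "w") as f:
    f.write("root")                # make the CPUs exclusive to this cgroup
with open(f"{cg}/cgroup.procs", "w") as f:
    f.write(str(os.getpid()))      # move this process onto them
```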
This class looks great. I noticed the course page states:
"This class overlaps significantly with CS392 ``Systems Programming'' -- if you have taken this class, please talk to me in person before trying to register for CS631."[1]
Does anyone know if the videos for CS392 might also be online? I tried some basic URL substitutions, but came up empty.
[1]