(written by taggart, mistakes are his, please send corrections, work in progress, etc)
On multiprocessor systems you want to take advantage of all processors. However, moving a process from one CPU to another is often a net loss: the data that process is using is already in that CPU's cache, so moving it to another CPU requires loading the data again, which is usually way more expensive than any benefit gained. So Linux's default behavior is for processes to stay affined to the particular CPU they started on (rather than bounce around).
BUT interrupt processing doesn't use much data, so spreading interrupts across CPUs doesn't cause many cache misses, and it might be a win to spread them out across CPUs. BUT there is another problem…
For a network controller processing packets, state about the connection (like TCP objects, etc.) is already loaded in the cache of one CPU. So if interrupts for the NIC move between CPUs, that can be inefficient for the same reasons listed above. (more details)
There is a solution: with MSI-X, modern NICs can have multiple queues per NIC, with each queue getting its own interrupt. Packets are assigned to queues according to a hash based on the pair of IP addresses/ports communicating, so all packets for a particular connection will always go to the same queue/interrupt. This allows us to map interrupts to particular CPUs and spread out the load without cache misses. (more details)
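To illustrate the idea with a toy sketch (real NICs implement a Toeplitz hash in hardware; the flow values here are made up): hashing the address/port 4-tuple picks a queue, so every packet of a given connection lands on the same queue, and therefore the same interrupt and CPU.

```shell
# Toy illustration of queue selection (NOT the real Toeplitz hash):
# hash the connection 4-tuple, take it modulo the number of queues.
flow="10.0.0.1:44231 -> 192.0.2.7:443"   # hypothetical connection
nqueues=4
hash=$(printf '%s' "$flow" | cksum | cut -d' ' -f1)
echo "queue $((hash % nqueues))"
# The same flow always hashes to the same value, so its packets always
# hit the same queue/IRQ (and, with smp_affinity set, the same CPU).
```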
By default in Linux on most systems, all hardware interrupts end up getting serviced by CPU0. This can result in bottlenecks and reduced performance. The solution: on x86 you have an APIC that can divide interrupt processing among the cores, using SMP affinity to assign interrupts to different cores.
In Linux you can write a bitmask to /proc/irq/*/smp_affinity to control which CPUs a given interrupt will go to. The default is all bits set (some number of "f"s depending on how many CPUs you have), which should spread interrupts across all CPUs. However…
Systems APIC’s are setup by the BIOS and can operate in a couple different modes; “physical” where all interrupts get assigned to CPU0 by default and stay that way until you adjust smp_affinity, and “logical” where IRQs can “round-robin”. So the behavior you get is dependent on the chipset capabilities and what the system vendor decided to set in the BIOS.
You can determine which mode your IO-APIC is in by looking at what is reported in dmesg at boot. Unfortunately the messages appear to be chipset-driver specific, but here are some examples of what to look for:
Enabling APIC mode:  Flat.  Using 5 I/O APICs
Setting APIC routing to flat
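One way to hunt for these lines (the grep pattern is a guess that matches the two examples above; your chipset driver may phrase its messages differently):

```shell
# Search boot messages for APIC mode/routing lines. Shown against a canned
# sample here; on a live system just run: dmesg | grep -iE 'apic (mode|routing)'
sample='Enabling APIC mode:  Flat.  Using 5 I/O APICs
Setting APIC routing to flat'
printf '%s\n' "$sample" | grep -iE 'apic (mode|routing)'
```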
FIXME: but what does this mean?
Here are some articles 1, 2 that explain the problem and how to use SMP affinity.
Here are the bitmasks to select just a single CPU for the first 8 CPUs (the mask for CPU n is 1 << n, written in hex):

CPU0: 01
CPU1: 02
CPU2: 04
CPU3: 08
CPU4: 10
CPU5: 20
CPU6: 40
CPU7: 80
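For example, to pin a NIC interrupt to CPU2 (the IRQ number 28 here is hypothetical; check /proc/interrupts for your device's actual IRQ):

```shell
# See which CPUs may currently service IRQ 28 (may not exist on your box)
cat /proc/irq/28/smp_affinity 2>/dev/null || true

# Compute the mask for CPU 2: 1 << 2 = 4 (binary 100)
cpu=2
mask=$(printf '%x' $((1 << cpu)))
echo "$mask"    # prints: 4

# Apply it (needs root):
# echo "$mask" > /proc/irq/28/smp_affinity
```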
The ideal solution would be for each machine's admin to look at its hardware and workload and balance the interrupt processing for that machine by hand. This is what people doing benchmarking and deploying highly tuned systems do. But most of us don't have time to hand-tune each hardware/workload configuration, so…
irqbalance is a program designed to automatically balance interrupts for you. From the upstream site:
Irqbalance is the Linux utility tasked with making sure that interrupts from your hardware devices are handled in as efficient a manner as possible (meaning every cpu handles some of the interrupt work), while at the same time making sure that no single cpu is tasked with a inappropriately large amount of this work. Irqbalance constantly analyzes the amount of work interrupts require on your system and balances interrupt handling across all of your systems cpus in a fair manner, keeping system performance more predictable and constant. This should all happen transparently to the user. If irqbalance performs its job right, nobody will ever notice it's there or want to turn it off.
Sounds good, right? Well, the problem was that for a while it was buggy and earned a poor reputation. Upstream fixed many of the problems found in the 0.5x releases, and the 1.x releases are supposed to be much improved. But if you search for information on irqbalance you will find lots of old blog entries talking about problems. Don't let that scare you, but also keep in mind that like all things magic, it doesn't always do the right thing. So where you can, test without irqbalance and with a 1.0.x or newer release, watch /proc/interrupts and /proc/irq/*/smp_affinity, and use something like munin to graph interrupts and CPU usage.
lkml thread on IRQ routing
An old thread on tuning network performance on multiple e1000 NICs.
blog article on how MSI-X helps solve this interrupt problem.
blog post saying irqbalance isn’t needed on modern AMD SMP cpus since they have dedicated L2 cache(?). But it also cites version 0.56, so if there is any real claim here, maybe 1.0.x addresses it?
article on installing irqbalance on Xen dom0’s
performance tuning for 10g NICs
Fedora bug explaining that running on systems with <= 2 cpus or 1 cache domain doesn’t make sense.
RHEL bug explaining the IRQBALANCE_BANNED_CPUS and IRQBALANCE_BANNED_INTERRUPTS options
The original 0.56 and older irqbalance upstream page.
ubuntu wiki page on Rescheduling Interrupts, mostly written from a laptop power savings perspective but still interesting.
FIXME: need to explain how to use, including one-shot, ongoing, performance vs power savings, etc.
# cat /proc/irq/*/smp_affinity
f
f
...
f
# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         64          0          0          0   IO-APIC-edge      timer
  1:          3          0          0          0   IO-APIC-edge      i8042
  4:        391          0          0          0   IO-APIC-edge      serial
  6:          3          0          0          0   IO-APIC-edge      floppy
  7:          1          0          0          0   IO-APIC-edge      parport0
  8:          0          0          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 14:          0          0          0          0   IO-APIC-edge      ata_piix
 15:        109          0          0          0   IO-APIC-edge      ata_piix
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 24:       5322          0          0          0   IO-APIC-fasteoi   sata_mv
 28:       3123          0          0          0   IO-APIC-fasteoi   eth0
 29:        309          0          0          0   IO-APIC-fasteoi   eth1
 76:         15          0          0          0   IO-APIC-fasteoi   aic79xx
 77:         15          0          0          0   IO-APIC-fasteoi   aic79xx
NMI:          1          1          0          2   Non-maskable interrupts
LOC:      11214      17272      11515      20843   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          1          1          0          2   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RES:       5213       6451       6676       9397   Rescheduling interrupts
CAL:        398       3671        329        357   Function call interrupts
TLB:        536        700       1205       1114   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:          4          4          4          4   Machine check polls
ERR:          0
MIS:          0
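To see at a glance how the device interrupts are spread (in the output above, everything lands on CPU0), you can total up each CPU column. A quick awk sketch:

```shell
# Sum the per-CPU columns of /proc/interrupts for the numbered (device)
# IRQ rows, skipping the NMI:/LOC:/etc. summary rows at the bottom.
awk 'NR == 1 { n = NF; next }                  # header row names the CPUs
     $1 ~ /^[0-9]+:$/ {
         for (i = 2; i <= n + 1; i++) tot[i-1] += $i
     }
     END { for (i = 1; i <= n; i++) printf "CPU%d: %d\n", i-1, tot[i] }' \
    /proc/interrupts
```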
squeeze only had 0.56-1 at release. I backported 1.0.3-1 and uploaded it to squeeze-backports. I also backported 1.0.6-2 and uploaded it to squeeze-backports-sloppy and wheezy-backports.
So far 1.0.6-2 seems to do more balancing than 1.0.3 (at least on willet; need to try on something with more IRQs).
FIXME: need to provide an example of a system where installing irqbalance fixes things, with before/after /proc/interrupts and munin graphs.