(written by taggart, mistakes are his, please send corrections, work in progress, etc)

Problem

On multiprocessor systems you want to take advantage of all processors. However, moving a process from one CPU to another is often a net loss: the data the process is using is already in the first CPU's cache, and loading it into another CPU's cache usually costs more than any benefit gained from the move. So Linux's default behavior is for processes to stay affined to the CPU they started on rather than bounce around.
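
As an aside, you can see and override a process's CPU affinity with taskset. A quick illustration (PID 1234 is made up):

# taskset -p 1234                 # show the affinity mask of PID 1234
pid 1234's current affinity mask: f
# taskset -p 1 1234               # restrict it to just CPU 0
pid 1234's current affinity mask: f
pid 1234's new affinity mask: 1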

BUT interrupt processing doesn't use much data, so spreading interrupts across CPUs doesn't cause many cache misses, and it might be a win to spread them out. BUT there is another problem…

For a network controller processing packets, state about the connection (TCP objects, etc.) is already loaded in the cache of one CPU. So if the NIC's interrupts move between CPUs, that is inefficient for the same reasons listed above. (more details)

There is a solution: with MSI-X, modern NICs can have multiple queues per NIC, each queue getting its own interrupt. Packets are assigned to queues by a hash of the communicating IP address/port pair, so all packets for a particular connection always go to the same queue/interrupt. That lets us map interrupts to particular CPUs and spread out the load without cache misses. (more details)
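
You can check whether your NIC is doing this by looking for per-queue interrupts in /proc/interrupts. A made-up example of roughly what a multiqueue NIC looks like (the queue naming varies by driver, and the IRQ numbers and counts here are invented):

# grep eth0 /proc/interrupts
  50:     123456          0          0          0   PCI-MSI-edge      eth0-TxRx-0
  51:          0     234567          0          0   PCI-MSI-edge      eth0-TxRx-1
  52:          0          0     345678          0   PCI-MSI-edge      eth0-TxRx-2
  53:          0          0          0     456789   PCI-MSI-edge      eth0-TxRx-3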

Details

By default in Linux on most systems, all hardware interrupts end up getting serviced by CPU 0. This can result in bottlenecks and reduced performance. On x86 the fix is the APIC, which can divide interrupt processing among the cores by assigning each interrupt an SMP affinity.

In Linux you can write a bitmask to /proc/irq/*/smp_affinity to control which CPUs each interrupt may be delivered to. The default is all bits set (some number of "f"s, depending on how many CPUs you have), which should spread interrupts across all CPUs.
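
For example, to read and then change the mask for a single IRQ (IRQ 28 is eth0 on the machine in the "Before irqbalance" listing below; pick one from your own /proc/interrupts):

# cat /proc/irq/28/smp_affinity        # current mask, all 4 CPUs allowed
f
# echo 2 > /proc/irq/28/smp_affinity   # only deliver IRQ 28 to CPU 1
# cat /proc/irq/28/smp_affinity
2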

However, the system's APIC is set up by the BIOS and can operate in a couple of different modes: "physical", where all interrupts get assigned to CPU0 by default and stay that way until you adjust smp_affinity, and "logical", where IRQs can "round-robin". So the behavior you actually get depends on the chipset's capabilities and what the system vendor decided to set in the BIOS.

You can determine which mode your IO-APIC is in by looking at what is reported in dmesg at boot. Unfortunately the exact messages appear to be chipset-driver specific, but here are some examples of what to look for:

Enabling APIC mode:  Flat.  Using 5 I/O APICs

Setting APIC routing to flat
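
To hunt for lines like these (assuming the boot messages are still in the kernel ring buffer; otherwise try /var/log/dmesg):

# dmesg | grep -i apic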

FIXME: but what does this mean?

Here are some articles (1, 2) that explain the problem and how to use SMP affinity.

bitmasks

Here are the bitmasks to select just a single CPU, for each of the first 8 CPUs. The mask is hex, with bit n selecting CPU n.

CPU Mask
0   01
1   02
2   04
3   08
4   10
5   20
6   40
7   80
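
So for example, pinning eth0's IRQ (28 in the listing below) to CPU 2 uses the 04 mask. Newer kernels also provide smp_affinity_list, which takes a plain CPU list and saves you the hex math (assuming your kernel has it):

# echo 04 > /proc/irq/28/smp_affinity        # bitmask: CPU 2 only
# echo 2 > /proc/irq/28/smp_affinity_list    # same thing, as a CPU list
# cat /proc/irq/28/smp_affinity_list
2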

Solution

The ideal solution would be to look at each machine's hardware and workload and balance its interrupt processing by hand. This is what people doing benchmarking and deploying highly tuned systems do. But most of us don't have time to hand-tune each hardware/workload configuration, so…

irqbalance

irqbalance is a program designed to automatically balance interrupts for you. From the upstream site:

Irqbalance is the Linux utility tasked with making sure that interrupts from your hardware devices are handled in as efficient a manner as possible (meaning every cpu handles some of the interrupt work), while at the same time making sure that no single cpu is tasked with a inappropriately large amount of this work. Irqbalance constantly analyzes the amount of work interrupts require on your system and balances interrupt handling across all of your systems cpus in a fair manner, keeping system performance more predictable and constant. This should all happen transparently to the user. If irqbalance performs its job right, nobody will ever notice it's there or want to turn it off.

Sounds good, right? Well, the problem is that for a while it was buggy and earned a poor reputation. Upstream fixed many of the problems found in the 0.5x releases, and the 1.x releases are supposed to be much improved. But if you search for information on irqbalance you will find lots of old blog entries talking about problems. Don't let that scare you, but keep in mind that like all things magic, it doesn't always do the right thing. So where you can, test with and without irqbalance (using a 1.0.x or newer release), watch /proc/interrupts and /proc/irq/*/smp_affinity, and use something like munin to graph interrupts and CPU usage.
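
A simple way to keep an eye on the interrupt counts (adjust the pattern to match your hardware):

# watch -n1 'grep -E "CPU|eth" /proc/interrupts'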

Links

SMP-affinity.txt
IRQ-affinity.txt
lkml thread on IRQ routing
RHEL’s documentation.
An old thread on tuning network performance on multiple e1000 NICs.
blog article on how MSI-X helps solve this interrupt problem.
blog post saying irqbalance isn’t needed on modern AMD SMP cpus since they have dedicated L2 cache(?). But it also cites version 0.56, so if there is any real claim here, maybe 1.0.x addresses it?
article on installing irqbalance on Xen dom0’s
performance tuning for 10g NICs
Fedora bug explaining that running on systems with <= 2 cpus or 1 cache domain doesn’t make sense.
RHEL bug explaining the IRQBALANCE_BANNED_CPUS and IRQBALANCE_BANNED_INTERRUPTS options
The original 0.56 and older irqbalance upstream page.
ubuntu wiki page on Rescheduling Interrupts, mostly written from a laptop power savings perspective but still interesting.

Using

FIXME: need to explain how to use, including one-shot, ongoing, performance vs power savings, etc.
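
As a rough starting point until then (check irqbalance(1) on your system, the flags vary a bit between versions): recent versions have a one-shot mode that balances once and exits, and a debug mode that stays in the foreground and explains its decisions.

# irqbalance --oneshot   # balance once at startup and exit
# irqbalance --debug     # run in the foreground and narrate what it does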

Before irqbalance

# cat /proc/irq/*/smp_affinity
f
f
...
f
# cat /proc/interrupts 
            CPU0       CPU1       CPU2       CPU3       
   0:         64          0          0          0   IO-APIC-edge      timer
   1:          3          0          0          0   IO-APIC-edge      i8042
   4:        391          0          0          0   IO-APIC-edge      serial
   6:          3          0          0          0   IO-APIC-edge      floppy
   7:          1          0          0          0   IO-APIC-edge      parport0
   8:          0          0          0          0   IO-APIC-edge      rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  14:          0          0          0          0   IO-APIC-edge      ata_piix
  15:        109          0          0          0   IO-APIC-edge      ata_piix
  16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
  18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
  24:       5322          0          0          0   IO-APIC-fasteoi   sata_mv
  28:       3123          0          0          0   IO-APIC-fasteoi   eth0
  29:        309          0          0          0   IO-APIC-fasteoi   eth1
  76:         15          0          0          0   IO-APIC-fasteoi   aic79xx
  77:         15          0          0          0   IO-APIC-fasteoi   aic79xx
 NMI:          1          1          0          2   Non-maskable interrupts
 LOC:      11214      17272      11515      20843   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          1          1          0          2   Performance monitoring interrupts
 IWI:          0          0          0          0   IRQ work interrupts
 RES:       5213       6451       6676       9397   Rescheduling interrupts
 CAL:        398       3671        329        357   Function call interrupts
 TLB:        536        700       1205       1114   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          4          4          4          4   Machine check polls
 ERR:          0
 MIS:          0

Debian

squeeze only had 0.56-1 at release. I backported 1.0.3-1 and uploaded it to squeeze-backports. I also backported 1.0.6-2 and uploaded it to squeeze-backports-sloppy and wheezy-backports.

So far 1.0.6-2 seems to do more balancing than 1.0.3 (at least on willet; need to try it on something with more IRQs).
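
To pull a backport in (wheezy works the same way, assuming backports is in your sources.list):

# apt-get -t squeeze-backports install irqbalance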

Data

FIXME: need to provide an example of a system where installing irqbalance fixes things, with before/after /proc/interrupts and munin graphs.