The “swap insanity” problem, in brief
When running MySQL on a large system (e.g., 64GB RAM and dual quad core CPUs) with a large InnoDB buffer pool (e.g., 48GB), over time, Linux decides to swap out potentially large amounts of memory, despite appearing1 to be under no real memory pressure. Monitoring reveals that at no time is the system in actual need of more memory than it has available; and memory isn’t leaking, mysqld‘s RSS is normal and stable.
Normally a tiny bit of swap usage could be OK (we’re really concerned about activity—swaps in and out), but in many cases, “real” useful memory is being swapped: primarily parts of InnoDB’s buffer pool. When it’s needed once again, a big performance hit is taken to swap it back in, causing random delays in random queries. This can cause overall unpredictable performance on production systems, and often once swapping starts, the system may enter a performance death-spiral.
While not every system, and not every workload experiences this problem, it’s common enough that it’s well known, and for those that know it well it can be a major headache.
The history of “swap insanity”
Over the past two to four years, there has been an off-and-on discussion about Linux swapping and MySQL, often titled “swap insanity” (I think coined by Kevin Burton). I have followed it closely, but I haven’t contributed much because I didn’t have anything new to add. The major contributors to the discussion over the past years have been:
- Kevin Burton — Discussion of swappiness and MySQL on Linux.
- Kevin Burton — Proposed IO_DIRECT as a solution (doesn’t work) and discussed memlock (may help, but not a full solution).
- Peter Zaitsev — Discussed swappiness, memlock, and fielded a lot of discussion in the comments.
- Don MacAskill — Proposed an innovative (albeit hacky) solution using swap on ramdisk, and a lot more interesting discussion in the comments.
- Dathan Pattishall — Describes how Linux behavior can be even worse with swap disabled, and proposes using swapoff to clear it, but no real solution.
- Rik van Riel on the LKML — A few answers and proposal of the Split-LRU patch.
- Kevin Burton — Discussion of Linux Split-LRU patch with some success.
- Mark Callaghan — Discussion of vmstat and monitoring things, and a recap of a few possible solutions.
- Kevin Burton — More discussion that Linux Split-LRU is essential.
- Kevin Burton — Choosing the middle road by enabling swap, but with a small amount of space, and giving up the battle.
- Peter Zaitsev — More discussion about why swapping is bad, but no solution.
Despite all the discussion, not much has changed. There are some hacky solutions to get MySQL to stop swapping, but nothing definite. I’ve known these solutions and hacks now for a while, but the core question was never really settled: “Why does this happen?” and it’s never sat well with me. I was recently tasked with trying to sort this mess out once and for all, so I’ve now done quite a bit of research and testing related to the problem. I’ve learned a lot, and decided a big blog entry might be the best way to share it. Enjoy.
There was a lot of discussion and some work went into adding the relatively new swappiness tunable a few years ago, and I think that may have solved some of the original problems, but at around the same time the basic architecture of the machine changed to NUMA, which I think introduced some new problems, with the very same symptoms, masking the original fix.
Contrasting the SMP/UMA and NUMA architectures
The SMP/UMA architecture
The SMP, or UMA architecture, simplified
When the PC world first got multiple processors, they were all arranged with equal access to all of the memory in the system. This is called Symmetric Multi-processing (SMP), or sometimes Uniform Memory Architecture (UMA, especially in contrast to NUMA). In the past few years this architecture has been largely phased out between physical socketed processors, but is still alive and well today within a single processor with multiple cores: all cores have equal access to the memory bank.
The NUMA architecture
The NUMA architecture, simplified
The new architecture for multiple processors, starting with AMD’s Opteron and Intel’s Nehalem2 processors (we’ll call these “modern PC CPUs”), is a Non-Uniform Memory Access (NUMA) architecture, or more correctly Cache-Coherent NUMA (ccNUMA). In this architecture, each processor has a “local” bank of memory, to which it has much closer (lower latency) access. The whole system may still operate as one unit, and all memory is basically accessible from everywhere, but at a potentially higher latency and lower performance.
Fundamentally, some memory locations (“local” ones) are faster, that is, cost less to access, than other locations (“remote” ones attached to other processors). For a more detailed discussion of NUMA implementation and its support in Linux, see Ulrich Drepper’s article on LWN.net.
How Linux handles a NUMA system
Linux automatically understands when it’s running on a NUMA architecture system and does a few things:
- Enumerates the hardware to understand the physical layout.
- Divides the processors (not cores) into “nodes”. With modern PC processors, this means one node per physical processor, regardless of the number of cores present.
- Attaches each memory module in the system to the node for the processor it is local to.
- Collects cost information about inter-node communication (“distance” between nodes).
You can see how Linux enumerated your system’s NUMA layout using the numactl --hardware command:
# numactl --hardware available: 2 nodes (0-1) node 0 size: 32276 MB node 0 free: 26856 MB node 1 size: 32320 MB node 1 free: 26897 MB node distances: node 0 1 0: 10 21 1: 21 10
This tells you a few important things:
- The number of nodes, and their node numbers — In this case there are two nodes numbered “0” and “1”.
- The amount of memory available within each node — This machine has 64GB of memory total, and two physical (quad core) CPUs, so it has 32GB in each node. Note that the sizes aren’t exactly half of 64GB, and aren’t exactly equal, due to some memory being stolen from each node for whatever internal purposes the kernel has in mind.
- The “distance” between nodes — This is a representation of the cost of accessing memory located in (for example) Node 0 from Node 1. In this case, Linux claims a distance of “10” for local memory and “21” for non-local memory.
How NUMA changes things for Linux
Technically, as long as everything runs just fine, there’s no reason that being UMA or NUMA should change how things work at the OS level. However, if you’re to get the best possible performance (and indeed in some cases with extreme performance differences for non-local NUMA access, any performance at all) some additional work has to be done, directly dealing with the internals of NUMA. Linux does the following things which might be unexpected if you think of CPUs and memory as black boxes:
- Each process and thread inherits, from its parent, a NUMA policy. The inherited policy can be modified on a per-thread basis, and it defines the CPUs and even individual cores the process is allowed to be scheduled on, where it should be allocated memory from, and how strict to be about those two decisions.
- Each thread is initially allocated a “preferred” node to run on. The thread can be run elsewhere (if policy allows), but the scheduler attempts to ensure that it is always run on the preferred node.
- Memory allocated for the process is allocated on a particular node, by default “current”, which means the same node as the thread is preferred to run on. On UMA/SMP architectures all memory was treated equally, and had the same cost, but now the system has to think a bit about where it comes from, because accessing non-local memory has implications on performance and may cause cache coherency delays.
- Memory allocations made on one node will not be moved to another node, regardless of system needs. Once memory is allocated on a node, it will stay there.
The NUMA policy of any process can be changed, with broad-reaching effects, very simply using numactl as a wrapper for the program. With a bit of additional work, it can be fine-tuned in detail by linking in libnuma and writing some code yourself to manage the policy. Some interesting things that can be done simply with the numactl wrapper are:
- Allocate memory with a particular policy:
- locally on the “current” node — using --localalloc, and also the default mode
- preferably on a particular node, but elsewhere if necessary — using --preferred=node
- always on a particular node or set of nodes — using --membind=nodes
- interleaved, that is, spread evenly round-robin across all or a set of nodes — using --interleaved=all or --interleaved=nodes
- Run the program on a particular node or set of nodes, in this case that means physical CPUs (--cpunodebind=nodes) or on a particular core or set of cores (--physcpubind=cpus).
What NUMA means for MySQL and InnoDB
InnoDB, and really, nearly all database servers (such as Oracle), present an atypical workload (from the point of view of the majority of installations) to Linux: a single large multi-threaded process which consumes nearly all of the system’s memory and should be expected to consume as much of the rest of the system resources as possible.
In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward. The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node. Additionally, since many different queries will be running at the same time, on both processors, neither individual processor necessarily has preferential access to any particular part of memory needed by a particular query.
It turns out that this seems to matter in one very important way. Using /proc/pid/numa_maps we can see all of the allocations made by mysqld, and some interesting information about them. If you look for a really big number in the anon=size, you can pretty easily find the buffer pool (which will consume more than 51GB of memory for the 48GB that it has been configured to use) [line-wrapped for clarity]:
2aaaaad3e000 default anon=13240527 dirty=13223315 swapcache=3440324 active=13202235 N0=7865429 N1=5375098
The fields being shown here are:
- 2aaaaad3e000 — The virtual address of the memory region. Ignore this other than the fact that it’s a unique ID for this piece of memory.
- default — The NUMA policy in use for this region.
- anon=number — The number of anonymous pages mapped.
- dirty=number — The number of pages that are dirty because they have been modified. Generally memory allocated only within a single process is always going to be used, and thus dirty, but if a process forks it may have many copy-on-write pages mapped that are not dirty.
- swapcache=number — The number of pages swapped out but unmodified since they were swapped out, and thus they are ready to be freed if needed, but are still in memory at the moment.
- active=number — The number of pages on the “active list”; if this field is shown, some memory is inactive (anon minus active) which means it may be paged out by the swapper soon.
- N0=number and N1=number — The number of pages allocated on Node 0 and Node 1, respectively.
The entire numa_maps can be quickly summarized by the a simple script numa-maps-summary.pl, which I’ve written while analyzing this problem:
N0 : 7983584 ( 30.45 GB) N1 : 5440464 ( 20.75 GB) active : 13406601 ( 51.14 GB) anon : 13422697 ( 51.20 GB) dirty : 13407242 ( 51.14 GB) mapmax : 977 ( 0.00 GB) mapped : 1377 ( 0.01 GB) swapcache : 3619780 ( 13.81 GB)
An couple of interesting and somewhat unexpected things pop out to me:
- The sheer imbalance in how much memory is allocated in Node 0 versus Node 1. This is actually absolutely normal per the default policy. Using the default NUMA policy, memory was preferentially allocated in Node 0, but Node 1 was used as a last resort.
- The sheer amount of memory allocated in Node 0. This is absolutely critical — Node 0 is out of free memory! It only contains about 32GB of memory in total, and it has allocated a single large chunk of more than 30GB to InnoDB’s buffer pool. A few other smaller allocations to other processes fin