Kernel dynamic memory analysis

This page has notes and results from the project "Kernel dynamic memory allocation tracking and reduction".

[This page is fairly random at the moment...]

Instrumentation overview

 * Slab_accounting patches
    * uses __builtin_return_address(0) to record the address of the caller, the same mechanism used by kmem events
    * starts from the very first allocation
 * Ftrace kmem events
    * does not start until the ftrace system is initialized, after some allocations have already been performed
    * supported in mainline - no need to add our own instrumentation

These two instrumentation methods are basically the same: each traps every kmalloc, kfree, etc. event and records relevant information about it. The difference is that the first post-processes the events in-kernel and creates a /proc/slab_account file to expose the results. Its output looks more or less like this:

total bytes allocated: 1256052
total bytes requested: 1077112
slack bytes allocated:  178940
number of allocs:         7414
number of frees:          5022
number of callers:         234
 total   slack     req  alloc/free  caller
  2436     232    2204    29/0      bio_kmalloc+0x33
   268       8     260     1/0      pci_alloc_host_bridge+0x1f
    32       8      24     1/0      tracepoint_entry_add_probe.isra.2+0x86
    44       8      36     1/0      cpuid4_cache_sysfs_init+0x30
     0       0       0     0/3      platform_device_add_data+0x33
 [...]
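A report in this shape is easy to consume from a script on the host. The following is only a rough sketch: it assumes the file keeps the "key: value" summary lines and the five-column per-caller rows shown above, and the name parse_slab_account is made up for this example. The sample text is a shortened copy of the output above.

```python
import re

# Shortened copy of the output shown above; the real file layout may differ.
SAMPLE = """\
total bytes allocated: 1256052
total bytes requested: 1077112
slack bytes allocated:  178940
number of allocs:         7414
number of frees:          5022
number of callers:         234
 total   slack     req  alloc/free  caller
  2436     232    2204    29/0      bio_kmalloc+0x33
   268       8     260     1/0      pci_alloc_host_bridge+0x1f
     0       0       0     0/3      platform_device_add_data+0x33
"""

# total, slack, req, allocs/frees, caller
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/(\d+)\s+(\S+)\s*$")


def parse_slab_account(text):
    """Return (summary dict, list of per-caller dicts). Hypothetical helper."""
    summary, rows = {}, []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            total, slack, req, allocs, frees = map(int, m.groups()[:5])
            rows.append({"caller": m.group(6), "total": total, "slack": slack,
                         "req": req, "allocs": allocs, "frees": frees})
        elif ":" in line:
            key, _, value = line.partition(":")
            if value.strip().isdigit():
                summary[key.strip()] = int(value)
    return summary, rows


summary, rows = parse_slab_account(SAMPLE)
slack_pct = 100.0 * summary["slack bytes allocated"] / summary["total bytes allocated"]
print(f"slack: {slack_pct:.1f}% of allocated bytes")
for r in sorted(rows, key=lambda r: r["slack"], reverse=True):
    print(f'{r["slack"]:6d} slack bytes  {r["caller"]}')
```

Sorting by the slack column puts the call sites that waste the most bytes at the top, which is the part of the report we care about most.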

On the other hand, analysing ftrace kmem events defers post-processing to user space, which gives much more flexibility. A typical trace log would be like this:

TODO
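As a rough sketch of what user-space post-processing could look like: the snippet below assumes each event line ends in "event_name: key=value ..." pairs, with the field names call_site, ptr, bytes_req and bytes_alloc used by the kmem events. The sample lines and all their values are invented for illustration and are not a real capture; parse_events and account are hypothetical names.

```python
import re
from collections import defaultdict

# Invented lines for illustration only; not taken from a real trace.
SAMPLE_TRACE = """\
 init-1 [000] 0.100000: kmalloc: call_site=c10b4f1e ptr=f5800000 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL
 init-1 [000] 0.100100: kmalloc: call_site=c10b4f1e ptr=f5800020 bytes_req=20 bytes_alloc=32 gfp_flags=GFP_KERNEL
 init-1 [000] 0.100200: kfree: call_site=c10c0000 ptr=f5800000
"""

EVENT = re.compile(r":\s+(\w+):\s+(.*)$")   # "...: <event>: <fields>"
FIELD = re.compile(r"(\w+)=(\S+)")          # key=value pairs


def parse_events(text):
    """Yield (event name, dict of fields) for every line that looks like an event."""
    for line in text.splitlines():
        m = EVENT.search(line)
        if m:
            yield m.group(1), dict(FIELD.findall(m.group(2)))


def account(events):
    """Accumulate requested and allocated bytes per call site, and count frees."""
    stats = defaultdict(lambda: {"req": 0, "alloc": 0, "allocs": 0})
    frees = 0
    for name, fields in events:
        if "bytes_alloc" in fields:          # any allocation event
            s = stats[fields["call_site"]]
            s["req"] += int(fields["bytes_req"])
            s["alloc"] += int(fields["bytes_alloc"])
            s["allocs"] += 1
        elif name in ("kfree", "kmem_cache_free"):
            frees += 1
    return stats, frees


stats, frees = account(parse_events(SAMPLE_TRACE))
for site, s in stats.items():
    print(site, "slack:", s["alloc"] - s["req"], "allocs:", s["allocs"])
print("frees:", frees)
```

Because the raw events stay available, the same log can be re-analysed later with different groupings, which is the flexibility the in-kernel approach lacks.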

The disadvantage of the ftrace method is that it must be initialized before it can capture events. Currently, this initialization is done at fs_initcall, and we're working on enabling events earlier. For more information, check out this upstreamed patch:

trace: Move trace event enable from fs_initcall to core_initcall

This patch allows events to be enabled at core_initcall. It's also possible to enable them at early_initcall. Another possibility is to create a static ring buffer and then copy the captured events into the real ring buffer.

Also, we must find out if early allocations account for significant memory usage. If not, it may not be that important to capture them. Yet another possibility is to use a printk brute-force approach for very early allocations, and somehow coalesce the data into the final report.

Using debugfs and ftrace
For more information, please refer to the canonical trace documentation in the Linux tree:


 * Documentation/trace/ftrace.txt
 * Documentation/trace/tracepoint-analysis.txt
 * and everything else inside Documentation/trace/

(Actually, some of this information has been copied from there.)

The debug filesystem is a RAM-based filesystem that can be used to output a lot of different debugging information. This filesystem is called debugfs and can be enabled with CONFIG_DEBUG_FS:

Kernel hacking
    [*] Debug filesystem

After you enable this option and boot the built kernel, it creates the directory /sys/kernel/debug as a location for the user to mount the debugfs filesystem. Do this manually:

$ mount -t debugfs none /sys/kernel/debug

To save some typing, you can add a shorter link to it:

$ ln -s /sys/kernel/debug /debug

Once we have enabled debugfs, we need to enable tracing support. This is done with the CONFIG_TRACING option, which adds a /sys/kernel/debug/tracing directory to your mounted debugfs filesystem. Traced events can be read through /sys/kernel/debug/tracing/trace.

To dynamically enable trace events you need to enable CONFIG_FOO. Once it is enabled you can see the available events by listing TODO.

TODO TODO TODO TODO

To enable events at boot, you can add them to the kernel parameters. For instance, to enable kmem events:

trace_event=kmem:kmalloc,kmem:kmem_cache_alloc,kmem:kfree,kmem:kmem_cache_free

Warning: if you use SLOB on a non-NUMA system, where you might expect kmalloc_node not to be called, it is actually the only one called. This is because SLOB implements only kmalloc_node and has kmalloc call it without a node. The same goes for kmem_cache_alloc_node.
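One way to keep reports comparable across allocators is to fold the _node events into their plain counterparts during post-processing, and to include them in the boot parameter so they are captured at all. This is only a sketch of that idea; the helper names canonical_event and boot_parameter are made up for the example.

```python
# Map the *_node events onto their plain counterparts so that a SLOB trace
# and a SLAB/SLUB trace can be aggregated the same way.
NODE_VARIANTS = {
    "kmalloc_node": "kmalloc",
    "kmem_cache_alloc_node": "kmem_cache_alloc",
}

BASE_EVENTS = ["kmalloc", "kmem_cache_alloc", "kfree", "kmem_cache_free"]


def canonical_event(name):
    """Return the event name to account under (hypothetical helper)."""
    return NODE_VARIANTS.get(name, name)


def boot_parameter(include_node_variants=True):
    """Build a trace_event= string, optionally adding the *_node events."""
    events = list(BASE_EVENTS)
    if include_node_variants:
        events += sorted(NODE_VARIANTS)
    return "trace_event=" + ",".join("kmem:" + e for e in events)


print(canonical_event("kmalloc_node"))
print(boot_parameter())
```

With include_node_variants=False the helper reproduces the parameter string given above.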

Obtaining accurate call sites (or The painstaking task of wrestling against gcc)
The compiler inlines a lot automatically and without warning. In this scenario, it's impossible to get the real call site name from the calling address alone.

When a function is inlined, it is collapsed into its caller and won't be listed as a symbol by tools like readelf, objdump, etc.
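To see why a missing symbol hurts, consider how an address is normally turned into a name: look up the nearest symbol that starts at or below it. The sketch below does this with a tiny, entirely hypothetical symbol table (the addresses are invented; the names are only examples). Any code belonging to an inlined function has no entry of its own, so an address inside it resolves to whichever function it was inlined into.

```python
import bisect

# Hypothetical symbol table in the style of System.map: (start address, name).
# A function that was inlined has no row here.
SYMBOLS = sorted([
    (0xc1000000, "start_kernel"),
    (0xc10b4f00, "bio_kmalloc"),
    (0xc10b5000, "bio_alloc_bioset"),
])
ADDRS = [addr for addr, _ in SYMBOLS]


def resolve(addr):
    """Return 'symbol+0xoffset' for the nearest symbol at or below addr."""
    i = bisect.bisect_right(ADDRS, addr) - 1
    if i < 0:
        return hex(addr)          # below the first known symbol
    start, name = SYMBOLS[i]
    return f"{name}+{hex(addr - start)}"


print(resolve(0xc10b4f33))
```

The lookup can only ever report functions that still exist as symbols, which is why the recorded call site is attributed to the outer function once the real caller has been inlined away.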

Does this matter? Well, it matters if you want to obtain an accurate call site report when tracing kernel memory events (which we will see later).

However, there is a solution! You can turn off gcc's automatic inlining with an option in the kernel Makefile. The option is called '-fno-inline-small-functions'. See this patch:

diff --git a/Makefile b/Makefile
index 8e4c0a7..23f1a88 100644
--- a/Makefile
+++ b/Makefile
@@ -363,6 +363,7 @@ KBUILD_CFLAGS  := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
                   -fno-strict-aliasing -fno-common \
                   -Werror-implicit-function-declaration \
                   -Wno-format-security \
+                  -fno-inline-small-functions \
                   -fno-delete-null-pointer-checks
 KBUILD_AFLAGS_KERNEL :=
 KBUILD_CFLAGS_KERNEL :=

Of course, this option makes the kernel a bit smaller and slower, but this is an expected side effect on a debug-only kernel.

We must keep in mind that no matter what internal mechanisms we use to record call_site, if they're based on __builtin_return_address, then their accuracy will depend entirely on gcc *not* inlining automatically.

The emphasis is on the automatic part. There will be lots of functions that we need to have inlined in order to determine the caller correctly. These will be marked as __always_inline.

See upstreamed patch:

Makefile: Add option CONFIG_DISABLE_GCC_AUTOMATIC_INLINING

Reporting

 * extracting data to host
 * tool for extraction (perf?, cat /debugfs/tracing/ ?)
 * post-processing the data
 * grouping allocations (assigning to different subsystems, processes, or functional areas)
 * idea to post-process kmem events and correlate with */built-in.o
 * reporting on wasted bytes
 * reporting on memory fragmentation
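The grouping and wasted-bytes ideas above could be prototyped along these lines. This is a sketch under stated assumptions: the symbol-to-directory table is hand-written and hypothetical (in practice it might be derived from the symbols in each */built-in.o), the rows reuse figures from the slab_account sample earlier on this page, and group_by_area is a made-up name.

```python
from collections import defaultdict

# Hypothetical symbol -> source directory mapping, written by hand here.
SYMBOL_DIR = {
    "bio_kmalloc": "fs",
    "pci_alloc_host_bridge": "drivers/pci",
}

# (caller, bytes requested, bytes allocated), taken from the sample report.
ROWS = [
    ("bio_kmalloc+0x33", 2204, 2436),
    ("pci_alloc_host_bridge+0x1f", 260, 268),
    ("cpuid4_cache_sysfs_init+0x30", 36, 44),
]


def group_by_area(rows, symbol_dir):
    """Sum requested/allocated bytes per directory and derive wasted bytes."""
    areas = defaultdict(lambda: {"req": 0, "alloc": 0})
    for caller, req, alloc in rows:
        symbol = caller.split("+", 1)[0]          # strip the +0xoffset part
        area = symbol_dir.get(symbol, "unknown")  # unmapped symbols stay visible
        areas[area]["req"] += req
        areas[area]["alloc"] += alloc
    return {a: dict(v, wasted=v["alloc"] - v["req"]) for a, v in areas.items()}


for area, v in sorted(group_by_area(ROWS, SYMBOL_DIR).items()):
    print(f'{area:12s} req={v["req"]:6d} alloc={v["alloc"]:6d} wasted={v["wasted"]:5d}')
```

Keeping an explicit "unknown" bucket shows how much of the total is still unattributed, which helps decide whether the mapping is complete enough to trust the per-subsystem numbers.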

Visualization

 * possible use of treemap to visualize the data

Mainline status
[place links to patches, or git commit ids, here]
 * is anything added to mainline via this project?
 * subject: trace: Move trace event enable from fs_initcall to early_initcall
 * https://lkml.org/lkml/2012/8/17/218

Results so far (in random order)

 * There's a lot of fragmentation using the SLAB allocator. [how much?]
 * SLxB accounting is a dead-end (it won't be accepted into mainline)

more???