Kernel dynamic memory analysis

This page has notes and results from the project Kernel dynamic memory allocation tracking and reduction

[This page is fairly random at the moment...]

Instrumentation overview

  • Slab_accounting patches
    • uses __builtin_return_address(0) to record the address of the caller, the same mechanism used by kmem events
    • starts from very first allocation
  • Ftrace kmem events
    • does not start until ftrace system is initialized, after some allocations are already performed
    • supported in mainline - no need to add our own instrumentation

These two instrumentation methods are basically the same: they trap each kmalloc, kfree, etc. event and record the relevant information from it. The difference is that the first one post-processes the events in-kernel and creates a /proc/slab_account file to access the results. Its output looks more or less like this:

total bytes allocated:  1256052
total bytes requested:  1077112
slack bytes allocated:   178940
number of allocs:          7414
number of frees:           5022
number of callers:          234

  total    slack      req alloc/free  caller
   2436      232     2204    29/0     bio_kmalloc+0x33
    268        8      260     1/0     pci_alloc_host_bridge+0x1f
     32        8       24     1/0     tracepoint_entry_add_probe.isra.2+0x86
     44        8       36     1/0     cpuid4_cache_sysfs_init+0x30
      0        0        0     0/3     platform_device_add_data+0x33
[...]

On the other hand, analysing ftrace kmem events defers post-processing to user space, which gives much more flexibility. A typical trace log would be like this:

TODO

The disadvantage of the ftrace method is that it needs to be initialized before it can capture events. Currently, this initialization is done at fs_initcall time and we're working on enabling it earlier. For more information, check out this upstreamed patch:

trace: Move trace event enable from fs_initcall to core_initcall

This patch allows events to be enabled at core_initcall. It's also possible to enable them at early_initcall. Another possibility is to create a static ring buffer and then copy the captured events into the real ring buffer.

Also, we must find out if early allocations account for significant memory usage. If not, it may not be that important to capture them. Yet another possibility is to use a printk brute-force approach for very early allocations, and somehow coalesce the data into the final report.

Using debugfs and ftrace

For more information, please refer to the canonical trace documentation in the Linux tree:

  • Documentation/trace/ftrace.txt
  • Documentation/trace/tracepoint-analysis.txt
  • and everything else inside Documentation/trace/

(Actually, some of this information has been copied from there.)

Debugfs

The debug filesystem is a RAM-based filesystem that can be used to output a lot of different debugging information. This filesystem is called debugfs and can be enabled with CONFIG_DEBUG_FS:

Kernel hacking
     [*] Debug filesystem

After you enable this option and boot the built kernel, it creates the directory /sys/kernel/debug as a location for the user to mount the debugfs filesystem. Do this manually:

$  mount -t debugfs none /sys/kernel/debug

You can add a symlink to save some typing:

$ ln -s /sys/kernel/debug /debug

Tracing

Once we have enabled debugfs, we need to enable tracing support. This is done with the CONFIG_TRACING option, which adds a /sys/kernel/debug/tracing directory to your mounted debugfs filesystem. Traced events can be read through /sys/kernel/debug/tracing/trace.

To dynamically enable trace events you need to enable CONFIG_FOO. Once it is enabled, you can see the available events by listing TODO (see the sketch below).

TODO
TODO
TODO
TODO
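
Assuming tracing and event support are enabled, the usual debugfs interface looks roughly like this (a sketch based on the standard ftrace documentation; paths assume debugfs is mounted at /sys/kernel/debug):

$ cat /sys/kernel/debug/tracing/available_events | grep kmem
$ echo kmem:kmalloc > /sys/kernel/debug/tracing/set_event
$ echo kmem:kfree >> /sys/kernel/debug/tracing/set_event
$ echo 1 > /sys/kernel/debug/tracing/events/kmem/kmem_cache_alloc/enable

The first two writes modify the global event list via set_event; the last one toggles a single event through its own enable file.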

To enable events on bootup you can add them to kernel parameters, for instance to enable kmem events:

trace_event=kmem:kmalloc,kmem:kmem_cache_alloc,kmem:kfree,kmem:kmem_cache_free

Warning: if you use SLOB on non-NUMA systems, where you might expect kmalloc_node not to get called, it is actually the only one called. This is because SLOB implements only kmalloc_node and has kmalloc call it without a node. The same goes for kmem_cache_alloc_node.

Obtaining accurate call sites (or The painstaking task of wrestling against gcc)

The compiler inlines a lot automatically and without warning. In this scenario, it's impossible to get the real call site name based on just the calling address.

When a function is inlined, it gets collapsed into its caller and won't be listed as a symbol by tools like readelf, objdump, etc.

Does this matter? Well, it matters if you want to obtain an accurate call site report when tracing kernel memory events (as we will see later).

However, there is one solution! You can turn off this gcc inlining using an option in the kernel Makefile. The option is called '-fno-inline-small-functions'. See this patch:

diff --git a/Makefile b/Makefile
index 8e4c0a7..23f1a88 100644
--- a/Makefile
+++ b/Makefile
@@ -363,6 +363,7 @@ KBUILD_CFLAGS   := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
                   -fno-strict-aliasing -fno-common \
                   -Werror-implicit-function-declaration \
                   -Wno-format-security \
+                  -fno-inline-small-functions \
                   -fno-delete-null-pointer-checks
 KBUILD_AFLAGS_KERNEL :=
 KBUILD_CFLAGS_KERNEL :=

Of course, this option makes the kernel a bit bigger and slower, but that is an acceptable side-effect for a debug-only kernel.

We must keep in mind that no matter what internal mechanism we use to record call_site, if it is based on __builtin_return_address, then its accuracy will depend entirely on gcc *not* inlining automatically.

The emphasis is on the automatic part. There will be lots of functions that we need to get inlined in order to determine the caller correctly. These will be marked as __always_inline.

See upstreamed patch:

Makefile: Add option CONFIG_DISABLE_GCC_AUTOMATIC_INLINING

Memory accounting

Types of memory

The kernel, being a computer program, can consume memory in two different ways: statically and dynamically.

Static memory can be measured offline, and is therefore accounted for before actually running the kernel, using standard binary inspection utilities (readelf, objdump, size, etc.). We will explore these utilities in detail.

Dynamic memory cannot be measured offline: it's not only necessary to probe a running kernel, but also to enable additional probe code that traces each allocation. Fortunately for us, the Linux kernel has ftrace, a tracing framework that allows tracing general events, and in particular memory allocation events. We will explore this framework in detail.

Static memory

A compiled kernel will allocate static memory to store two kinds of symbols: code symbols (a.k.a text) and data symbols (a.k.a data).

For instance, let's look at this piece of C code:

long curr;
int max = 10;
int foo(int var)
{
    int i = var + max;
    if (i < max)
        curr = i;
    return i;
}

We have three different symbols:

  • curr : a variable (data zero initialized)
  • max  : an initialized variable (data non-zero initialized)
  • foo  : a function (text)

Once this code is running, each of these symbols will need memory for its own storage. However, the zero-initialized variable will not use space in the compiled binary. This is due to a special section inside the binary (called .bss, historically "Block Started by Symbol") where all the zero-initialized variables are placed. Since they carry no information, they need no space. Static variables live as long as the executing program.

On the other hand, the var and i variables are dynamically allocated, since they live on the stack. They are called automatic variables, meaning their life cycle is not under our control.

Note that when we talk about static memory, the word static has nothing to do with the C-language keyword. That keyword controls a symbol's visibility, where static means file-local, as opposed to global.

The size command

The simplest command to get a binary's static size is the aptly named size command. Let's start by seeing it in action:

$ size ./fs/ext2/ext2.o 
  text	   data	    bss	    dec	    hex	filename
 51273	     68	      8	  51349	   c895	./fs/ext2/ext2.o

According to this output, this object file has roughly 50 KB of text and 68 bytes of data. Now, size comes in two flavors: berkeley and sysv. Each of these shows a different output. The default is berkeley, so the previous example was berkeley output.

However, if we use the same command with sysv output format, we'll find very different results:

$ size --format=sysv ./fs/ext2/ext2.o 
./fs/ext2/ext2.o  :
section             size   addr
.text              43005      0
.init.text           138      0
.exit.text            25      0
.rodata             2304      0
.rodata.str1.1      1656      0
.rodata.str1.4      3485      0
.data                 60      0
.exitcall.exit         4      0
.initcall6.init        4      0
.bss                   8      0
.note.GNU-stack        0      0
.comment             675      0
[...]

Here we see a more detailed description of each section's size. Note the appearance of a .rodata (read-only data) section, about 2 KB large. This section is composed of read-only variables (e.g. those marked const), which the standard size format does not report separately.

We can conclude that standard size format gives an incomplete picture of the compiled object.

To add even more confusion to this picture, gcc can decide (at its own discretion) to put symbols not marked as const into the .rodata section. These symbols are never written to, so gcc considers them read-only (pretty smart, huh?). This means you can end up with a .rodata section bigger than you expected.

This seems to happen since gcc 4.7 (to be confirmed).

readelf

These two commands, readelf and objdump, can give us any information we need about a binary. In particular, they can output the complete list of symbols, with detailed information about each one. Let's see an example of readelf on the same file we used for size. The output is stripped for clarity.

$ readelf -s fs/ext2/ext2.o
Symbol table '.symtab' contains 413 entries:
  Num:    Value  Size Type    Bind   Vis      Ndx Name
    0: 00000000     0 NOTYPE  LOCAL  DEFAULT  UND 
    1: 00000000     0 SECTION LOCAL  DEFAULT    1 
[...]
  339: 00004da0   286 FUNC    GLOBAL DEFAULT    1 ext2_evict_inode
  340: 000003e0    76 OBJECT  GLOBAL DEFAULT    7 ext2_nobh_aops
  341: 00000780   128 OBJECT  GLOBAL DEFAULT    7 ext2_symlink_inode_operat
  342: 000008d8    20 OBJECT  GLOBAL DEFAULT    7 ext2_xattr_acl_default_ha
  343: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND generic_file_aio_write
  344: 00000280   128 OBJECT  GLOBAL DEFAULT    7 ext2_file_inode_operation
  345: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND __dquot_alloc_space
  346: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND generic_setxattr
  347: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND unlock_buffer
  348: 000014c0    36 FUNC    GLOBAL DEFAULT    1 ext2_bg_num_gdb
  349: 00005240   684 FUNC    GLOBAL DEFAULT    1 ext2_setattr
[...]

For instance, ext2_nobh_aops is an OBJECT symbol (data) of 76 bytes and ext2_evict_inode is a FUNC symbol (text) of 286 bytes. Notice there are some UND symbols. They are undefined symbols for this file; they are defined elsewhere and are therefore not of interest when inspecting this file's size.

Of course, this output can be combined with grep to get fantastic results. Let's count the number of defined functions:

$ readelf -s fs/ext2/ext2.o | grep -v UND | grep FUNC | wc -l

With a little awk magic we could even sum these sizes and get the size of the .text section. TODO!
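
A rough sketch of that awk magic (assuming readelf's usual column layout, where the third column is the symbol size in bytes):

$ readelf -s fs/ext2/ext2.o | grep -v UND | \
      awk '$4 == "FUNC" { sum += $3 } END { print sum, "bytes of functions" }'

Don't expect this to match the .text size reported by size exactly; alignment padding and section placement account for small differences.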

objdump

TODO
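
In the meantime, objdump can produce a similar symbol listing (a sketch only; sizes are printed in hex and the columns differ from readelf):

$ objdump -t fs/ext2/ext2.o | grep -v '\*UND\*' | grep ' F '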

Dynamic

Dynamic memory in kernel land is a little different from user land.

In user land, all one needs to do to get a chunk of memory is call malloc(). In kernel land, we have a similar function: kmalloc(). But we also have lots of other functions to allocate memory, and we must keep some special considerations in mind.

The first thing it's important to understand is that the kernel obtains memory (well, on most architectures) in fixed-size chunks, each of which we call a 'page' of memory. A page is typically 4096 bytes large, but this depends on the architecture.

In order to deliver smaller pieces of memory, the kernel has a couple of layers that ultimately let you do kmalloc(100) and get 100 bytes. These layers are called the buddy allocator and the slab allocator.

We will focus on the latter. The slab allocator comes in three different flavors: SLAB, SLOB and SLUB. The funny names are historical; the meaning is:

  • SLAB is the traditional allocator
  • SLOB is aimed at tiny embedded systems (e.g. without an MMU)
  • SLUB is the default

Each of these implements the allocation in a different way, but they all share a common property: internal fragmentation.

Internal fragmentation

For different reasons (alignment, overhead, etc.) when we request 100 bytes with kmalloc(100), the slab allocator may really allocate 128 bytes (or 140 bytes; we can't really know). The extra bytes can't be used, so they are wasted. This is called internal fragmentation, and one of the main goals of the slab allocator is to minimize it; in other words, to match the requested size with the truly allocated size as closely as possible.

Accounting with kmem events trace

Ftrace kmem events are a great source of information. By using them you can trace each kmalloc, getting the requested bytes, the allocated bytes, the caller address and the returned pointer. You can also trace kfree, getting the caller address and the freed pointer.

Once you have the caller address you can use the System.map file to get the caller function name. Also, by using the returned pointer and correlating it with kfree traces, you can keep track of the dynamic memory currently used by each kernel function / subsystem.
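
For example, a call_site address can be resolved by hand against System.map, which is sorted by address: take the last symbol at or below the address. A minimal sketch, assuming GNU awk and a made-up address:

$ awk -v target=0xc04f8a33 \
      'strtonum("0x" $1) <= strtonum(target) { sym = $3 } END { print sym }' System.map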

Let's see this in detail.

Enabling and reading kmem trace

We can activate this on boot-up with the kernel parameter trace_event. For instance:

trace_event=kmem:kmalloc,kmem:kmem_cache_alloc,kmem:kfree,kmem:kmem_cache_free

or you can activate it on the fly with:

TODO
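
As a sketch (same debugfs interface as in the Tracing section above), the whole kmem subsystem can also be toggled at run time through its enable file:

$ echo 1 > /sys/kernel/debug/tracing/events/kmem/enable
$ echo 0 > /sys/kernel/debug/tracing/events/kmem/enable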

Once you have enabled events, you can run your favourite program (in order to trigger some allocations). When you're done you may disable events (or not) and read them:

$ cat /sys/kernel/debug/tracing/trace > kmem.log

Let's see a piece of this log:

TODO

This log can be post-processed. For your convenience we have a script trace_analyze.py that does exactly this (see below).
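
As a taste of what such post-processing looks like, here is a rough sketch that sums the total slack (allocated minus requested bytes) over a kmem log; the bytes_req/bytes_alloc field names come from the kmalloc tracepoint format, so adjust them if your kernel prints something different:

$ grep -o 'bytes_req=[0-9]* bytes_alloc=[0-9]*' kmem.log | \
      awk -F'[= ]' '{ slack += $4 - $2 } END { print "total slack bytes:", slack }'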

About kmem trace events

As we have seen there are a few more events than kmalloc and kfree. Let's see them all:

  • kmalloc
  • kfree
  • kmalloc_node
  • kmem_cache_alloc
  • kmem_cache_alloc_node
  • kmem_cache_free

Using the trace_analyze.py post-processing script

Getting the script

You can get the latest version of the script from this project's GitHub repository:

 git clone https://github.com/ezequielgarcia/kmem-probe-framework
 cd kmem-probe-framework
 ./post-process/trace_analyze.py -h

Using trace_analyze.py for static analysis

trace_analyze.py typically needs access to two things: a built kernel tree and an ftrace kmem log.

However, if you lack the latter but can provide a built kernel tree, the script will fall back to 'static' analysis.

Using it

Usage is fairly simple:

 $ ./trace_analyze.py -k /usr/src/linux
 $ ./trace_analyze.py --kernel /usr/src/linux

This should produce a ringchart png file in the current directory, named linux.png.

Of course, you can use absolute or relative paths for that parameter:

 $ ./trace_analyze.py -k ./torvalds

If you're interested in a specific subsystem, you can use a parameter to specify the directory tree branch to take as the root:

 $ ./trace_analyze.py -k ./torvalds -b fs
 $ ./trace_analyze.py -k ./torvalds -b drivers
 $ ./trace_analyze.py -k ./torvalds -b mm

Each of these commands will produce a ringchart png file in the current directory, named after the specified branch (fs.png, drivers.png, mm.png, etc).

Under the hood

The script performs a directory walk, internally creating a tree matching the provided kernel tree. For each object file found (like fs/inode.o) it runs 'readelf --syms' to get the list of symbols contained in it.

Nothing fancy.
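
A rough shell approximation of the same idea (a sketch only; it skips built-in.o files, which would otherwise double-count symbols already present in the per-file objects):

$ find . -name '*.o' -not -name 'built-in.o' | while read obj; do
      readelf --syms "$obj" | grep -v UND | \
          awk -v f="$obj" '$4 ~ /FUNC|OBJECT/ { sum += $3 } END { printf "%8d %s\n", sum, f }'
  done | sort -rn | head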

TODO

  • Account for minor differences between running 'size' and using this script

Using trace_analyze.py for dynamic analysis

  • TODO

Reporting

  • extracting data to host
    • tool for extraction (perf?, cat /debugfs/tracing/<something>?)
  • post-processing the data
    • grouping allocations (assigning to different subsystems, processes, or functional areas)
      • idea to post-process kmem events and correlate with */built-in.o
    • reporting on wasted bytes
    • reporting on memory fragmentation

Visualization

Matplotlib

We will use matplotlib and its Wedge API to create a ring chart (similar to GNOME Baobab).

We'll get something like this (from a very minimal kernel run):

Current dynamic footprint

Current dynamic.png

Static footprint

Static.png

Mainline status

Recommendations for reductions

Results so far (in random order)

  • There's a lot of fragmentation using the SLAB allocator. [how much?]
  • SLxB accounting is a dead-end (it won't be accepted into mainline)

more???

Testing feedback

Here is some feedback from Tim:

2012-10-05 testing

These items are not in any order:

  • Install issue

There's no 'install' mechanism for the tool, and it depends on visualize_mem_tree.py being on the Python path or in the same directory as the trace_analyze.py script. I tried hard-linking trace_analyze to a private 'bin' directory, but got a Python error doing this. Using a symlink worked OK. If visualize_mem_tree.py is not used separately, it might be worth considering integrating everything into a single script. (Or not; I'm not sure.)

  • Extra rings

The chart for my kernel shows some unexpected rings. Specifically, I'm building in /home/tbird/work/mem/build/panda-torvalds. My chart has an inner ring of 'build' (showing 100% of memory), a second ring of 'panda-torvalds' (showing 100% of memory), and then a third ring where directories start to get broken out.

See this chart: Linux-2012-10-05.png

Is this correct?

  • Overlapping annotations

The annotations showing the name and size of the ring segments overlap sometimes, making some of them difficult or impossible to read. See above chart.

  • Ability to show the plot interactively

The tool by default produces a PNG in the current directory. It would be nice to be able to control the output format and location, or to let the user drop directly into matplotlib's interactive viewer (i.e. pylab.show()).

  • Problem with absolute paths

I tried using the tool with an absolute path, and it had problems:

$ trace_analyze.py -k /home/tbird/work/mem/build/panda-torvalds/
No trace log file specified: will report on static size only
Reading symbol map at /home/tbird/work/mem/build/panda-torvalds
Creating tree from compiled symbols at /home/tbird/work/mem/build/panda-torvalds
readelf: Error: 'home/tbird/work/mem/build/panda-torvalds/.tmp_kallsyms1.o': No such file
readelf: Error: 'home/tbird/work/mem/build/panda-torvalds/.tmp_kallsyms2.o': No such file
... [lots more 'No such file' errors]