Kernel Small Stacks

Revision as of 19:02, 29 November 2011 by Tim Bird (Talk | contribs)


Here is some random information about small kernel stack sizes.

The default stack size for a process running in kernel space is 8K (as of 2011).

There used to be an option on x86 to reduce the stack size to 4K, and there were efforts in 2006 to make this the default stack size. However, using a small stack opens up the dangerous possibility that the stack will overflow, causing a kernel hang or silent corruption. The option to support 4K stacks on x86 was eventually removed from the mainline kernel.

Besides wasting memory when the stack space is not really needed, 8K stacks also have an effect on, and are affected by, general kernel memory allocation. Creating an 8K stack requires an order-1 allocation, meaning that 2 contiguous physical pages must be allocated together in order to create a new process stack. If memory has become fragmented, it may be impossible to fulfill an order-1 allocation, even though individual pages of physical memory are still free. Thus 4K stack allocations (order-0 allocations) are more likely to succeed. This is important for systems operating under extreme memory pressure.

Stack layout

The kernel stack is laid out with the stack pointer at the top of each stack (at the highest stack address), growing downward for each function call and stack allocation. The thread_info structure for a process is at the bottom of the stack. There is no physical mechanism to detect, at allocation time, if the stack pointer wanders into the thread_info area of the stack. Hence, if the stack overflows (the stack pointer goes into the thread_info area), the behavior of the system is undefined.

Stack measuring/monitoring mechanisms

Because of previous efforts to conserve stack space, there are actually a few different mechanisms for monitoring the kernel stack usage. Some tools report on the static size of stack usage by kernel functions (a check which is done by either the compiler or a separate tool operating on the kernel binary), and some mechanisms can report on actual stack utilization at runtime.


The kernel source includes a script to perform static stack analysis, called scripts/checkstack.pl.

Usage is as follows:

$(CROSS_COMPILE)objdump -d vmlinux | scripts/checkstack.pl [arch]

Replace [arch] with the architecture of the kernel being analyzed. Several architectures are supported, including arm, mips and x86. You should use a cross-objdump that matches the architecture you compiled the kernel for. For example, if you used arm-linux-gnueabi-gcc as your compiler, you would use arm-linux-gnueabi-objdump as your objdump program. This should be included in your cross-compiler toolchain package.

Below is some sample output from using checkstack.pl. Note that the kernel image is first dumped to an assembly file (.S), which is then piped to checkstack.pl. You can examine the assembly file to see, in detail, the instructions used to reserve space on the stack for the routines that checkstack.pl reports.

An item in brackets is a module name, in the case of a loadable module. The number at the end is the stack depth detected for the function. The leading value is the address of the stack-reservation code.

$ arm-eabi-objdump -d vmlinux > vmlinux-arm.S
$ cat vmlinux-arm.S | scripts/checkstack.pl arm
0x0012c858 nlmclnt_reclaim [vmlinux-arm.o]:             720
0x0025748c do_tcp_getsockopt.clone.11 [vmlinux-arm.o]:  552
0x00258d04 do_tcp_setsockopt.clone.14 [vmlinux-arm.o]:  544
0x000b2db4 do_sys_poll [vmlinux-arm.o]:                 532
0x00138744 semctl_main.clone.7 [vmlinux-arm.o]:         532
0x00138ec4 sys_semtimedop [vmlinux-arm.o]:              484
0x000c5618 default_file_splice_read [vmlinux-arm.o]:    436
0x00251de4 do_ip_setsockopt.clone.22 [vmlinux-arm.o]:   416
0x00191fd4 extract_buf [vmlinux-arm.o]:                 408
0x0019bc24 loop_get_status_old [vmlinux-arm.o]:         396
0x000e6f88 do_task_stat [vmlinux-arm.o]:                380
0x0019b8f0 loop_set_status_old [vmlinux-arm.o]:         380
0x002078f0 snd_ctl_elem_add_user [vmlinux-arm.o]:       376
0x0026267c tcp_make_synack [vmlinux-arm.o]:             372
0x00127be4 nfs_dns_parse [vmlinux-arm.o]:               368
0x000b2240 do_select [vmlinux-arm.o]:                   340
0x001f6f10 mmc_blk_issue_rw_rq [vmlinux-arm.o]:         340
0x001726a0 fb_set_var [vmlinux-arm.o]:                  336
0x000c58d0 __generic_file_splice_read [vmlinux-arm.o]:  316
0x0022a074 dev_seq_printf_stats [vmlinux-arm.o]:        316
0x0006383c tracing_splice_read_pipe [vmlinux-arm.o]:    308
0x000c53c8 vmsplice_to_pipe [vmlinux-arm.o]:            308
0x002512b4 do_ip_getsockopt [vmlinux-arm.o]:            304
0x00225f68 skb_splice_bits [vmlinux-arm.o]:             300 


There is a kernel feature to output the stack usage of each process. This is controlled by the kernel configuration option CONFIG_DEBUG_STACK_USAGE.

To use this at runtime, trigger sysrq 't'. For example:

$ echo t >/proc/sysrq-trigger

A stack dump for each process is shown, along with stack usage information.

DI has a series of patches which implement a stack guard page, and use that to show a backtrace if the process uses more than 4k in its kernel stack.

This does the following:

* at process creation time, fills the stack with zeros (kernel/fork.c)
* on sysrq 't', show free space, from call to stack_not_used() (kernel/sched.c)
  * it shows as 0 otherwise ??
* define check_stack_usage(), which emits printks on each low-water hit
  * low-water appears to be global over all stacks
  * check_stack_usage() is only called on process exit, so you might
  not know about a problem process until very late
* stack_not_used() is defined in include/linux/sched.h.  It counts the number of
zero bytes following the end of thread_info going up.

stack structure:

top     +----------------+
        | return vals    |
        |   & local vars |
        | ...            |
        |                |
        |                |
        | 0's            |
        | thread_info    |
bottom  +----------------+

Here is some sample output:

$ echo t >/proc/sysrq-trigger
$ dmesg | grep -v [[]
  task                PC stack   pid father
init            S 802af8b0   932     1      0 0x00000000
kthreadd        S 802af8b0  2496     2      0 0x00000000
ksoftirqd/0     S 802af8b0  2840     3      2 0x00000000
kworker/0:0     S 802af8b0  2776     4      2 0x00000000
kworker/u:0     S 802af8b0  2548     5      2 0x00000000
migration/0     S 802af8b0  2704     6      2 0x00000000
migration/1     S 802af8b0  2704     7      2 0x00000000
kworker/1:0     S 802af8b0  2560     8      2 0x00000000
ksoftirqd/1     S 802af8b0  3024     9      2 0x00000000
khelper         S 802af8b0  2824    10      2 0x00000000
sync_supers     S 802af8b0  2872    11      2 0x00000000
bdi-default     S 802af8b0  2584    12      2 0x00000000
kblockd         S 802af8b0  2824    13      2 0x00000000
khubd           S 802af8b0  2744    14      2 0x00000000
rpciod          S 802af8b0  3024    15      2 0x00000000
kworker/0:1     S 802af8b0  1240    16      2 0x00000000
kswapd0         S 802af8b0  2848    17      2 0x00000000
fsnotify_mark   S 802af8b0  2632    18      2 0x00000000
nfsiod          S 802af8b0  3024    19      2 0x00000000
kworker/u:1     S 802af8b0  2840    20      2 0x00000000
hoge            S 802af8b0  3024    23      2 0x00000000
kworker/1:1     S 802af8b0  1716    24      2 0x00000000
flush-0:13      S 802af8b0  2528    28      2 0x00000000
telnetd         S 802af8b0  1848    48      1 0x00000000
ash             R running   1264    56      1 0x00000000

Information about mixed stack sizes

Currently, the method of accessing the thread_info structure for a task relies on the kernel stack size being the same for all processes (and being a power of two).

A pointer to thread_info is obtained by masking the current stack pointer with a value dependent on the size of the stack.

A system to support dynamically adjusted different-sized stacks would likely be too complicated to be practical.

However, one can imagine support for mixed stack sizes (some tasks having 8K stacks, while others have 4K stacks), by:

* specifying at process allocation time the stack size
* using a more complicated algorithm at run-time to derive thread_info from the stack pointer

One candidate for a different system for deriving thread_info is to group stack allocations into different pools (possibly with a fixed pool of pre-allocated 8K stacks), and use address comparisons to determine which pool a particular stack resides in, and therefore its size.

Note that this system also requires that the stack allocation code know what size stack to allocate at process creation time.

Kernel functions using large amounts of stack space

Using the scripts/stack_size program, you can analyze the amount of stack space used by each kernel function (for an ARM kernel).


Below are some results for static analysis of function stack depth in the Linux kernel, using 'stack_size'.


The following results include the reduction in size for 'struct poll_wqueues':

{{{
$ ./stack_size vmlinux-arm

====== RESULTS =========
number of functions = 14371
max function stack depth = 736
function with max depth = nlmclnt_reclaim

Function Name                  Stack Depth
=============================  ===========
__generic_file_splice_read     352
do_select                      376
loop_set_status_old            392
snd_ctl_elem_add_user          408
extract_buf                    432
default_file_splice_read       472
sys_semtimedop                 520
semctl_main.clone.7            560
do_sys_poll                    568
nlmclnt_reclaim                736
}}}


{{{
$ ./stack_size vmlinux-x86_64.o

====== RESULTS =========
number of functions = 29587
max function stack depth = 1208
function with max depth = security_load_policy

Function Name                  Stack Depth
=============================  ===========
x86_schedule_events            632
drm_crtc_helper_set_mode       632
sys_semtimedop                 664
do_task_stat                   712
node_read_meminfo              760
default_file_splice_read       792
do_select                      920
nlmclnt_reclaim                936
do_sys_poll                    1048
security_load_policy           1208
}}}

Daily Work Log

* [wiki:ALP/KernelSmallStacks/Tim/DailyLog]


This area has random notes on this topic.