System Size Auto-Reduction

This page has notes and an outline for Tim Bird's Linux Auto-Reduction research.

Talk info
Tim gave a talk on this research at LinuxCon Japan 2013 (May 29 in Tokyo, Japan)

Title
Advanced size optimization of the Linux kernel

Abstract
This presentation will cover recent research by Tim on aggressive size reduction of the Linux kernel. This will include results from using gcc link-time optimization (LTO) with the ARM architecture (using Andi Kleen's out-of-tree patches), as well as results and discussion of other optimization techniques (including whole-system optimization for embedded devices).

This talk is directed at kernel developers interested in reducing the size of their Linux systems (and possible improving their performance in the process). The talk will be highly technical.

Final slides from talk
[[Media:Bird-Kernel-Size-Optimization-LCJ-2013.pdf]]

LTO

 * What is it?
 * what was required to get it to work?
 * Andi Kleen's patch set
 * what do they do?
 * how big are they?
 * mainline status?
 * what is the size gain (see ELC poster)
 * what can be done with it?
 * long-term possibilities for LTO

global constraints

 * overall idea: create constraints external to code, and use for optimization
 * rationale: can't maintain in-tree - too many config items
 * make the application of constraints automatic
 * use existing constraints to generate new constraints
 * constraints can flow between user-space and kernel


 * example: uid=0
 * constraint language
 * application by commenting out references (replace with 0 constant)
 * use compiler to find code references (via error messages)
 * eliminates problem with duplicate names (uid in different structure)
 * constant propagation (by, e.g. LTO) reduces code

syscall elimination

 * scan file system
 * create report of used and unused system calls
 * mark syscalls unused in kernel
 * arch/arm/kernel/calls.S (and arch/arm/kernel/entry-common.S
 * make sure unused syscalls are not __attribute__(externally_visible)
 * technique of asmlinkage_
 * use LTO to eliminate calls
 * results: 50K-90K

ARM stack reduction

 * 4k stacks
 * stack extensions

cold code compression

 * D. Chanet did cold code compression
 * consists of:
 * profiling the kernel
 * marking code regions as cold or frozen
 * replacing them with stubs
 * compressing them
 * At execution time:
 * if a stub is called, it decompresses the code and calls it
 * stub is fixed up to directly call decompressed code in future
 * code is left decompressed forever

cold code compression
Results:
 * MUST see paper for details (it's quite complicated)
 * on 2.4.25 kernel
 * cold code compression resulted in 7% reduction for i386 kernel and
 * 11.7% reduction for ARM kernel

Talk outline
This talk will be presented at LinuxCon Japan 2013:

Title

 * Advanced size optimization of the Linux kernel
 * by Tim Bird, Sony Mobile Communication

Self-Introduction

 * I am Tim Bird
 * Now working at Sony Mobile
 * Researching system size for many years
 * Long background in extremely small systems
 * pre-professional: first program on TRS-80, in basic, 8K ram
 * NetWare Lite - file and print server in 50K (in 1991)

The problem of Bloat

 * Software bloat occurs because systems are built with more software than is really needed for a given task
 * Open Source software meets the needs of thousands of different systems
 * Linux scales from tiny sensors to supercomputers (extreme SMP and high-end clusters)
 * Linux supports many, many features, only some of which are configurable
 * Software must be generalized for many use cases
 * bloat problem is:
 * How to re-specialize the software, eliminating unused features and dead code?

Bloat (cont.)

 * Software gets more generalized over time
 * Can't use strategy of manual tuning (config options)
 * It gets harder and harder to remove things over time
 * About 13,000 config items now (2.6.12 had 4700)
 * You have to be an expert in too many things to reduce the kernel
 * Must rely on automated methods of reduction
 * Should use an additive, rather than subtractive method of building a system
 * ultimate vision: indicate what you want/need, and build up system to support it

Bloat (cont. 2)

 * In desktop or server, virtual memory makes bloat issue less important for user-space programs
 * Only working set of program is loaded - pages are loaded on demand
 * For kernel, all pages are always loaded

Automatic reduction (intro)
The problem with automatic reduction is that "the system" doesn't know what software is needed and what is not. there needs to be a way to tell it about things that are not going to be used.

auto-reduce - story of 8 bytes of bloat
Story of the conditional check in kdb:
 * I found a bug in kdb, when a particular option was using in the configuration file
 * not everyone uses the configuration file
 * not everyone uses the particular option
 * bug only triggered in those circumstances
 * I wrote a small patch, to guard against use of a variable prematurely
 * problem: all users of KDB now have this check, and suffer this overhead
 * it wasn't much, just a single compare
 * but this is how bloat builds up over time
 * It bothered me because I knew most people didn't need the check
 * "correct" solution would be to parse the config file, and make the code compile-time configurable
 * this adds more complexity than it is worth.

generalizing the problem of bloat
System doesn't know inputs:

will never occur due to fixed use cases.
 * It's very easy to configure the kernel to omit the driver for missing hardware.
 * It's very difficult to configure the kernel to omit error handling for bugs that

An example of fixed input (uid in kernel)

 * throughout the kernel, there are references to uid
 * comparisons, storing, referencing
 * it turns out this is set by setuid, by the 'login' program.
 * login does a lookup and validates user account name in /etc/passwd
 * what if /etc/passwd only has 'root' and no others?
 * setuid could only be called with a value of 0
 * can I encode this constraint on the system.

types of constraints
There are numerous other examples of constraints:
 * kernel command line arguments never used
 * syscalls never called by any program
 * parameters that are never used, or parameter values that are never passed in
 * e.g. ioctl value that is not possible
 * this only works in a fixed
 * /proc values never referenced
 * /sys values never referenced

also goes back up to user-space

 * return values that are not possible

kernel command line args

 * Documented in Documentation/kernel-parameters.txt
 * defined with __setup and early_param from include/linux/init.h
 * approximately 480 __setup routines in kernel
 * about 200 __setup_* in System.map on ARM kernel build (98 __setup_str_*)
 * about 230 early_param routines in kernel
 * points to function
 * almost always sets a variable, which would default to 0

(non-LTO kernel)
 * on ARM, with only console_setup and early_mem marked as 'used', there was a 19K difference in size:

vmlinux.baseline-setup-used =>  vmlinux-param-used baseline  other   change  percent text: 7680084  7663472   -16612   0% data:  362868   360516    -2352   0% bss:  745312   745184     -128   0% total: 8788264  8769172   -19092   0%

(LTO kernel)
 * on ARM, with only console_setup and early_mem marked as 'used', there was a 19K difference in size:

vmlinux.lto-param =>  vmlinux.param-used baseline  other   change  percent text: 1653672  1648920    -4752   0% data:  131636   130244    -1392  -1% bss:   50688    50528     -160   0% total: 1835996  1829692    -6304   0%

System.map from kernel with console_setup and early_mem as only routines marked 'used':

$ grep __setup System.map c00ea4bc T __setup_irq.153323 c00f1adc t __setup_per_zone_wmarks.172539.15755 c019d570 t __setup_str_early_mem.21664.160821 c019d884 t __setup_str_console_setup.61958.160201 c019ef00 t __setup_early_mem.21659.160819 c019ef00 T __setup_start c019ef0c t __setup_console_setup.61953.160195 c019ef18 T __setup_end

/proc values

 * Includes sysctl values
 * there are approximately 1200, NOT related specifically to a process
 * about 120 per process
 * about 80 related to networking (on my desktop box)
 * 40 others

Tiny Distribution

 * poky-tiny distribution (yocto project)
 * see https://wiki.yoctoproject.org/wiki/Poky-Tiny
 * Good for testing and further research

Related Projects

 * AnyKernel and Rumpkernel - see thesis by Antti Kantee - pooka (at) iki (dot) fi
 * https://github.com/rumpkernel/wiki/wiki
 * provides a system based on NetBSD to isolate sub-systems and drivers and allow their use in micro-kernels and user-space
 * haven't read enough of it to determine if it could be applied to Linux, but sounds like just API wrapping
 * I'm not sure how robust it would be in the context of rapid mainline churn

Materials

 * [[File:0001-ARM-LTO-avoid-errors-on-unified-assembly-macros.patch]]
 * [[File:0001-Support-elimination-of-unused-kernel-parameters.patch]]
 * [[File:0001-Add-option-to-omit-unused-syscalls.patch]]