Realtime Preemption

Overview
Realtime Preemption is (as of this writing, 12/21/2004) a patch that tries to improve the realtime performance of the Linux kernel.

Recent patches from Ingo Molnar include a large number of technologies for improving preemption, and for debugging preemption issues, in the Linux kernel.

An overview of the technologies is as follows:
 * voluntary preempt = a set of voluntary preemption points for the kernel, to improve normal scheduling latency (These changes basically
 * BKL change to semaphore
 * latency tracer

Voluntary Preempt
Overview:
 * if it's on at compile time, it can be turned off at runtime with the command line: "voluntary-preemption=0" or "voluntary-preemption=off"
 * Creates a new function might_resched, which is used by might_sleep.
 * might_resched calls cond_resched if voluntary preemption is on.
 * Adds might_sleep in several places.
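
The call chain described above can be sketched in user-space C. This is only an illustration of the control flow, not code from the patch: the `_sim` names, the `need_resched_flag` variable, and the `schedule_calls` counter are hypothetical stand-ins for kernel state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the patch's real variables. */
bool voluntary_preemption = true; /* runtime switch, cf. "voluntary-preemption=0" */
bool need_resched_flag = false;   /* set when another task should run */
int schedule_calls = 0;           /* counts simulated context switches */

/* Simulated scheduler entry: record the switch and clear the request. */
void schedule_sim(void)
{
    schedule_calls++;
    need_resched_flag = false;
}

/* Reschedule only if a reschedule was requested; returns 1 if we yielded. */
int cond_resched_sim(void)
{
    if (need_resched_flag) {
        schedule_sim();
        return 1;
    }
    return 0;
}

/* The new hook: a voluntary preemption point, active only when enabled. */
void might_resched_sim(void)
{
    if (voluntary_preemption)
        cond_resched_sim();
}

/* might_sleep marks code that may block, so it is a safe place to yield. */
void might_sleep_sim(void)
{
    assert(schedule_calls >= 0);
    might_resched_sim();
}
```

Because might_sleep is already placed where blocking is legal, every existing might_sleep call becomes a potential preemption point once it calls might_resched, which is presumably why the patch adds more might_sleep calls rather than a separate set of hooks.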

Conversion of Spinlocks to Mutexes
According to Ingo Molnar, its primary author, "the big change in this release is the addition of PREEMPT_REALTIME, which is a new implementation of a fully preemptible kernel model".

For a brief description of the overall technology, see: http://kerneltrap.org/node/3995

Briefly, the technology makes spinlocks and rwlocks preemptible by default.
 * the patch auto-detects at compile-time the type of lock to use for a spinlock (mutex or original raw_spinlock)
 * it uses a feature of gcc to manage this (reducing patch size)
 * it uses native Linux semaphores for preemption
 * it converts rwlocks to rw-semaphores
 * apparently, about 90 locks are targeted for NON-conversion to preemptibility (that is, they are preserved as RAW_SPINLOCKS)
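
The text above does not name the gcc feature used. One gcc builtin that allows this kind of compile-time dispatch is `__builtin_types_compatible_p`, so the sketch below assumes that mechanism. The lock types and the `_sim` functions are hypothetical; they only show how a single lock macro can select a raw or a preemptible implementation from the declared type of its argument.

```c
#include <assert.h>

/* Hypothetical lock types standing in for the kernel's. */
typedef struct { int locked; } raw_spinlock_sim_t;            /* stays non-preemptible */
typedef struct { int locked; int may_sleep; } rt_mutex_sim_t; /* preemptible lock */

/* Raw variant: in a real kernel this would spin with preemption disabled. */
void raw_lock_sim(raw_spinlock_sim_t *l)
{
    assert(l);
    l->locked = 1;
}

/* Mutex variant: in a real kernel the caller could sleep while waiting. */
void rt_lock_sim(rt_mutex_sim_t *l)
{
    assert(l);
    l->locked = 1;
    l->may_sleep = 1;
}

/*
 * One macro for both lock kinds. The builtin is a compile-time constant,
 * so the compiler keeps only the branch matching the argument's type;
 * the cast in the other branch is never executed.
 */
#define spin_lock_sim(lock)                                               \
    do {                                                                  \
        if (__builtin_types_compatible_p(__typeof__(*(lock)),             \
                                         raw_spinlock_sim_t))             \
            raw_lock_sim((raw_spinlock_sim_t *)(lock));                   \
        else                                                              \
            rt_lock_sim((rt_mutex_sim_t *)(lock));                        \
    } while (0)
```

With this approach, call sites keep using the same lock macro, and only the declaration of a lock decides whether it stays raw. That would explain the "reducing patch size" remark: converting a lock means changing its type, not every place it is taken.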

Ingo mentioned at one time that this was about 20% of the locks in his kernel configuration, implying that there were about 450 spinlocks present in the kernel in his configuration.

Ingo said this about how well this works on uniprocessor (UP) systems versus SMP systems:

...and no matter how well UP works, to fix SMP one has to 'cover' all the necessary locks first before fixing it, which (drastic) increase in raw locks invalidates most of the UP efforts of getting rid of raw locks. That's why i decided to go for SMP primarily - didnt see much point in going for UP.

Normally, in UP the spinlocks are compiled away. When PREEMPT is turned on (without the new patch) these spinlocks are turned into markers for non-preemptible regions. When RT-PREEMPT is used,

people working on/interested in this stuff

 * Ingo Molnar, Red Hat, voluntary preemption, Ingo real-time preemption
 * Sven Dietrich, MontaVista, MV real-time preemption
 * Daniel Walker, MontaVista, priority inheritance??
 * John Cooper, TimeSys, ???
 * Tim Bird, Sony, port to 2.6.10-native, port to PPC
 * Scott Wood, TimeSys, IRQ threading??

people working on related stuff

 * Bill Huey, LynuxWorks??, mmlinux

Comments regarding the scheduling of RT tasks
Ingo said (in this message): note that my -RT patchset includes scheduler changes that implement "global RT scheduling" on SMP systems. Give it a go, it's at:

http://redhat.com/~mingo/realtime-preempt/

you have to enable CONFIG_PREEMPT_RT to activate this feature. I've designed this code to not hurt non-RT scheduling, and i've optimized performance for the 'lightly loaded case' (which is the most common to occur on mainline-using systems).

A very short description of the design: there's a global 'RT overload counter' - which is zero and causes no overhead if there is at most 1 RT task in every runqueue. (i.e. at most 2 RT tasks on a 2-way system, at most 4 RT tasks on a 4-way system, etc.) If the system gets into 'RT overload' mode (e.g. the third RT task gets activated on a 2-way box), then the scheduler starts to balance the RT tasks aggressively. Also, whenever an RT task is preempted on a CPU, or is woken up but cannot preempt a higher-prio RT task on a given CPU, then it's 'pushed' to other CPUs if possible. This design avoids global locking (it avoids a global runqueue), which simplifies things immensely. (I first tried a global runqueue for RT tasks but the complexity impact was much bigger.)

(note that these scheduler changes are reasonably self-contained and do not depend on other parts of PREEMPT_RT, so in theory they could be added to mainline too, after some time - given lots of testing and broad agreement.)
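
The overload counter described in the quote can be modelled with a small user-space sketch. It is a simplification under stated assumptions, not the scheduler code from the patch: all names are hypothetical, the counter is assumed to count runqueues holding more than one RT task, priorities are ignored, and a task is only pushed to a CPU that has no RT task at all.

```c
#include <assert.h>

#define NR_CPUS_SIM 2 /* models the "2-way box" from the quote */

/* Hypothetical per-CPU runqueue: only the RT task count is modelled. */
struct rq_sim {
    int rt_nr_running;
};

struct rq_sim runqueues[NR_CPUS_SIM];
int rt_overload; /* assumed meaning: runqueues with more than one RT task */

/* Add an RT task to a CPU; the counter changes only on the 1 -> 2 step. */
void rt_enqueue(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SIM);
    if (++runqueues[cpu].rt_nr_running == 2)
        rt_overload++;
}

/* Remove an RT task from a CPU; the counter changes only on the 2 -> 1 step. */
void rt_dequeue(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SIM);
    assert(runqueues[cpu].rt_nr_running > 0);
    if (runqueues[cpu].rt_nr_running-- == 2)
        rt_overload--;
}

/*
 * Try to push one waiting RT task away from 'cpu'.
 * Returns the target CPU, or -1 if nothing was moved.
 */
int rt_push(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SIM);
    if (rt_overload == 0)
        return -1; /* common case: one cheap check, no balancing work */
    if (runqueues[cpu].rt_nr_running < 2)
        return -1; /* this CPU has nothing waiting */
    for (int i = 0; i < NR_CPUS_SIM; i++) {
        if (i != cpu && runqueues[i].rt_nr_running == 0) {
            rt_dequeue(cpu);
            rt_enqueue(i);
            return i;
        }
    }
    return -1; /* every other CPU already runs an RT task */
}
```

In this model the lightly loaded case costs a single comparison against zero, and the bookkeeping touches only per-runqueue counts plus one shared counter, which matches the stated goal of avoiding a global runqueue.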

comments regarding the hard parts of this work
Ingo says (at: http://groups-beta.google.com/group/linux.kernel/msg/cf036477d30ab736)

some of the harder stuff:

- the handling of per-CPU data structures (get_cpu_var)

- RCU and softirq data structures

- the handling of the IRQ flag

comments about the number of raw spinlocks needed
Ingo says (at: http://groups-beta.google.com/group/linux.kernel/msg/e63b2860d2e993dd)

Sven Dietrich  wrote:

> IMO the number of raw_spinlocks should be lower, I said teens before.

> Theoretically, it should only need to be around hardware registers and some memory maps and cache code, plus interrupt controller and other SMP-contended hardware.

yeah, fully agreed. Right now the 90 locks i have means roughly 20% of all locking still happens as raw spinlocks.

But, there is a 'correctness' _minimum_ set of spinlocks that _must_ be raw spinlocks - this i tried to map in the -T4 patch. The patch does run on SMP systems for example. (it was developed as an SMP kernel - in fact i never compiled it as UP :-|.) If code has per-CPU or preemption assumptions then there is no choice but to make it a raw spinlock, until those assumptions are fixed.

Rationale
This feature is intended to provide much better realtime scheduling response for a Linux system.

Projects
Various parties are working on ports: TimeSys and MontaVista, in particular, seem to have made ports to PPC and ARM platforms.

Specifications
None that I'm aware of.

Online resources
The original announcement for voluntary-preemption:
 * http://people.redhat.com/mingo/realtime-preempt/older/ANNOUNCE-voluntary

Here are some articles by Jonathan Corbet:
 * http://lwn.net/Articles/106010/
 * http://lwn.net/Articles/107269/
 * http://lwn.net/Articles/108216/
 * http://lwn.net/Articles/129511/

There's a page of links about RT for audio at:
 * http://www.affenbande.org/~tapas/wiki/index.php?Low%20latency%20for%20audio%20work%20on%20linux%202.6.x

A brief introduction to the RT patch (in Japanese only):
 * http://www.atmarkit.co.jp/fembedded/rtos03/rtos03a.html

Patch
See http://redhat.com/~mingo/realtime-preempt/

Utility programs
[other programs, user-space, test, etc. related to this technology]

How To Use

 * apply patch
 * choose desired preemption level
 * compile kernel

Configuration variables
The patch introduces (or modifies) the following configuration variables:


 * retrieved from patch with command:

grep "[+-]config " realtime-preempt-2.6.10-mm1-V0.7.34-01 | sed "s/[+-]config //" | sort | uniq
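
For readers who want the same extraction outside a shell, here is a rough C equivalent of the pipeline above. It is a sketch under one simplifying assumption: the `config ` keyword is expected directly after the leading `+` or `-` of a diff line, whereas the grep pattern matches it anywhere in the line. The function name and the `MAX_NAME` limit are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAME 64 /* hypothetical upper bound on a config name's length */

/* qsort comparator over fixed-size name rows. */
static int cmp_names(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/*
 * Collect names from lines starting with "+config " or "-config ",
 * then sort them and drop duplicates (the "sort | uniq" step).
 * Returns the number of unique names stored in 'names'.
 */
int extract_config_names(const char *patch, char names[][MAX_NAME], int max_names)
{
    int count = 0;

    assert(patch && names);
    for (const char *line = patch; *line; ) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        if ((line[0] == '+' || line[0] == '-') && len > 8 &&
            strncmp(line + 1, "config ", 7) == 0 && count < max_names) {
            size_t n = len - 8; /* text after "+config " */
            if (n >= MAX_NAME)
                n = MAX_NAME - 1;
            memcpy(names[count], line + 8, n);
            names[count][n] = '\0';
            count++;
        }
        line = end ? end + 1 : line + len;
    }

    qsort(names, (size_t)count, MAX_NAME, cmp_names);

    /* Compact adjacent duplicates in place. */
    int unique = 0;
    for (int i = 0; i < count; i++) {
        if (unique == 0 || strcmp(names[unique - 1], names[i]) != 0) {
            if (unique != i)
                strcpy(names[unique], names[i]);
            unique++;
        }
    }
    return unique;
}
```

A caller would read the patch file into a string, pass it in together with a buffer of name rows, and print the first returned-count rows to reproduce the list.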

How to validate
[put references to test plans, scripts, methods, etc. here]
 * use included trace feature, or
 * use included latency overrun reporting mechanism

Related projects
MontaVista released a similar technology, which had the following features:

See http://groups-beta.google.com/group/linux.kernel/msg/7eeef031d9ec1446

These RT enhancements are an integration of features developed by others and some new MontaVista components:

- Voluntary Preemption by Ingo Molnar

- IRQ thread patches by Scott Wood and Ingo Molnar

- BKL mutex patch by Ingo Molnar (with MV extensions)

- PMutex from Germany's Universitaet der Bundeswehr, Munich

- MontaVista mutex abstraction layer replacing spinlocks with mutexes

Sample Results
[Examples of use with measurement of the effects.]

Case Study 1

 * Linux RT Benchmarking Framework
 * http://www.opersys.com/lrtbf/
 * Summary of discussion on LKML (in Japanese)
 * http://japan.linux.com/kernel/05/07/25/2334226.shtml?topic=1
 * http://japan.linux.com/kernel/05/08/29/0817208.shtml?topic=1

Case Study 2
Trevor Woerner published some results in November 2005 from latency measurements he has been recording on the 2.6.14 kernel with Ingo's patches.

See http://geek.vtnet.ca/embedded/LatencyTests/html/index.html

Status
(one of: not started, researched, implemented, measured, documented, accepted) (for each arch, one of: unknown, patches apply, compiles, runs, works, accepted)
 * Status: [not started??]
 * Architecture Support:
   * i386: unknown
   * ARM: unknown
   * PPC: unknown
   * MIPS: unknown
   * SH: unknown

Future Work/Action Items
Here is a list of things that could be worked on for this feature:
 * help with mainlining???
 * perform testing on multiple platforms
 * provide use cases for justification
 * what else?
 * break patch into manageable pieces
 * doesn't Ingo use any kind of patch management system???

people who expressed interest
Manas Saksena, Jon Masters, Takeharu Kato, Ralph Siemsen, Jyunji Kondo