CELF PM Requirements 2006

CELF PM Requirements draft for 2006
Please edit this page and / or send comments to mark.gross@intel.com


Foreword
This is a "wiki-ized" version of a draft requirements document. The goal of this page is to enable easy participation in providing additional input and ideas.

For those items that are not well defined or clear, a call for the PM working group to fill in details will go out on the mailing list. If we cannot get enough substance behind a requirement for me to defend it to the developer community, I will not include it in the formal PDF requirements document. I strongly encourage people to insert comments and ideas into this wiki, especially where there are capabilities or features needed by developers!

The cut off date for closing items getting into the more formal PDF document is April 1.

Introduction
This document presents the requirements for Linux Power Management from the perspective of Consumer Electronics device developers. The requirements range from well defined feature requests to calls for analysis, benchmarking tools and methods. Some items call for investigations that are too preliminary to be considered requirements.

CE products have additional, and different, power management needs as compared to laptop computers. It is important to communicate these additional needs to the OS developer community, and this document defines them so that OS developers can understand them.

Goals for this requirements document

 * Capture the PM issues CE/embedded developers are facing.
 * Provide exposure to the community of these issues.
 * Encourage idea exchange and feature development
 * Derive CELF PM requirements in PDF format.
 * Where possible, define new work with enough specificity that new projects could be started by CELF PMWG members and the community at large.

Introduction to CE Power management
Power management for CE devices is a bit different from power management for desktops, servers or even laptop computers, even though there is an overlap. The differences tend to come from application specific areas.


 * CE devices tend to be application specific. They are designed to do only a few things well, such as playing back video or providing the UI for a PDA or cell phone. They are constrained to running their core applications at all costs, to the point that if the device cannot execute its function, it might as well power off completely.


 * CE devices can have thermal constraints not common to other platforms. Hand held devices cannot allow high skin temperatures; burning the user's hands is not good ergonomics. It should be kept in mind that some computer components, e.g. CPUs, can get hot very quickly but take a long time to cool off.


 * CE devices can have different performance value systems compared to laptops and servers. For instance, launching a non-critical task may incur latencies that would be unacceptable on a laptop or server, and yet not be a problem for a CE application. Conversely, latencies that would be acceptable for a laptop operation could be way outside operational or usability limits for a CE application. It all depends on the application the CE device is implementing.

Partitioning of requirements
The requirements identified to date fall into the following categories:


 * Interface (kernel and user mode)
 * Platform Throttling
 * Process / OS Throttling
 * Low power kernel processing
 * Sleep state support
 * System load prediction
 * Measurement and benchmark
 * New ideas to consider

The following sections will explore each of these categories and provide specific requirements or issues that could be investigated further.

Interface (kernel and user mode)
There is also a huge difference in the control logic for clock/voltage scaling between PCs and CE devices. In PCs the CPU frequency is often scaled independently from the rest of the system. In the embedded world this is rarely the case. The CPU frequency is often synchronous with the bus frequency, and a number of peripherals connected to the bus can derive their frequencies from the bus clock; these may need to be reprogrammed whenever the bus frequency changes, all as part of the CPU frequency change. Besides, there can be several masters on the bus (CPUs, DMA engines, etc.). In such systems scaling a CPU clock has a huge influence on system performance and should be used with care, and such decisions should be based on much more information than simply the idle time on one of the CPUs. Needless to say, apart from CPU frequency scaling, embedded systems often scale the bus frequency to squeeze energy consumption even further. In practice, there are numerous dependencies between different clocks and voltages in the system, which are typically known only to the system designer.

It is important to bring some structure into this chaos by providing a generalizing API/framework. PowerOP from Todd looks like a first attempt in the right direction; however, more needs to come. Sorting out the system dependencies and expressing them in a generalized way is one of the most critical requirements for PM in the CE domain.
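To make the dependency problem concrete, here is a small sketch of a system-designer-supplied operating-point table and a selector that picks the lowest-power point satisfying both the CPU and bus constraints. All names and numbers below are invented for illustration; no such kernel interface existed at the time.

```python
# Sketch of a board-specific operating-point table (hypothetical values).
# Each entry couples CPU frequency, bus frequency and core voltage,
# because on many embedded SoCs they cannot be changed independently.
OPERATING_POINTS = [
    # (name, cpu_khz, bus_khz, core_mv) -- sorted from lowest to highest power
    ("deep-slow",  66_000,  33_000,  900),
    ("slow",      133_000,  66_000, 1000),
    ("nominal",   266_000, 133_000, 1100),
    ("fast",      532_000, 133_000, 1300),
]

def pick_operating_point(min_cpu_khz, min_bus_khz):
    """Return the lowest-power point meeting both CPU and bus constraints.

    Peripherals deriving their clocks from the bus impose min_bus_khz;
    the CPU workload imposes min_cpu_khz.  Returns None if nothing fits.
    """
    for name, cpu_khz, bus_khz, core_mv in OPERATING_POINTS:
        if cpu_khz >= min_cpu_khz and bus_khz >= min_bus_khz:
            return name
    return None
```

For example, a peripheral that needs the bus at 66 MHz forces the "slow" point even when the CPU itself could run slower; this is exactly the kind of designer-only knowledge the paragraph above says must be captured in a generalized framework.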

Application specific PM is another aspect. Here again, the system designer is responsible for estimating the workload and performance requirements that the main applications place on the system components. As this knowledge is not explicitly present in the system, PM falls back on guessing strategies (such as idle time monitoring). Whereas for a PC with many applications heuristic strategies are the only way to go, in the embedded world fine-tuning of applications is normal practice.

Therefore it seems reasonable for applications to explicitly share some information to help the PM subsystem improve its "guessing" accuracy. The requirement to provide interfaces from the PM framework to applications, to capture their performance requirements and/or monitor their real-time activity, is very important too.

(e.g. MPlayer reporting video playback performance not only to the user but also directly to the PM subsystem via a standard interface, so that the system can maintain optimal PM without dropping frames.)
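One way such an interface could look is sketched below: a registry where applications declare the CPU capacity they need, and the PM policy uses the sum as a floor for the operating point instead of guessing from idle-time statistics alone. This is purely illustrative; the class and its units are assumptions, not an existing API.

```python
class PMHintRegistry:
    """Hypothetical registry of application performance hints.

    Applications (e.g. a media player) register the minimum CPU capacity
    they need, expressed here as kHz-equivalents; the PM policy treats
    the sum as a lower bound when choosing an operating point.
    """
    def __init__(self):
        self._hints = {}

    def set_hint(self, app, min_cpu_khz):
        # Replace any previous hint from the same application.
        self._hints[app] = min_cpu_khz

    def clear_hint(self, app):
        # An exiting application withdraws its requirement.
        self._hints.pop(app, None)

    def required_cpu_khz(self):
        # Floor for the PM policy: sum of all outstanding hints.
        return sum(self._hints.values())
```

A video player would raise its hint when it starts dropping frames and clear it on pause, letting the governor scale down aggressively the rest of the time.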

Kernel and user mode APIs independent of ACPI
Most CE platforms do not include ACPI platform firmware interfaces. It is important that efforts be taken to make sure that power management APIs do not explicitly or implicitly assume ACPI support or behavior.

Without this it becomes difficult to reuse power management solutions across platforms and architectures.

Throttling controls
 * more platform throttling APIs (frequency, memory bus speed, IO speed, fan, peripherals)
 * PowerOP?
 * more system / OS throttling APIs
 * more cpufreq governors

More metric APIs
Today we only sample idle time; what other things could be sampled and used as control input for policies?

 * more platform metric APIs
   * fan
   * thermal static
   * thermal rate of change
   * power load
   * battery static
   * battery rate of change

 * more system / OS load metric APIs
   * fork latency
   * average time spent in TASK_UNINTERRUPTIBLE for specific tasks
   * application WFI (Wait For Interrupt) scheduling latency
   * scheduler load
   * lock contention
   * deadline headroom
   * other stuff?

Platform Throttling
Today we have basic CPU throttling. CPUFREQ is an OK framework for this, but we need more, and analogous infrastructure for throttling other parts of the platform. Additionally we only have a handful of basic governors: userspace, maximum performance, minimum performance, and two idle-time-controlled CPU frequency switching policies.

There are not many architecture specific implementations of ACPI-like features for non-ACPI architectures, and implementing the platform capabilities that ACPI provides on CE hardware is non-trivial. It would be good to get more support for proper platform power scaling on such non-ACPI capable architectures and sub-architectures.

CPUFREQ needs more governor options
There should be more governors:
 * RT deadline governor
 * UI responsiveness governor
 * Thermal governor
 * Fan control governor

CPUFREQ extended to core voltage control
Today CPUFREQ is very clock speed centric. For many systems one can control both core frequency and voltage. For some systems changing core voltage is a more expensive operation than changing frequency. To enable effective CPU frequency and core voltage control we need to extend the design of CPUFREQ to include the notion of target voltage.
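The usual hardware constraint behind this requirement is that core voltage must be raised before a frequency increase and lowered only after a frequency decrease. A minimal sketch of the transition ordering such an extended CPUFREQ design would have to enforce (the step names are invented for illustration):

```python
def transition_steps(cur_khz, cur_mv, new_khz, new_mv):
    """Return the ordered hardware steps for a combined freq/voltage change.

    Raising voltage first (when going up) and lowering it last (when
    going down) keeps the core inside its safe operating envelope for
    the whole transition.
    """
    steps = []
    if new_mv > cur_mv:
        steps.append(("set_voltage", new_mv))    # raise voltage first
    if new_khz != cur_khz:
        steps.append(("set_frequency", new_khz))
    if new_mv < cur_mv:
        steps.append(("set_voltage", new_mv))    # lower voltage last
    return steps
```

Because a voltage change can be the more expensive operation, a real implementation would also want to batch or skip voltage steps when consecutive operating points share the same voltage.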

SMP CPU throttling
After you have throttled the hardware back and more than one core is under-utilized, it may be useful to change the idle processing on those cores so that they enter higher latency idle states, and to avoid scheduling tasks or sending interrupts to those cores.

Sometimes you cannot simply shut off CPUs, but you can avoid scheduling work on them and use specific CPU instructions that are higher latency and lower power than the normal idle process.

Memory bus throttling
Need policy architecture for memory bus throttling.

Could CPUFREQ be extended or generalized for this?

Memory speed throttling
Both DPM and the recently introduced powerop code have the concept of an operating point, which can tie bus/memory clock speed to a corresponding CPU clock frequency.

IO bus throttling
Need a policy architecture for IO bus throttling. Could CPUFREQ be extended or generalized for this?

Throttling Arbitrary Power Parameters Outside CPUFREQ
Another proposal, given the provisional name PowerOP, creates a new machine-level API that manages arbitrary hardware power/performance parameters. This API can be used by both cpufreq and by other power management mechanisms that wish to explicitly manage additional parameters, usually for embedded systems. cpufreq would then call the PowerOP layer to effect changes to hardware registers, etc. in response to changes in the "cpu speed" abstraction that cpufreq manages. Embedded power policy stacks, such as DPM, would also call this layer in response to changes in interfaces used by those stacks (usually lower-level abstractions, perhaps directly exposing the hardware registers).

This would allow both cpufreq and other power management software to share the same hardware-specific code, assuming there continues to be a place for both types of PM interfaces, one for desktop/laptop/server systems and one for embedded systems.

The linux-pm community has also discussed adding comprehensive power policy management frameworks that could subsume the functionality that was discussed for PowerOP.

Peripheral throttling
Device power management provides one solution for peripheral power control. This will be useful for user mode governors.

Support for the maximal throttling of powering off the peripheral is also needed.

Need a policy architecture for platform device and peripheral throttling. Could CPUFREQ be extended or generalized for this?

Temperature based throttling
Many portable devices are made for human touch. Policies are needed to keep the device from overheating and to prevent burning the user's hands. Platform vendors will be including temperature sensors in these devices, and the OS and applications will need policies in place to use them.

More policy interfaces than CPUFreq
Yet another CPUFREQ like thing? Or something else? Could CPUFREQ be extended or generalized for this?

This could involve the policy interface that DPM exposes, a new kernel-side policy manager, or an altogether new user-side policy manager. Systems integrators and OEMs have often had to write their own "resource managers" on top of their platform OS. Would it be possible to come up with something similar for embedded systems running Linux? Some of the building blocks already exist in DPM and the software suspend code.

Non-ACPI architecture specific power control
For platforms that cannot afford the overhead of adding ACPI platform firmware / BIOS support for power management, there is a need for architecture and sub-architecture support of more or less equivalent capabilities. For example, throttling the CPU frequency on some platforms without breaking the serial port connection, or without violating constraints of the memory controllers or of interface specifications to parts on the CE platform, can be a significant challenge.

ACPI and platform firmware / BIOS take care of this type of thing for developers, underneath the OS, on most laptop, desktop and server platforms from Intel. However, for CE devices there is very little support available in the arch and sub-arch kernel trees.

Define Platform Power API that needs to be exported for non-ACPI architectures
I need CELF support in defining these. I'm ok with stating a requirement for an API definition but I'm not going to define the API.

Example Implementation of API for ARM sub-Arch
Need to get a platform vendor to put up effort to implement this type of thing.

Example Implementation of API of arch X
There is more than just ARM out there. Need PMWG members to step up and define these guys.

Process / OS Throttling
Sometimes it's not enough to throttle the platform; throttling the work load is the next level of control. Support for process level throttling is needed in order to take power management to the next level and provide systems that will maintain thermal and battery constraints.

The OS should be aware of the trigger points for HW protection, where the HW will shut off power and lose user data. If the platform is approaching one of these triggers, and the OS can throttle itself to avoid loss of data, that would be a good thing.

Rate limit interrupts
Sometimes hardware or drivers have bugs that cause interrupt storms, with interrupt processing consuming 100% of the CPU. When these happen, the system will not get an opportunity to execute any policy logic. For handheld devices such an event could push skin temperatures outside of an acceptable range, triggering a HW shut down. It could also deep-discharge the battery, which damages some battery technologies.

It would be good to have some type of protection in the OS for such things.
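As an illustration of the kind of protection meant here, a simple per-IRQ storm detector could count interrupts in a sampling window and report the line as masked once a threshold is exceeded, giving policy code a chance to run. The class, thresholds and tick-based timing below are all invented for the sketch.

```python
class IrqStormGuard:
    """Hypothetical per-IRQ interrupt storm detector.

    Counts interrupts per sampling window; once the count exceeds the
    threshold the line is flagged as masked so the rest of the system
    (including PM policy logic) can make progress again.
    """
    def __init__(self, max_per_window=1000, window_ticks=100):
        self.max_per_window = max_per_window
        self.window_ticks = window_ticks
        self.count = 0
        self.window_start = 0
        self.masked = False

    def on_interrupt(self, now_tick):
        """Account one interrupt; return True if the line is now masked."""
        if now_tick - self.window_start >= self.window_ticks:
            # New sampling window: reset the counter.
            self.window_start = now_tick
            self.count = 0
        self.count += 1
        if self.count > self.max_per_window:
            self.masked = True
        return self.masked
```

A real implementation would live in the generic IRQ layer and would also need an unmask path (timed re-enable, or driver intervention) so a transient glitch does not permanently disable a device.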

Process controlled based on power states
The idea is to provide a background thread policy that would be kept out of the run state as a function of PM state. For instance, my laptop sometimes runs updatedb cron jobs for me when it's on battery, and it would be better if the updatedb process would just sleep until I re-tether my system to the wall.

I could see this type of thing applied to thermal throttling too. My daughter's laptop gets quite hot when she has some web pages open (flash banner ads burning MIPS). It would be interesting to throttle such threads in the scheduler.

 * Policies for tethered vs. un-tethered operation
 * Policies for thermal control (e.g. keep the fan from starting up by not running some cron job)
 * Scheduler policy classes that schedule tasks based on some TBD policy manager
 * Run levels tied to PM policy
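A minimal sketch of such a policy decision (the task class names and the thermal limit are invented): background tasks like updatedb run only on AC power and below a temperature limit, while critical tasks always run.

```python
def is_runnable(task_class, on_ac_power, skin_temp_c, temp_limit_c=45.0):
    """Decide whether a task of a given class should be scheduled now.

    'critical' tasks always run; 'background' tasks (updatedb, cron
    jobs, ...) are held out of the run state when untethered or when
    the device is hot; everything else is unaffected.
    """
    if task_class == "critical":
        return True
    if task_class == "background":
        return on_ac_power and skin_temp_c < temp_limit_c
    return True  # normal tasks: no PM-based gating
```

The interesting design question is where this check lives: a scheduler policy class, a userspace policy manager sending SIGSTOP/SIGCONT, or run levels tied to PM state would all implement the same predicate.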

Asymmetric Suspend Resume
The ability to suspend a system under one power state, resume it under different conditions, and selectively resume the suspended processes. One user scenario is the following: the system is tethered, running at full power and more or less operating as if it had unlimited battery. The user suspends, takes the device to a coffee shop or airplane, and resumes, this time under battery power. It would be a cool thing to selectively resume only the processes that are required to run when under battery. The selection criteria for resume could include the IP address or MAC of the current DHCP server, and the power state (AC / battery).
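The selection step of the scenario above can be sketched as a pure function over per-process policy tags (the tag name and list layout are invented for illustration):

```python
def processes_to_resume(suspended, on_ac_power):
    """Pick which suspended processes to wake at resume time.

    suspended: list of (process_name, resume_on_battery) pairs.
    On AC power everything resumes; on battery only processes tagged
    resume_on_battery do, implementing the asymmetric resume.
    """
    if on_ac_power:
        return [name for name, _tag in suspended]
    return [name for name, resume_on_battery in suspended if resume_on_battery]
```

A fuller version would key the decision off the other criteria mentioned above (DHCP server MAC, IP address) by passing in a resume-context structure instead of a single boolean.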

SMP process throttling
Some multi-core systems must throttle both cores symmetrically. This means that if your work load is low enough to not need the other core, the most you can do is to simply not send any interrupts to it and to avoid scheduling any tasks to run on it.

Doing this effectively from a PM governor (say a CPUFREQ governor) poses some interesting challenges in coupling scheduler behavior to PM design.

Low power kernel processing
The kernel does a lot of bookkeeping and processing on its own, as a result of design choices around process accounting, scheduler design for CPU bound multi-processing, and other things. This area of PM requirements focuses on reducing the number of instructions the kernel executes over time, as well as on things that happen in the kernel that get in the way of putting the CE platform into a lower power state for longer times.

One thing to keep in mind is that for CE platforms and applications it is sometimes acceptable to violate POSIX.

Low power idle
Many platforms provide hardware support for different types of idle states. It would be good to have standardized ways for extending the type of idle processing dynamically.

Runtime selection of idle states
For ACPI platforms this maps to C states, but on non-ACPI platforms the lower power idle states need to be entered by the OS explicitly.

It would be good to enable some type of policy framework for controlling which platform idle states are entered from idle. (CPUFREQ-like thing?)

Tick-less idle
When in a high latency, lower power idle state, it's not helpful to have the timer interrupt wake up the CPU to do nothing useful.

Variable Scheduling Timeouts (VST)
This is more general than tickless idle, as it goes further and removes the periodic timer tick from the scheduler design altogether.
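The core idea can be sketched as follows: instead of waking every tick, the idle path programs the timer for the earliest pending expiry. This is a simulation of the concept, not the actual VST patch; the cap on sleep length stands in for the hardware timer's maximum programmable interval.

```python
def next_wakeup(now_tick, pending_expiries, max_sleep_ticks=1000):
    """Return how many ticks the CPU may sleep before the next timer fires.

    With a periodic tick this would always be 1; with variable scheduling
    timeouts it is the distance to the earliest pending expiry, capped by
    the longest interval the timer hardware can be programmed for.
    """
    future = [e - now_tick for e in pending_expiries if e > now_tick]
    if not future:
        # Nothing pending: sleep as long as the hardware allows.
        return max_sleep_ticks
    return min(min(future), max_sleep_ticks)
```

Expiries at or before `now_tick` are ignored here; real code would fire those timers immediately before going idle.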

Reducing Tick Overhead
The timer tick processing is getting bloated for CE applications, where each instruction results in power loss. Efforts to minimize the work done by the kernel when processing timer ticks would help a lot for CE and embedded applications.

It would be acceptable to the CE and embedded application developers to sacrifice some POSIX compatibility for this. Of course such an implementation would require compile time switches.

Sleep state support
 * Need both suspend to RAM and suspend to disk
 * Generalized suspend / resume implementations that can be easily extended to suspend to flash
 * Robust suspend / resume operations
 * Need to define quality metrics and test cases for suspend / resume and drive the community toward such goals
 * Video recovery (VGA platforms only / set top box)
 * Better debug logging for suspend / resume success and failure
 * Low latency suspend / resume control
 * Resume takes a long time to the initial resume, and then even longer before the OS stops thrashing about
   * The resume thrash costs battery
   * The resume thrash breaks audio playback for a significant amount of time
 * Suspend / resume benchmark and testing

Need quality and performance benchmarks to be run against suspend / resume implementations on a regular basis.
 * Could we mimic Intel's performance benchmarking effort?

More sleep states
A number of CPUs and ASSPs currently used in mobile devices support different sleep states, although often the software support to exploit these is not available. Typically, a CPU has a low power mode that can be entered when the OS idle loop/idle task runs and that is exited by an interrupt. The latency associated with this "wait for interrupt" (WFI) state is low; no state saving is required and the OS can resume from where it left off. More recent CPUs have included support for other sleep states that are in-between WFI and the "system suspend" state in which power to the CPU is removed and external RAM is used to save/restore state. For example, the ARM11 family introduced a "dormant" power state, in which power is removed from the CPU core but not its cache RAMs, enabling a faster "warm start". TI's OMAP family of ASSPs includes a number of sleep states, where progressively more and more of the device is powered down. ASSPs often include a "System Controller" that manages clocks and power gating, providing a number of clocking regimes, for example:

 * Off - system suspended
 * Crystal oscillator - CPU runs, limited range of peripherals available, (probably) no SDRAM available
 * Main oscillator - CPU, more peripherals, SDRAM available
 * PLL - maximum performance, all system devices available

The system may also permit frequency/voltage scaling within one of these clocking regimes - this is the area that CPUfreq and other proprietary software is aiming to tackle.

A question arising here would be: how does the OS know what peripherals it can use and what can it do in a given clocking regime? Maybe for a mobile device it would be acceptable not to run Linux at all in the crystal oscillator mode but keep this for boot-up and for the times when a minimal amount of CPU activity is required. This could be when the device is on but not in active use - a cell-phone powered on but not in a call, for example.

One possibility could be to define additional power management states, for example PM_SUSPEND_DORMANT, and manage them through the existing Linux power management framework. However, as mentioned above, some clocking regimes might not be sufficient for us to run Linux at all!
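Selecting among such states could follow a simple rule: enter the deepest state whose wakeup latency still fits the expected idle interval. The state names loosely echo the WFI / dormant / suspend progression described above, but the latency and power numbers are invented for the sketch.

```python
# Hypothetical sleep-state table, ordered shallow -> deep:
# (name, wakeup_latency_us, relative_power)
SLEEP_STATES = [
    ("wfi",         10, 100),
    ("dormant",    500,  20),
    ("suspend", 50_000,   1),
]

def deepest_state(expected_idle_us):
    """Pick the lowest-power state whose wakeup latency fits the idle window.

    Walking the shallow-to-deep table and keeping the last state that
    qualifies yields the deepest affordable state; if even WFI's latency
    does not fit, WFI is still the fallback since the CPU must idle somehow.
    """
    best = SLEEP_STATES[0][0]
    for name, latency_us, _power in SLEEP_STATES:
        if latency_us <= expected_idle_us:
            best = name
    return best
```

This is the policy decision a "CPUFREQ-like thing" for idle states (as asked for under "Runtime selection of idle states") would have to make every time the idle loop runs.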

Suspend to Flash
Sony is interested in this, and has done some work in this area. It would give all the benefits of Suspend to RAM with the added advantage of being able to remove power from external RAM too. A project within Samsung has looked at using NAND Flash as a swap device, to make use of existing Suspend to Disk code.

Suspend to RAM
There is support for this in some platforms - for example, Intel PXA and TI OMAP family parts. Power can be removed from the CPU when it is in a sleep mode because its state has previously been saved to external memory (usually SDRAM). The SDRAM itself is in a self-refresh mode. Exit from sleep is via a warm reset.

All suspend mechanisms rely on the Linux power management framework (see kernel/power/pm.c) with architecture- and platform-specific code underneath.

System load prediction
The ability to predict system throughput and latency capability and needs is one of the largest gaps for the implementation of good power management designs today.

Today we only use the kstat data to monitor idle time. Could we do better if we knew:
 * Process contention for locks
 * Size of the task list
 * CPU event counters
 * Number of tasks in the run state
 * Number of tasks waiting on IO
 * Number of niced tasks
 * Number of RT tasks
 * RT processing deadline times
 * Latency from interrupt to process schedule (UI event)
 * Interrupt rates
 * Average time key tasks spend in the TASK_UNINTERRUPTIBLE state

Need more ideas to be discussed.

PM measuring methodology definition
 * How to measure power
   * Understand the trade-offs WRT where power is measured
   * Warn about issues with some test configurations
 * How to measure battery life
   * Understanding of differences in battery technologies

PM benchmark work loads
 * Well defined target platforms
 * Well defined work loads
 * Good PM robustness benchmark
 * Suspend / resume latency benchmark
 * Regular PM benchmark reporting

Formal control theory applied
More formal control theory analysis and design should be applied to PM implementations. Today most PM solutions are heuristic and not designed from a control theory point of view (e.g. simple SISO controllers vs. PID; see http://en.wikipedia.org/wiki/Control_theory).
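To make the point concrete, here is a textbook discrete PID loop driving a frequency correction from a temperature error, in contrast to the threshold heuristics of today's governors. The gains and setpoint are arbitrary illustrations, not tuned values.

```python
class PidThermalGovernor:
    """Discrete PID controller: temperature error -> frequency correction.

    Unlike idle-time-threshold governors, a PID design has gains that can
    be analyzed for stability, overshoot and settling time.
    """
    def __init__(self, kp=1.0, ki=0.1, kd=0.5, setpoint_c=40.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint_c
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp_c, dt=1.0):
        """Return a signed correction (positive = headroom to speed up)."""
        error = self.setpoint - temp_c          # positive when running cool
        self.integral += error * dt             # accumulate steady-state error
        if self.prev_error is None:
            derivative = 0.0                    # no rate info on first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The derivative term matters for the thermal case flagged in the introduction: CPUs heat quickly but cool slowly, so reacting to the rate of temperature change, not just its level, avoids overshooting the skin-temperature limit.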

Continuous controls supported by policy managers
Today the controls are all discrete operating points. Some CE vendors have requested continuous control designs.

Need more creativity on the prediction problem
We don't do a good job of predicting what the system load will be next jiffy; we only measure what it was last jiffy.
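A first step beyond "measure last jiffy" is an exponentially weighted moving average, which smooths the signal and gives a one-step-ahead estimate. This is a classic technique shown only as a strawman; real predictors would want to exploit the richer metrics listed under "System load prediction".

```python
def ewma_predict(samples, alpha=0.5):
    """One-step-ahead load forecast via exponential smoothing.

    samples: per-jiffy utilisation history, oldest first (0.0 .. 1.0).
    alpha:   weight of the newest sample; higher alpha reacts faster
             but passes more noise through to the governor.
    Returns the smoothed value, used as the forecast for the next jiffy.
    """
    if not samples:
        return 0.0
    estimate = samples[0]
    for s in samples[1:]:
        estimate = alpha * s + (1 - alpha) * estimate
    return estimate
```

Even this trivial predictor changes governor behaviour: a single idle jiffy in an otherwise busy stream no longer triggers an immediate frequency drop, reducing the transition thrash that costs both power and latency.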

Notes from the Power Management Summit
[attachment:PMsummit2006.ppt Mark's requirements slides, presented at the Summit]

[attachment:PmSummitFallOut.ppt Mark's action plan from the Summit]