Test Power Control Notes

Here are some notes on power control issues in a board farm.

= Introduction =

Switching a board on or off, or resetting it, can be done at multiple levels:
 1. External power supply
 2. Board ("accessory") power switch (may be toggle or momentary)
 3. Reset button (may be more than one, e.g. master and peripheral reset)
 4. Software reboot: /sbin/reboot
 5. Software power off: /sbin/poweroff

Of course all 5 (or more) may behave differently, and thus should be tested.

Different systems may refer to different items in this list as 'power on', 'booting', 'reset' or 'reboot' (unfortunately, not every system uses the same terminology).

This document primarily concerns itself with the power control API between a test framework and a system that controls the external power supply to a board. It also may refer to the API between the power control system and the power control devices.

= Issues =

multiple inputs
Some boards have multiple external power supply inputs. It is very common for a development board to have a DC jack for wall power, but also to support being powered over USB.

staggered power application
Many boards will use more power during the first few seconds of operation than during their regular operation. This can cause problems for a lab with a large number of machines. If all boards are turned on simultaneously, then there can be a power demand spike which can cause hard-to-trace issues and hard-to-reproduce failures.

To deal with this, many power control devices support staggering the application of power to multiple ports when a command is issued that affects multiple ports.
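As a rough illustration of staggering done in software rather than in the power control device, here is a minimal sketch. The function name, the `switch_on` callback and the 2-second default are all assumptions made for this example; they do not come from any particular PDU or tool.

```python
import time
from typing import Callable, Iterable, List


def staggered_power_on(
    ports: Iterable[int],
    switch_on: Callable[[int], None],
    stagger_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Turn on each port in order, pausing between consecutive ports.

    `switch_on` is a hypothetical callback that applies power to one port.
    `sleep` is injectable so the logic can be exercised without waiting.
    """
    powered = []
    for index, port in enumerate(ports):
        if index > 0:
            # Wait before every port except the first, so the start-up
            # current draw of the boards does not overlap.
            sleep(stagger_seconds)
        switch_on(port)
        powered.append(port)
    return powered
```

A suitable stagger interval depends on how long each board's start-up power spike lasts, which would have to be measured per lab.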

delay during power cycle
The memory of a device might take a few seconds to drain its data. Or, a board might require some time period between removing power and applying power in order to boot correctly. (That is, it needs to stay in the 'off' state long enough to let the hardware settle before it is turned back on again.) For this reason, many power control units allow specifying a delay before or after an operation, to give time for the hardware to respond.
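The off-time requirement can be sketched as a power cycle with an enforced delay. This is only an illustration: `switch_off` and `switch_on` are hypothetical callbacks, and the 5-second default is an assumption, not a value recommended by any hardware manual.

```python
import time
from typing import Callable


def power_cycle(
    switch_off: Callable[[], None],
    switch_on: Callable[[], None],
    off_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Remove power, hold the board off for `off_seconds`, then restore power."""
    if off_seconds < 0:
        raise ValueError("off_seconds must be non-negative")
    switch_off()
    # Keep the board in the 'off' state so memory can drain and the
    # hardware can settle before power is applied again.
    sleep(off_seconds)
    switch_on()
```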

synchronous versus asynchronous operation
Some power control devices may take some time to perform their operations. Some are controlling a large number of devices, and might be busy with other devices when a new request to manipulate the power for a device comes in. Or, they might use network or USB operations to perform a control operation, and this might take some time. So some power control devices allow power control requests to be queued so that they can be acted upon later, while the requesting client can continue with other operations.

Put differently, the interface to the power control system may support either synchronous operations, asynchronous operations, or both.
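One way an interface could offer both styles is sketched below, using a single worker thread as the request queue. The class and method names are made up for this example, and `execute` stands in for whatever actually talks to the power control device.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class PowerController:
    """Hypothetical front end offering synchronous and asynchronous requests."""

    def __init__(self, execute: Callable[[str, str], str]) -> None:
        self._execute = execute
        # A single worker means queued requests run one at a time,
        # in the order in which they were submitted.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def request_async(self, board: str, verb: str) -> Future:
        """Queue the request and return immediately with a Future."""
        return self._pool.submit(self._execute, board, verb)

    def request_sync(self, board: str, verb: str,
                     timeout: Optional[float] = None) -> str:
        """Queue the request and block until it has been carried out."""
        return self.request_async(board, verb).result(timeout=timeout)

    def close(self) -> None:
        """Wait for outstanding requests, then stop the worker."""
        self._pool.shutdown(wait=True)
```

A client that does not care when the operation completes can ignore the returned Future; a client that needs the board to be off before continuing would use the synchronous call.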

board to port mapping
One thing a power control system may manage is the mapping between the power control hardware and its ports, and the device under test (and its power inputs).

So, for example, if a board called 'bbb' is connected to port 5 of a power control unit, the power control system maintains a mapping so that a client can turn power on and off by board name rather than by referencing the port by number.
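A minimal sketch of such a mapping follows. The board name 'bbb' and port 5 come from the example above; the unit name 'pdu1', the `PortAddress` type and the `resolve_board` function are assumptions made for illustration.

```python
from typing import Dict, NamedTuple


class PortAddress(NamedTuple):
    pdu: str   # hypothetical name of the power control unit
    port: int  # port number on that unit


# Example mapping: board 'bbb' is wired to port 5 of a unit called 'pdu1'.
BOARD_PORTS: Dict[str, PortAddress] = {
    "bbb": PortAddress(pdu="pdu1", port=5),
}


def resolve_board(board: str,
                  mapping: Dict[str, PortAddress] = BOARD_PORTS) -> PortAddress:
    """Translate a board name into the unit and port that power it."""
    try:
        return mapping[board]
    except KeyError:
        raise KeyError(f"no power port configured for board {board!r}") from None
```

A board with multiple power inputs would need a list of addresses per board name rather than a single one.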

= survey of verbs and features =

Verbs: on, off, cycle

Tim said: One other option for a name for the operation of turning the power off and on again, might be 'cycle'. I've seen that used in some documentation for PDUs. And it would not conflict with names used elsewhere in the board control stack. So: "power-control minnowboard1 cycle" would be the command for turning the power off and on again, to the board named minnowboard1.
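To make the proposed command form concrete, here is a sketch of a parser for "power-control <board> <verb>". No such tool is defined by these notes; the argument layout simply mirrors the example command above, and the expansion of 'cycle' into 'off' followed by 'on' is an assumption about how it might be implemented.

```python
import argparse
from typing import List

VERBS = ("on", "off", "cycle")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hypothetical 'power-control' command."""
    parser = argparse.ArgumentParser(prog="power-control")
    parser.add_argument("board", help="board name, e.g. minnowboard1")
    parser.add_argument("verb", choices=VERBS, help="power operation to perform")
    return parser


def expand_verb(verb: str) -> List[str]:
    """Expand a verb into the primitive switch operations it implies."""
    return ["off", "on"] if verb == "cycle" else [verb]
```

For example, `build_parser().parse_args(["minnowboard1", "cycle"])` yields a board of "minnowboard1" and a verb of "cycle".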

Geert said: I have several boards where the hardware manual clearly states that external power must only be restored or cut while the power switch is in the off position. Obviously this is violated by all board farms that keep this switch in the on position permanently, and control power on/off and reset by method #1...

= different systems =

pdudaemon
pdudaemon is written in Python, and has modules to interact with a large number of power control devices. "PDU" stands for Power Distribution Unit.

It has a command-line API on the top side, as well as a network API, and it has a Python plugin architecture, where a Python module can provide a new "driver" for the power control system.

ttc
ttc is Sony's board control layer, implemented as a Linux command-line tool.

It is implemented in Python, and uses shell snippets to implement individual functions. It does not maintain a set of known power control snippets (or "drivers"). It is primarily used as a mapping layer so that test programs have a uniform interface to connect to different boards. Command lines are in the form: ttc 

The 'ttc' verbs related to power control are:
 * on - turn power to the board on
 * off - turn power to the board off
 * pos - "power on status" - show the current status of power to the board
 * reset - reset the board (usually means to try a software reboot, or toggling a hardware 'reset' button on the board)
 * reboot - do a full hardware boot of the board (usually means a power cycle of external power)
 * status - show status of a board, including the power status

More details
ttc uses 'reset' to mean a board reset, which sometimes corresponds to the Linux software 'reboot' command with no hardware/power intervention, and sometimes to toggling the hardware 'reset' button (or pin). Note that for some boards the hardware reset button has a different effect than a power cycle would, which is why these operations are separate in ttc. (For example, in some cases it may be possible to retrieve the kernel log messages from memory after a reset but not after a reboot.)

ttc uses 'reboot' to mean a hardware power cycle, with associated re-loading of the kernel and rootfs.

ttc reboot performs a composite operation, which means that 'ttc reboot' does the entire process of power cycle, firmware bootstrap, and kernel boot. It would call out to a power-control layer for portions of the whole reboot operation. As such, it would be a client of the power control operation.

ttc uses the verbs 'on' and 'off' for power control. And 'ttc status' is used for status of more than just power control. 'ttc status' reports 4 things:
 * power status
 * network status (is device pingable?)
 * operational status (can a command be executed on the board?)
 * reservation status (does a test or a user have the board reserved, and for how long?)

See https://elinux.org/Ttc_Program_Usage_Guide (this is somewhat dated, unfortunately)

The power status reported by 'ttc status' (and what it expects from subsidiary helper scripts and apps) is a single word from the set ('ON', 'OFF', 'UNKNOWN') (exactly as spelled, and in all uppercase).
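A client consuming that status word might validate it along these lines. This is a sketch and not ttc's own code: the function name is hypothetical, and stripping surrounding whitespace (such as a trailing newline from a helper script) is an assumption.

```python
VALID_POWER_STATES = ("ON", "OFF", "UNKNOWN")


def parse_power_status(helper_output: str) -> str:
    """Return the power status word, or raise ValueError if it is not valid."""
    # Assumption: helpers may print a trailing newline, so trim whitespace.
    word = helper_output.strip()
    # The match is case-sensitive: 'on' or 'Off' are rejected.
    if word not in VALID_POWER_STATES:
        raise ValueError(f"unexpected power status: {helper_output!r}")
    return word
```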

Fuego
Fuego internally uses the following functions:

 * rootfs_reboot - for a software or distribution-initiated reboot (i.e. the Linux 'reboot' command)
 * board_control_reboot - for a hardware reboot (i.e. a PDU power cycle)

Fuego doesn't use the 'reset' terminology.

= Standards =

pdudaemon was selected (at Automated Testing Summit 2018) as the standard for controlling power to a board in a lab.

The document containing this standard is available at either of the following locations:
 * https://github.com/dave-pigott/pdudaemon/blob/master/share/powercontrolapi.md
 * https://docs.google.com/document/d/1-f2VNVlOnaJUSKUUWeYko3wXh7_ertbFD55y_0dTZvI