Provisioning without hardware support

This page has notes for performing provisioning without hardware support.

= Introduction =
It is desirable to allow people to do testing without requiring that they use extra hardware. The reason for this is that many people will not go to the extra trouble and expense of obtaining hardware, and the requirement will therefore dramatically shrink the pool of willing testers.

There are two main requirements for hardware for automated provisioning (and automated testing):
 * the ability to reboot the board
 * the ability to control menus for provisioning

I conducted some experiments to see whether sufficient software control was available to provide for mostly-automated provisioning. The idea was to use software control of the board for the majority of interactions, and fall back to user (manual) intervention if required.

That is, the experiment was to see whether, as long as the system did not hang or corrupt itself, it could auto-recover sufficiently to return to a "safe" mode, where it could be controlled via software processes only.

I would like to develop a set of criteria for what attributes of a system make this type of operation possible. It is possible that certain features could be designed into the system, to support software-only provisioning and board automation.

Most test labs use a serial connection to the board to control the bootloader, to select the kernel under test, or to provide command line parameters or RAM addresses. However, this requires that the board expose a serial port, and that the user install a serial cable from the management host to the board. Many products either do not expose a serial port at all, or the port is only accessible with great difficulty (for example, requiring soldering to the back of the board after the product is taken apart).

Many test boards load their system software from SD cards. A test system can avoid having to interact with the firmware or product menus on the device under test by using specialized hardware (called an SDMux) to switch access to storage between the target board's SD card slot and a host machine. SDMux hardware solves a large number of problems, but it is also something most end users will not have.

= Note from grub-based provisioning, using a safe/test system =
GRUB is a bootloader commonly used in desktop Linux systems, to load Linux or other operating systems at system boot time. It provides a menu, which is based on a configuration file that is most often auto-generated by scripts on the target Linux system.

== Grub notes ==
Grub boots using materials that are (by convention) in /boot in the root filesystem of the machine being booted. The conventions for the /boot directory are that the kernel image, config file, System.map, and initrd have the version of the kernel as part of their filenames.

The grub menu is in the file /boot/grub/grub.cfg, and it is auto-generated by scripts in /etc/grub.d. One of the main scripts is /etc/grub.d/10_linux, which collects all the image names from /boot and creates menu items for them, based on settings in the config file /etc/default/grub.

The tool used to rebuild the grub.cfg file is called 'update-grub'.

Grub is supposed to be able to read and write information to /boot/grub/grubenv, to allow for control of grub operation. This file has special properties that allow it to be read and written with minimal disruption to the filesystem in which it resides. (Grub accesses it using its own and EFI/BIOS I/O routines.)

Grub uses its ability to read from grubenv to determine if it should boot to alternate images. Specifically, it uses data from grubenv to determine if a previous boot failed, to avoid booting into that same kernel. However, grub must also be able to write to grubenv in order to use this feature, and grub can read more filesystem types than it can currently write to. (Specifically, grub cannot write to a btrfs filesystem.)

== Boot-once testing ==
Boot-once testing is a configuration for provisioning the board that uses two different kernels: a 'safe' kernel and a test kernel. The test kernel is written to flash (or SD card or disk) using the safe kernel, and the user can always boot the board using the safe kernel if something goes wrong with the test kernel (i.e., it fails to boot).

The reason for using this configuration is that it allows for recovering from failed test kernels, without requiring any additional hardware.

Boot-once testing requires that the bootloader be able to tell when it should boot the safe kernel and when it should boot the test kernel. It needs some piece of information to do this. If the bootloader has access to the network, then it could read a value from some external device.

Grub can read a file called /boot/grub/grubenv to give it information for the next boot. Note that this file can be written to by a safe kernel, but not necessarily by a test kernel. Therefore, for the bootloader to be able to boot a kernel only once, the bootloader itself must be able to write its boot state (whether it succeeded or not) into the filesystem. Grub cannot write to grubenv in a btrfs filesystem, but it can in an ext4 filesystem.

The tool 'grub-set-default' was used to write "Test kernel" as the default boot kernel.

I always try to maintain a "safe" kernel as the 0th entry in the grub menu, as that is what grub falls back to as its default if grubenv doesn't specify a default, or if the previous default fails to boot (i.e., the test kernel fails to boot).
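One way to check that the safe kernel really is entry 0 is to list the top-level entries of the generated menu along with their indices. The following is only a minimal sketch: the function name list_menu_entries is made up for illustration, and it assumes that top-level "menuentry" and "submenu" lines in grub.cfg start at column 0 while entries nested inside a submenu are indented.

```shell
#!/bin/sh
# Hypothetical helper: print the top-level grub menu entries with their
# zero-based index. Assumes top-level "menuentry"/"submenu" lines are not
# indented, and that entries nested inside a submenu are indented.
list_menu_entries() {
    cfg=${1:-/boot/grub/grub.cfg}
    sed -n -E "s/^(menuentry|submenu) +['\"]([^'\"]*)['\"].*/\2/p" "$cfg" |
        awk '{ printf "%d: %s\n", NR - 1, $0 }'
}
```

Running list_menu_entries after update-grub and looking at the line for index 0 shows which entry grub would fall back to.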

The system I tried to use for doing automated provisioning consisted of 3 main parts:
 * image preparation and placement
 * grub menu rebuild
 * reboot logic

To provision the system, the host first boots the system into a "safe" kernel. (This step may be skipped to save time, if the test kernel appears to be functioning correctly and able to handle network traffic and filesystem operations).

I added a script to grub, called /etc/grub.d/50_test, that adds an entry for a test kernel.

The script adds a new menu entry called "Test kernel" to the grub menu, which expects the following files to be present on the system:
 * test-vmlinuz
 * test-initrd.img
 * test.dtb
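The actual 50_test script is not reproduced on this page. As a rough sketch of what such a script could look like: each executable in /etc/grub.d prints menu text on stdout, and that output ends up in the generated grub.cfg. The root device, the kernel arguments, the TEST_ROOT_DEV variable and the function name below are all assumptions for illustration, and the devicetree command is only available in grub builds for platforms that use device trees.

```shell
#!/bin/sh
# Hypothetical sketch of /etc/grub.d/50_test (not the author's actual script).
# grub-mkconfig runs each executable in /etc/grub.d and places whatever it
# prints on stdout into the generated grub.cfg.
emit_test_entry() {
    # Assumed value: adjust the root device for the real board.
    root_dev=${TEST_ROOT_DEV:-/dev/mmcblk0p2}
    cat <<EOF
menuentry "Test kernel" {
    linux /boot/test-vmlinuz root=$root_dev ro
    initrd /boot/test-initrd.img
    devicetree /boot/test.dtb
}
EOF
}

emit_test_entry
```

If /boot lives on its own partition, the paths grub sees are relative to that partition, so the /boot prefix in the entry would likely need to change.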

The provisioning system places the kernel images into the /boot directory, with the indicated names (prefixed by 'test').
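The placement step could be sketched as follows. This is an illustration only: the function name place_test_images is hypothetical, and the three source paths are whatever the build produced (no particular source file names are assumed).

```shell
#!/bin/sh
# Hypothetical helper: copy freshly built images into the boot directory
# under the fixed "test" names that the "Test kernel" menu entry expects.
# Arguments: kernel image, initrd, dtb, and (optionally) the boot directory.
place_test_images() {
    kernel=$1 initrd=$2 dtb=$3
    boot_dir=${4:-/boot}
    cp "$kernel" "$boot_dir/test-vmlinuz" &&
    cp "$initrd" "$boot_dir/test-initrd.img" &&
    cp "$dtb" "$boot_dir/test.dtb" &&
    sync   # flush the copies to storage before any reboot
}
```

Because the destination names never change, the grub menu does not need to be rebuilt each time a new test kernel is placed.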

We can't use the conventional filename for the test kernel (with the kernel version number), because the grub menu update process would find the test kernel and put it in the list of other detected kernels, and it might put the test kernel first. Grub boots the first kernel in the list by default, and we always want the default to be a "safe" kernel (one that we know will work on the device).

When the host prepares to reboot the system, it calls (on the target board):
    $ grub-editenv /boot/grub/grubenv set next_entry="Test kernel"
to set the next boot to be to the test kernel.

Then the host calls (on the target board) 'reboot', to cause a software reboot of the system.
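The two steps above can be wrapped in one small function on the target. This is a sketch, not the code used in the prototype: the function name boot_entry_once and the DRY_RUN variable are invented here, and DRY_RUN exists only so the commands can be inspected without actually rebooting.

```shell
#!/bin/sh
# Hypothetical wrapper: ask grub to boot the named entry on the next boot
# only, then reboot. Set DRY_RUN=1 to print the commands instead of
# running them.
boot_entry_once() {
    entry=$1
    grubenv=${2:-/boot/grub/grubenv}
    ${DRY_RUN:+echo} grub-editenv "$grubenv" set "next_entry=$entry" &&
    ${DRY_RUN:+echo} reboot
}
```

For example, boot_entry_once "Test kernel" would request a single boot into the test kernel. The reboot is only issued if grub-editenv succeeds, so a failed write to grubenv does not silently reboot into the default kernel.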

When grub boots, it reads 'next_entry' from grubenv, then clears the value for 'next_entry' in grubenv, and then boots the requested kernel.
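To confirm that grub really cleared 'next_entry', the value can be inspected after the board comes back up. Running grub-editenv /boot/grub/grubenv list prints the stored variables. The sketch below reads the file directly instead, relying on the environment block being plain text with one name=value pair per line and '#' padding; the function name grubenv_get is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one variable stored in grubenv,
# or nothing if it is not set. Relies on the environment block being plain
# text with one name=value pair per line, padded with '#' characters.
grubenv_get() {
    name=$1
    grubenv=${2:-/boot/grub/grubenv}
    sed -n "s/^$name=//p" "$grubenv"
}
```

If grubenv_get next_entry still prints "Test kernel" after a reboot, grub was probably unable to write to the filesystem holding grubenv.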

If no other modifications are made to the grub menus or grubenv, the test kernel will boot only once.

If the kernel hangs, then the user may have to manually reboot the machine, but this should not require any user interaction with the grub menus. At most, the user should only have to cycle the power or press a reset button.

Every time the test automation system wants to boot into the test kernel, it uses grub-editenv to set "next_entry" in grubenv.

Note that for the prototype, these operations were put into the ttc configuration file, under separate target blocks for the same hardware machine: "pot1" and "pot1-test", respectively.

So, to reboot to "safe" mode, one executes:
    $ ttc pot1 reboot
and to reboot to the test kernel, one executes:
    $ ttc pot1-test reboot

More details about this particular system and its configuration are at: http://fuegotest.org/wiki/Provisioning_notes_-_potato_board

== Grub Issues (on potato) ==
The potato board SD card image that I used (one with Ubuntu 18.04) was partitioned with a VFAT partition (for EFI data), a BTRFS partition (with 3 sub-volumes), and a swap partition.

Since grub could not write to a file on the BTRFS filesystem, I had to repartition the drive. I made a small partition that was formatted as an 'ext4' filesystem, to hold the /boot directory for the system.
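A provisioning script could check for this situation before relying on boot-once mode. The sketch below classifies only the two filesystem types discussed on this page and reports anything else as unknown rather than guessing; the function name grubenv_fs_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check: can grub write grubenv on this filesystem
# type? Only the two types discussed on this page are classified; anything
# else is reported as "unknown" rather than guessed.
grubenv_fs_status() {
    case $1 in
        ext4)  echo writable ;;
        btrfs) echo read-only ;;
        *)     echo unknown ;;
    esac
}
```

On a target that has the util-linux findmnt tool, the filesystem type could be obtained with something like grubenv_fs_status "$(findmnt -n -o FSTYPE -T /boot/grub)".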

= Criteria for hardware-less updates (using a boot-once system) =
 * Firmware can support boot-once mode only if it has the following capabilities:
   * Ability to write a bit to a persistent location, before starting the system software (software under test)
     * In the case of the potato board, using grub, it was easiest to write the data needed to /boot/grub/grubenv
     * However, I had to repartition the root filesystem on the SD card so that the /boot directory would be on a grub-writable filesystem.
   * Ability to perform a software reboot of the system
     * Some boards do not correctly reboot when a reboot is requested.
 * Automation can be greatly enhanced if there is a hardware watchdog feature that can reboot the board if the software under test hangs
   * This would alleviate the need for user intervention in the case of a kernel hang
   * The watchdog feature would need to be controllable from the bootloader, so that it could be started before the software to be tested