The Server Side of Over-the-Air Updates

Episode 50: Better Built By Burkhard

Feb 19, 2024

Dear Reader,

In the last four weeks, I got the OTA update of the rootfs image working reliably. My customer's developers use the OTA update on 7 devices. By now, I can also perform offline updates from a USB drive.

I spent a lot more time integrating the main customer application and my update library into the Yocto build than implementing the update functionality itself. Even after years of working with Yocto, it is still a black hole for my time 😮‍💨

At least, I learned a couple of things like exporting and importing CMake modules through Yocto recipes. If you want to or have to understand this as well, don’t worry. I am planning to write a blog post or two about it.

Happy reading,
Burkhard

The Server Side of Over-the-Air Updates

Context

In the last episode, I looked at the client side of OTA updates: the double-copy (A/B) and single-copy-with-rescue update strategies, updating the boot loader, offline updates, and automatic or interactive updates. In this episode, I’ll introduce you to the server side of OTA updates.

You can build a rootfs image - e.g., in ext4 format - and a boot loader image - e.g., for u-boot - for your embedded Linux system with Yocto or any other build system. The build generates an update archive that contains a configuration file, the rootfs image and the boot loader image. The configuration file specifies the versions of the images, the partition to which the images are installed, and the hardware version of the board on which the images run fine.

For a full update, the archive contains the full rootfs and boot loader images. For a delta update, the archive contains a list of checksums to figure out which blocks of the rootfs image have changed. For a partial update, the archive contains a tarball of the new applications, libraries and auxiliary files. Both delta and partial updates reduce the size of the archive significantly and save download bandwidth. As usual, this plus doesn’t come for free.

Once the build has generated an archive for a full, delta or partial update, you upload it to the fleet management server manually or let the CI/CD system upload it automatically. The server rolls out the update to the eligible devices group by group. You want to start the rollout with a small group so that only a few devices are affected if the update goes wrong or the new software has problems. Once the update works for the small group, you can roll it out to bigger and bigger groups. This is called a staged rollout.

The devices check regularly whether an update applicable to their hardware and software configuration is available on the server. The update procedure for a battery-powered handheld device could go through the following steps.

1. Some time (e.g., 5 minutes) after booting, the device checks whether an update is available. If so, it continues with Step 2. Otherwise, it waits for the next power cycle or the next day.
2. The device notifies the user about the availability of the update, e.g., by popping up a dialog.
3. In the dialog, the user can choose to install the update right now, install it during the night (e.g., at 3 am), or defer the update to the next power cycle or the next day. Users may defer the update only twice. They must install the update on the third attempt. This ensures that the device gets security-critical or safety-critical updates in a timely manner.
4. When the scheduled installation time comes, the device installs the update into the inactive partition (assuming an A/B strategy), makes the inactive partition the active one and reboots into the partition with the updated images. The device should perform the update only if it is connected to mains or the battery level is above a certain threshold.
5. The user works with the updated system.
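
The deferral and power rules from these steps can be sketched in a few lines of Python. This is only a minimal sketch, not code from my update library: the names UpdateState, may_defer, record_deferral and may_install_now are made up for illustration, and the 50% battery threshold is an assumption, as the right value depends on the device.

```python
from dataclasses import dataclass

MAX_DEFERRALS = 2          # users may defer the update only twice
MIN_BATTERY_PERCENT = 50   # assumed threshold; pick a value that fits your device


@dataclass
class UpdateState:
    """Persistent state the device keeps for a pending update (hypothetical)."""
    deferrals_used: int = 0


def may_defer(state: UpdateState) -> bool:
    """The dialog offers 'defer' only while deferrals remain."""
    return state.deferrals_used < MAX_DEFERRALS


def record_deferral(state: UpdateState) -> None:
    """Count a deferral; on the third attempt the update must be installed."""
    if not may_defer(state):
        raise ValueError("no deferrals left; the update must be installed")
    state.deferrals_used += 1


def may_install_now(on_mains: bool, battery_percent: int) -> bool:
    """Install only on mains power or with the battery above the threshold."""
    return on_mains or battery_percent > MIN_BATTERY_PERCENT
```

The device would have to store the deferral counter in the data partition so that it survives the power cycles between the attempts.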

For a mains-powered device with a high-bandwidth Internet connection, the procedure could look different. The device installs the update in the inactive partition right away without user interaction. When the device is rebooted the next time, it starts the updated system. The user shouldn’t notice anything - except for some new features, some smoother interaction or some bugs fixed.

The update procedure depends heavily on how a machine or device is used. You want to define the update procedure together with your customers before you start its implementation.

Full, Delta or Partial Updates

Boot loader updates are always full updates. They are never delta or partial updates. What I write about full rootfs updates also applies to boot loader updates.

Full Updates

The update archive (e.g., a CPIO archive for SwUpdate) contains a full rootfs image, typically as a gzip-compressed ext4 image (e.g., rootfs.ext4.gz). The update client writes the image into the inactive partition of the eMMC storage (e.g., /dev/mmcblk2p2). The name of the rootfs image, the partition, the software version and the hardware version are specified in the configuration file included in the archive.

Rootfs images can be huge, especially when not minimised. For example, the rootfs image built by a recent customer - based on the Boot2Qt image for the Variscite DART-MX8M-PLUS - had a compressed size of 1 GB and an uncompressed size of roughly 3 GB! The image contained gcc, gdb, valgrind, Qt WebEngine and Qt bloatware (pardon, Qt example applications). An hour of minimising the image brought down the size to 350 MB compressed and roughly 1 GB uncompressed. With a little bit more effort, I could bring down the uncompressed size to 250-500 MB.

Such huge sizes have a couple of negative consequences. They require high-bandwidth Internet connections for acceptable download times. Construction, forestry and agricultural machines can only dream of high-bandwidth connections. 2G or 3G connections are still common. As modems consume a lot of power, long updates deplete the device battery pretty quickly.

As mobile Internet contracts are priced by volume, huge sizes drive up the costs quickly. For example, a manufacturer provided plans with world-wide coverage for their harvesters. The annual volume limit was 1 GB. The manufacturer paid for the plans for the lifetime of their harvesters, that is, for 15 years or more. This ensured that the manufacturer didn’t have to rely on the drivers’ data plans for updates. Larger volumes would have rendered OTA updates unprofitable. The manufacturer used partial updates during the two-month harvesting period and full updates only when the harvester was connected to WLAN during maintenance.
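
A quick back-of-the-envelope calculation shows how tight such a plan is. The sketch below combines the 1 GB annual limit with the 350 MB compressed image from my other customer, so the mix is illustrative only. The 20 MB size for a partial update is a hypothetical value, not a measured one, and updates_per_year is a name I made up.

```python
ANNUAL_LIMIT_MB = 1000   # 1 GB annual volume limit of the harvester plans
FULL_UPDATE_MB = 350     # compressed size of the minimised image mentioned above
PARTIAL_UPDATE_MB = 20   # hypothetical size of a partial update


def updates_per_year(limit_mb: int, update_mb: int) -> int:
    """Number of complete updates that fit into the annual data volume."""
    return limit_mb // update_mb


full_updates = updates_per_year(ANNUAL_LIMIT_MB, FULL_UPDATE_MB)
partial_updates = updates_per_year(ANNUAL_LIMIT_MB, PARTIAL_UPDATE_MB)
print(f"full updates per year: {full_updates}")
print(f"partial updates per year: {partial_updates}")
```

With these numbers, only two full updates fit into the annual volume, and that ignores all other traffic of the machine.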

The image size directly impacts the required size of the internal eMMC storage - doubly so for A/B partitions. The price for a 4 GB eMMC chip with a 1/1/2 GB partitioning (A/B/Data) starts at €4.03 at 1000 units. The price for an 8 GB chip with a 2/2/4 GB partitioning starts at €5.39. This is a difference of €1.36 per unit or €13,600 for 10,000 units. The costs add up quickly.

It starts to sound as if you shouldn’t use full image updates at all. Quite the opposite! With full updates, you will never run into situations where the new application doesn’t work with the installed libraries or the installed application doesn’t work with the new libraries. The image build would fail in these and other cases. A full update guarantees a level of consistency that you cannot reach with partial updates. Delta updates have the same consistency level as full updates, but they make things a bit more complicated.

All update clients support downloading the update archive to a partition with enough free space like the data partition, extracting the rootfs image from the archive and writing the binary ext4 image to the inactive partition. The last step is nothing else but an offline update. The better clients like SwUpdate, RAUC and Mender stream the data blocks of the rootfs image directly into the inactive partition. These clients save the space for downloading the rootfs image.
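
The streaming idea can be sketched as follows. This is not how SwUpdate, RAUC or Mender implement it, just a minimal Python illustration of decompressing a gzip stream block by block and writing each block to the target without storing the archive first. The function name stream_image and the block size are assumptions. The in-memory buffers stand in for the download stream and the inactive partition.

```python
import gzip
import io
import zlib
from typing import BinaryIO

BLOCK_SIZE = 64 * 1024  # assumed block size for reading the download stream


def stream_image(source: BinaryIO, target: BinaryIO,
                 block_size: int = BLOCK_SIZE) -> int:
    """Decompress a gzip stream block by block and write it to target.

    Returns the number of uncompressed bytes written.
    """
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    written = 0
    while True:
        block = source.read(block_size)
        if not block:
            break
        data = decompressor.decompress(block)
        target.write(data)
        written += len(data)
    tail = decompressor.flush()
    target.write(tail)
    return written + len(tail)


# Usage with in-memory buffers instead of a network socket and a block device.
image = bytes(range(256)) * 1024
download = io.BytesIO(gzip.compress(image))
partition = io.BytesIO()
written = stream_image(download, partition, block_size=4096)
```

On a device, the target would be the inactive partition opened for binary writing and the source the response stream of the download. Only one block is held in memory at a time.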

Check out the post OTA for Embedded Linux Devices: A practical Introduction by Thomas Sarlandie to learn how to perform an OTA update with an SwUpdate client and a Memfault server.

Delta Updates

The build applies zchunk to the binary rootfs image. Zchunk splits the image into chunks, compresses the chunks and computes the checksum for each chunk. The build adds the list of all checksums instead of the rootfs image to the update archive. The update archive is tiny compared to the one for a full update.

For each chunk of the inactive rootfs partition, the SwUpdate client calculates the zchunk checksum and compares it with the checksum from the archive. If the two checksums differ, the client reads the chunk from the server and writes it directly to the right location in the inactive partition.
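
The comparison of checksums can be illustrated with a simplified Python sketch. It is not the zchunk algorithm or file format: I assume fixed-size chunks, SHA-256 checksums and two images of equal size, and the names chunk_checksums, changed_chunks, apply_delta and fetch_chunk are made up for this example.

```python
import hashlib
from typing import Callable, List

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def chunk_checksums(image: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Checksum of every chunk of an image, as the build would compute it."""
    return [hashlib.sha256(image[i:i + chunk_size]).hexdigest()
            for i in range(0, len(image), chunk_size)]


def changed_chunks(installed: bytes, new_checksums: List[str],
                   chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Indices of chunks whose installed checksum differs from the new one."""
    old = chunk_checksums(installed, chunk_size)
    return [i for i, checksum in enumerate(new_checksums)
            if i >= len(old) or old[i] != checksum]


def apply_delta(installed: bytearray, new_checksums: List[str],
                fetch_chunk: Callable[[int], bytes],
                chunk_size: int = CHUNK_SIZE) -> int:
    """Fetch only the differing chunks and write them in place.

    Returns the number of chunks fetched from the server.
    """
    indices = changed_chunks(bytes(installed), new_checksums, chunk_size)
    for i in indices:
        data = fetch_chunk(i)  # on a device: an HTTP range request
        start = i * chunk_size
        installed[start:start + len(data)] = data
    return len(indices)
```

If only one of three chunks differs, apply_delta fetches one chunk instead of the whole image. That is where the bandwidth savings come from.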

Delta updates work best with a read-only rootfs. If applications change files in the rootfs, the update client overwrites the corresponding chunks. So, changes to settings files, user data and databases will be lost after the update. Users won’t be amused.

Yocto provides the image feature read-only-rootfs to create a read-only rootfs. The data partition and standard directories like /tmp and /var/run are always writable. You should place all the files changed by your applications into these writable locations.

Delta updates minimise the amount of data transferred from the server to the client. They reduce the data volume and hence the price for mobile data plans. Moreover, they enable updates for Internet connections with lower bandwidths. Delta updates ensure that application and library versions are consistent. Like full updates but unlike partial updates, delta updates can downgrade a device to an older version, because they don’t care about the installed version.

You pay for all these advantages with a slightly higher complexity compared to full updates. The Yocto task for creating the update archive must run zchunk on the rootfs image and include the checksums in the archive instead of the rootfs image. You must change your applications so that the rootfs can be read-only. This is a small price to pay for the considerable advantages of delta updates over the other two types.

Check out the post Delta OTA Update with SWUpdate by Andrew Murray to learn how to prepare an SwUpdate archive with checksums and how to build SwUpdate with zchunk support.

Partial Updates

You build your applications, libraries and other files against the SDK created from the rootfs and pack some or all of these files into a tarball. The tarball replaces the rootfs image in the update archive. The configuration file specifies in which directory the update client shall unpack the tarball. This is called a partial update.

Partial updates have a very small footprint, probably even smaller than delta updates. However, they can and will create inconsistencies. You might put the new application version into the tarball but forget one of the changed libraries. Then, the application tries to access the library through an old interface and misbehaves in strange ways or even crashes. The opposite situation with new library versions and an old application version causes similar problems.

It is easier to mess up than you think. Even if you specify only certain Qt modules in the image recipe, the Qt class populate_sdk_qt6 (similar for Qt5), by default, adds all Qt modules into the SDK. Developers do not even know that they use Qt modules that are not on the device. Say hello to the next crash of your application!

Given how error-prone partial updates are, I don’t recommend them. My preferred solution is delta updates - with full updates as a simple and robust alternative.

Fleet Management Servers

Server-Side Solutions

The Memfault server works with different update clients like SwUpdate, RAUC and Mender. This multi-client solution makes it easy to migrate your fleet from one client to another. You might want to do this, because you are unhappy with the pricing or because the client or server lack a must-have feature.

The Memfault server is built on top of the free and open-source hawkBit server. You could host the hawkBit server on your premises and use it for managing device updates. This would save you the fees for a commercial server like Memfault, Mender or Balena. But you would lose a powerful UI for fleet management, device monitoring, alarms, crash reports, logs and many other features. You would be responsible for maintaining and scaling the server and for adding the most valuable features. It is probably cheaper to pay the professionals.

The Mender server works best with the Mender client. As the Mender client is open-source, it should be possible to use a different client. However, this is not the intended use. Mender touts its OTA update solution as an end-to-end solution: Mender client and Mender server. This is a single-client solution.

I won’t go into a feature comparison of the different server solutions, although sales people like to dwell on feature tables. All solutions are good enough for managing OTA updates of device fleets. The selection boils down to pricing, rapport with the vendor - and how easy it is to integrate the client with the server on a specific board.

Ideally, the server and SoC vendors provide an out-of-the-box integration of the update client into embedded Linux systems built with Yocto. The only things you should have to configure are the size and name of the rootfs and data partitions, the software version and the hardware version. The build generates a script for partitioning the internal eMMC storage according to the specification and the configuration file for the update archive with the given versions. Furthermore, the build compiles the update client and creates a script to determine the hardware version and the ID of the board. The ID could be the MAC address of the (W)LAN chip or the ID of the SoC.

Variscite and Memfault come pretty close to this ideal with an integration of the SwUpdate client. I had my first OTA update working within 2 days. I achieved this just by following the documentation.

Mender are far away from the ideal. They complicate the partitioning, because they map a Mender-specific partitioning configuration to the official Yocto configuration with numerous functions spread over several classes and recipes. I couldn’t figure out the right configuration for a 4/4/8 GB partitioning in four days and gave up. This is quite the opposite of the end-to-end solution marketed by the Mender sales people!

An Example Dashboard

The screenshot above shows the Memfault dashboard of my current project. It shows 4 of the 8 devices that receive OTA updates. The first column shows the device ID, which is the SoC ID.

The 8 devices are grouped into two cohorts: update_test and default. The first cohort contains my device. I use this cohort for testing new update features. The second cohort contains the other 7 devices, which are used by my customer’s developers. A better name for the second cohort would be developers or dev. For each product, the manufacturer could introduce a cohort with the name of the product.

The third column gives the version of the rootfs image installed on the device. The devices are on different versions. One device is on the latest version 1.1.7. The three other devices could update to the latest version. The fourth column shows the hardware version. If you modify your board, you will probably give it a code name different from the standard board name. For example, you could use vanilla-alpha for the first revision of the board, vanilla-beta for the second revision and so on. The last column tells you how long ago the device was last seen.

The screenshot above shows that 5 devices of the default cohort are on the old version 1.1.5 and could be updated to the latest version 1.1.6. Two devices are already on the latest version.

When a device checks whether an update is available, it sends its ID, software version and hardware version to the Memfault server. The server checks whether a newer version is available and sends the result back to the device. If an update is available, the client installs the update at the scheduled time. Otherwise, it waits a given interval (e.g., until the next reboot or 1 day) until it repeats the availability check.
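
The server-side decision can be sketched as follows. This is not the Memfault API, only a minimal Python illustration of the matching logic under my assumptions: a release is tied to one hardware version and one cohort, versions are dotted numbers, and the names Release, parse_version and find_update are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Release:
    """A release activated on the server for a cohort (hypothetical model)."""
    version: str
    hardware_version: str
    cohort: str


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.1.10' into (1, 1, 10) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def find_update(releases: List[Release], cohorts: Dict[str, str],
                device_id: str, software_version: str,
                hardware_version: str) -> Optional[Release]:
    """Return the newest applicable release or None if the device is current."""
    cohort = cohorts.get(device_id)
    installed = parse_version(software_version)
    candidates = [r for r in releases
                  if r.cohort == cohort
                  and r.hardware_version == hardware_version
                  and parse_version(r.version) > installed]
    return max(candidates, key=lambda r: parse_version(r.version),
               default=None)
```

A device of the default cohort on version 1.1.5 would be offered 1.1.6, whereas a device already on 1.1.6 would get no update and repeat the check after the configured interval.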

The Memfault dashboard offers many more helpful views. The Memfault Interactive Sandbox lets you play around with the dashboard.

Rollouts

You should never roll out an update to all devices at once. If the update is buggy, all devices will have issues and you’ll get many angry calls from your customers. You would maximise the damage.

So, the wise approach is to perform a staged rollout. You roll out a new version to a small group of devices first and then incrementally to larger and larger groups. The better servers can randomly select the devices for the groups and run the rollout automatically. Alternatively, you could create the groups by dynamic filters or use the statically defined cohorts and run the rollout manually.

Staged rollouts minimise the damage. Only a few devices and a few customers are affected by update issues. You get early feedback about possible issues from the smaller groups. If there are issues, you can abort the update with a single click and prevent devices still on the old version from running into the known issues. Then, you can analyse the update logs, fix the problem and start a staged rollout for the fixed version.
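
A staged rollout with random groups and an abort condition can be sketched like this. It is a minimal Python sketch, not the implementation of any of the servers mentioned above. The group fractions of 5%, 25% and 100%, the install callback and the names staged_groups and run_rollout are assumptions for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def staged_groups(device_ids: Sequence[str],
                  fractions: Sequence[float] = (0.05, 0.25, 1.0),
                  seed: Optional[int] = None) -> List[List[str]]:
    """Split the fleet into random groups.

    The fractions are cumulative shares of the fleet, so the groups grow
    from stage to stage. For tiny fleets, early groups may be empty.
    """
    shuffled = list(device_ids)
    random.Random(seed).shuffle(shuffled)
    groups, start = [], 0
    for fraction in fractions:
        end = max(start, min(len(shuffled), round(len(shuffled) * fraction)))
        groups.append(shuffled[start:end])
        start = end
    return groups


def run_rollout(groups: List[List[str]], install: Callable[[str], bool],
                max_failures: int = 0) -> Tuple[List[str], bool]:
    """Update group by group and abort once a group has too many failures.

    Returns the devices that were attempted and whether the rollout finished.
    """
    attempted: List[str] = []
    for group in groups:
        failures = sum(0 if install(device) else 1 for device in group)
        attempted.extend(group)
        if failures > max_failures:
            return attempted, False  # later groups stay on the old version
    return attempted, True
```

For a fleet of 100 devices, the default fractions yield groups of 5, 20 and 75 devices. If the first group fails, the other 95 devices never see the faulty update.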
