Dear Reader,
I wish you a happy, healthy and successful year 2024!
For me the new year started as the old year ended: with a lot of interesting work. Since late October, I have been building an OTA update solution for a customer. I advised the customer in selecting a solution. The most interesting contenders were Mender and Memfault. The customer chose the Memfault server with an SwUpdate client. The main reasons were:
Based on good documentation and supported by the Memfault engineers, I could quickly get the SwUpdate client working with the Memfault server. I failed with Mender.
During the sales process, my customer and I gained some insights into how other Memfault customers do OTA updates. This will help us to avoid some wrong paths. Memfault sales knew the pros and cons of their competitors’ solutions. In contrast, Mender sales asked us why we would even consider OTA update solutions other than Mender.
The above two reasons were enough for my customer to rule out Mender. The much higher costs of the Mender solution made the decision a no-brainer.
Last week, we installed my OTA update solution on their first five devices used by their developers. The solution is working reasonably well. We found some new error scenarios when updating their devices. This early and regular feedback is great and will help us eliminate more strange error scenarios long before the product release.
In this episode of the newsletter, I’ll look at OTA updates from the device or client side. In one of the next episodes, I’ll cover the server side and the requirements for a fleet management server.
Happy reading,
Burkhard
The Client Side of Over-The-Air Updates
Context
The fully electric ID.3 is VW's first car built on vw.OS. VW are facing massive software problems. They didn't even get the over-the-air (OTA) update working in time for the first batch of 10,000 cars. So, VW must update these 10,000 cars manually! Most likely, they'll have to update the second batch of 10,000 cars manually as well.
From Episode 3 of my newsletter (February 2020)
In the end, VW had to update more than 20,000 ID.3 cars manually. Let us assume that the update of one car takes an optimistic 15 minutes. Then, VW wasted at least 5,000 hours, that is, 625 eight-hour work days or roughly 3 work years!
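The estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 8-hour work day and the 200 work days per year are my assumptions, and the function name is made up for illustration.

```python
def manual_update_effort(cars, minutes_per_car, hours_per_day=8, days_per_year=200):
    """Return (work_days, work_years) needed to update all cars by hand.

    hours_per_day and days_per_year are assumptions, not figures from VW.
    """
    total_hours = cars * minutes_per_car / 60
    work_days = total_hours / hours_per_day
    work_years = work_days / days_per_year
    return work_days, work_years


days, years = manual_update_effort(cars=20_000, minutes_per_car=15)
print(f"{days:.0f} work days, {years:.2f} work years")
```

With more cars or a less optimistic time per car, the effort only grows.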
Only half jokingly, I tell my customers that they must get a single thing right for their first product release: Their product need not work but the over-the-air (OTA) update must work reliably. As new features become available or bugs are fixed, users install the updates on the devices or the devices install them automatically. Users get more value into their hands quicker. Manufacturers save a lot of time in support and get quicker feedback about the new features.
VW is not alone in doing manual updates from USB drives or laptops hooked up to the device. For many manufacturers, the preferred way to update their machines is from a USB drive with some DIY software from the SoC, SoM or terminal makers. Manual offline updates make sense if devices have only temporary or no network access or if the OTA update didn’t work because of some bug.
In general, however, manual updates are costly and error-prone - and can soon get manufacturers into legal trouble. The EU Cyber Resilience Act (EU CRA) requires manufacturers to fix security vulnerabilities quickly and to ensure that the fix is installed on their devices. Manufacturers can’t guarantee that if they send the USB drive by DHL or UPS and rely on their customers to install the fix. The penalties for fixing vulnerabilities too late or not fixing them at all are high. They can be up to €15,000,000 or 2.5% of a company’s worldwide annual turnover, whichever is higher.
As usual, my focus is on embedded HMI systems (a.k.a. operator terminals) running Linux on a microprocessor (e.g., NXP iMX6, iMX7 or iMX8). Examples are the driver terminals for agricultural, construction and industrial machines, tablets for UV cleaning robots and remote controls for cranes. These devices can be mains or battery powered.
A driver terminal of a harvester is also responsible for updating the other ECUs over CAN bus. Although ready-made and well-tested libraries for DFU (device firmware update) are available for CAN, Bluetooth, USB and other communication protocols, manufacturers tend to develop their own solutions. For each ECU, the terminal downloads the firmware and uses the DFU software to install the firmware on the ECU.
An end-to-end update solution has a client running on the device and a fleet management server hosted in the cloud. The world leader for OTA updates is Mender. The Mender server only works with the Mender client. The challenger is Memfault. The Memfault server works with many different clients like Mender, SwUpdate, RAUC and OSTree. The Bootlin blog has technical posts on how to perform updates with SwUpdate, Mender and RAUC.
Update Strategies
Double-Copy or A/B Strategy
For the double-copy or A/B update strategy, you create two partitions A and B on the internal flash storage. Each partition can hold the full image of the Linux system including the root file system, device tree and kernel. On power-up, the boot loader starts, say, the Linux system in partition A. Then, partition A is the active partition containing the active copy and B is the inactive partition containing the inactive copy or no copy at all.
When a user performs an update, the update client installs the new system image (e.g., rootfs.ext4.gz) into the inactive partition. When the installation has finished successfully, the client changes the boot partition to the inactive partition (that is, it makes the yellow arrow point to system B) and reboots the device. The device boots into partition B and runs the new system B. B is now the active partition and A the inactive one.
The update is atomic. The client only flips the boot partition if the new image was fully installed in the inactive partition. If there was an error due to a power cut or bad connectivity, the client doesn’t change the boot partition. Hence, it doesn’t matter whether a broken copy was installed in the inactive partition. The system runs on the active and working partition and reboots into it.
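The following Python sketch models why the flip makes the update atomic. It is a simulation, not code from any update client: the Device class, the in-memory "partitions" and the SHA-256 check are my assumptions for what "fully installed" could mean.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy model of a device with two system partitions (hypothetical)."""
    active: str = "A"
    partitions: dict = field(
        default_factory=lambda: {"A": b"system-v1", "B": b""})

    @property
    def inactive(self) -> str:
        return "B" if self.active == "A" else "A"


def install_update(device, chunks, expected_sha256):
    """Write the image into the inactive partition, then flip the boot partition.

    The flip is the very last step. Any error before it leaves the
    active partition untouched.
    """
    target = device.inactive
    device.partitions[target] = b""
    try:
        for chunk in chunks:          # e.g. data streamed from a server
            device.partitions[target] += chunk
    except OSError:                   # power cut, bad connectivity, ...
        return False
    digest = hashlib.sha256(device.partitions[target]).hexdigest()
    if digest != expected_sha256:     # incomplete or corrupted image
        return False
    device.active = target            # the atomic flip
    return True
```

A failed download leaves a broken copy in the inactive partition, but `device.active` still points to the working system.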
If the boot loader fails to boot into the new system for a given number of times, that is, if the device has reached its boot limit, the boot loader must automatically reboot into the previously working partition (see slides 19 and 20, U-Boot bootcount and bootlimit, of Implementing A/B System Updates with U-Boot for more details). In the running example, the boot loader fails to boot into system B three times. Then, the boot loader automatically boots the device into system A, which is known to work.
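The fallback logic can be modelled in a few lines of Python. This is a simplified simulation of U-Boot's bootcount and bootlimit variables, not U-Boot code: U-Boot increments bootcount on every boot and runs altbootcmd instead of bootcmd once bootcount exceeds bootlimit. The function names and the boot_ok callback are made up for illustration.

```python
def select_boot_slot(updated_slot, fallback_slot, bootcount, bootlimit):
    """Pick the slot to boot. bootcount is the value after this boot's increment."""
    return fallback_slot if bootcount > bootlimit else updated_slot


def simulate_boots(boot_ok, bootlimit=3, updated_slot="B", fallback_slot="A"):
    """Return the sequence of slots the boot loader tries.

    boot_ok(slot) says whether the system in the updated slot comes up.
    The fallback slot is assumed to work.
    """
    bootcount = 0
    history = []
    while True:
        bootcount += 1                        # done by the boot loader
        slot = select_boot_slot(updated_slot, fallback_slot,
                                bootcount, bootlimit)
        history.append(slot)
        if slot == fallback_slot or boot_ok(slot):
            return history


print(simulate_boots(lambda slot: False))     # system B never boots
print(simulate_boots(lambda slot: True))      # system B boots fine
```

In a real system, something in Linux user space must reset bootcount after a successful boot. Otherwise, ordinary reboots would eventually trigger the fallback as well.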
Now consider the case that the boot loader successfully starts the Linux system from the updated partition but the main application keeps crashing. You can configure U-Boot to enable a watchdog in the Linux system. The main application must write to the watchdog device file regularly. If it fails to do so for a specified interval, the watchdog will reboot the system. The boot counter is incremented. When the device reaches its boot limit, U-Boot starts the other Linux system (see slide 22, Using a watchdog, of Implementing A/B System Updates with U-Boot for more details).
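A minimal sketch of the application side, assuming the standard Linux watchdog device interface: any write to /dev/watchdog restarts the timeout, and writing the character V right before closing asks the driver to disarm the watchdog (drivers can be configured to ignore this). The class name WatchdogFeeder is made up for illustration.

```python
import os


class WatchdogFeeder:
    """Keeps a Linux watchdog device from rebooting the system."""

    def __init__(self, path="/dev/watchdog"):
        # On a real watchdog device, opening the file arms the watchdog.
        self._fd = os.open(path, os.O_WRONLY)

    def feed(self):
        """Restart the watchdog timeout. Call this from the main loop."""
        os.write(self._fd, b"\0")

    def close(self, disarm=True):
        """Close the device, e.g. on a clean shutdown of the application."""
        if disarm:
            os.write(self._fd, b"V")   # "magic close", if the driver supports it
        os.close(self._fd)
```

The main application would call `feed()` from its event loop at an interval shorter than the watchdog timeout. If the application crashes or hangs, the calls stop and the watchdog reboots the device.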
Instead of relying on the watchdog, users can reboot the device until the boot limit is reached. Then, the boot loader boots the other working system. Users can then install a new system image that fixes the crash of the main application.
Typically, you would create a third partition for user data, the data partition. Updates leave the data partition untouched. The data partition gets all the space left over by the two system partitions. The Linux images provided by the SoC, SoM and terminal makers easily require 2 GB of flash storage and more. With a bit of effort, you can reduce the sizes to less than 1 GB or even less than 500 MB.
For a 16 GB eMMC storage, you could use a partitioning of 2 GB for partition A, 2 GB for partition B and 12 GB for the data partition. Using 2 GB for the system partitions leaves enough space for the lifetime of the device (say, 10-15 years). For a 4 GB storage, a 1/1/2 GB partitioning would do.
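The partitioning rule is simple enough to write down as a function. This is only a sketch: it ignores boot partitions, file system overhead and the difference between advertised and usable capacity, and the function name is hypothetical.

```python
def partition_layout(total_gb, system_gb):
    """Split the storage into two equal system partitions and a data partition."""
    data_gb = total_gb - 2 * system_gb
    if data_gb <= 0:
        raise ValueError("no space left for the data partition")
    return {"A": system_gb, "B": system_gb, "data": data_gb}


print(partition_layout(total_gb=16, system_gb=2))
print(partition_layout(total_gb=4, system_gb=1))
```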
Having two full copies of the system is the one disadvantage of the A/B update strategy. Internal storage sizes of 4-16 GB are pretty normal for the iMX family of microprocessors and their competitors. So, “wasting” 1-4 GB isn’t much of an issue any more. And don’t forget: A/B updates are atomic and fairly easy to set up. That’s why the A/B update strategy is by far the most popular approach.
Single-Copy-with-Rescue Strategy
The single-copy-with-rescue strategy replaces system A from the A/B strategy with a rescue system. The rescue system is a minimal Linux system that installs full-system updates into partition B. Ideally, you will never update the rescue system through the lifetime of the device, because there is no fallback if the update fails.
The rescue system normally fits into 100 MB or less. You can save 1-4 GB of flash storage compared to the A/B strategy with its two full system copies. But you pay a price for the storage savings: You don’t have atomic updates any more and the update process is more complicated for the users.
Users must go through the following steps to update the full system B.
The user reboots the device into the rescue system.
The user performs the update from an application running in the rescue system.
If the installation of the update succeeded, the user reboots into the full system B.
If the installation of the update failed, the user stays in the rescue system and retries the installation of the update once the update server is available again.
Booting into the rescue system and back into the full system can be realised in hardware or in software.
Hardware option:
Booting into the rescue system: While holding the power button pressed for a couple of seconds, the user holds down a second button.
Booting into the full system: The user holds the power button pressed for a couple of seconds.
Software option:
Booting into the rescue system: The user triggers the reboot from an application running in the full system. The application sets a variable in the boot loader to “rescue”. The user presses the reboot button for a couple of seconds.
Booting into the full system: The user triggers the reboot from the update application running in the rescue system. The application resets the boot loader variable to “normal”. The user presses the reboot button for a couple of seconds.
The hardware option needs an additional button (typically connected to a GPIO) and some custom circuitry to coordinate the extra button and the power button. The production boards of the SoM and SoC makers hardly ever have this feature. So, you can design your own custom board or buy an operator terminal with a customised board from the terminal makers like Topcon, CrossControl, Christ Electronics and many others - for a premium. You don’t have these extra costs with the software option.
The fallback mechanisms for the A/B strategy work for the single-copy-with-rescue strategy as well. When the boot count reaches the boot limit for booting the full system due to reboots triggered by a user or a watchdog, the device automatically boots into the rescue system. Of course, users cannot do any productive work in the rescue system. It’s downtime for users.
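The boot decision for the software option, combined with the boot-limit fallback, can be sketched as follows. This is a model of the logic, not boot loader code. The variable name boot_mode is hypothetical. The values "rescue" and "normal" are the ones used above.

```python
def select_boot_target(boot_mode, bootcount, bootlimit):
    """Decide whether to boot the rescue system or the full system B.

    boot_mode: boot loader variable set by the applications
               ("rescue" or "normal").
    bootcount: number of boot attempts since the last successful boot,
               including the current one.
    """
    if boot_mode == "rescue":
        return "rescue"        # the user asked for an update
    if bootcount > bootlimit:
        return "rescue"        # full system failed to boot too often
    return "full"


print(select_boot_target("normal", bootcount=1, bootlimit=3))
print(select_boot_target("rescue", bootcount=1, bootlimit=3))
print(select_boot_target("normal", bootcount=4, bootlimit=3))
```

Either way, the device ends up in the rescue system, where the only thing users can do is install a working image.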
If reducing the size of the flash storage to the bare minimum is your topmost priority, the single-copy-with-rescue strategy is worth considering. Be aware, however, that your users have longer downtimes if the installation fails or if the updated system doesn’t work correctly. There is no second fully working system on the device. The question you must answer in the beginning is: Are the cost savings for a smaller flash storage worth the lower robustness of the updates, the worse user experience and the longer downtimes for the users? My answer: hardly ever.
Updating the Boot Loader
My explanations in this section are based on the excellent article Considerations for Updating the Bootloader Over-the-Air (OTA) by Drew Moseley. The article equips you with the technical details of how to update eMMC boot partitions.
No Boot Loader Updates
Many companies stick with the same boot loader for the whole lifetime of their product. They will never update the boot loader, because the risk of making the device unusable is too high for them. The boot loader is a single point of failure. Repairing such a failure is expensive - maybe even so expensive that it’s better to replace the device.
Multi-Stage Boot Loaders
You can reduce the probability of having to update the boot loader by splitting it up into two boot loaders. The stage-1 boot loader or secondary program loader (SPL) is a bare minimum boot loader that just knows how to start the stage-2 boot loader. Both system A and B contain their own copy of the stage-2 boot loader. The stage-2 boot loader could be part of the system image - in addition to the root file system, device tree and kernel. This enables you to apply the A/B strategy to the stage-2 boot loaders as well.
While the stage-1 boot loader provides the bare minimum functionality, the stage-2 boot loader provides all the advanced features of modern boot loaders. Hence, there is a small risk that the stage-2 boot loader has a security vulnerability and needs updating. This risk is negligible for the stage-1 boot loader, which is never updated.
Parallel Boot Loaders
Another way to recover from a failed boot loader update is to have parallel boot loaders (Option 3 in Moseley’s article). Many SoMs - including my Variscite iMX8M Plus and Nano - can be booted both from SD card and from their internal flash storage. Both the SD card and the flash storage include their own boot loader. Users switch between the two boot options by changing a DIP switch (during development) or by holding a button pressed during reboot (in the final product).
If an update bricks booting from the internal storage, the user shuts down the device by pressing the power button for a couple of seconds, plugs in the SD card and keeps a special button pressed while pressing the power button. The procedure may vary from device to device. Then, the device boots from the SD card. An update application on the SD card lets users install a new image in the internal flash storage. You can always unbrick the device by burning a new image on the SD card and booting from it.
The program for starting the system either from the SD card or from the internal flash storage could be an SPL (stage-1 boot loader) or similar. It is often stored in an EEPROM and cannot be changed easily. This program is so simple and so well-tested that it need not be updated during the lifetime of the device.
eMMC Boot Partitions
The eMMC specification defines two boot partitions for every eMMC device. So, all SoMs with an internal eMMC storage like the iMX8M SoMs have two boot partitions (see Option 4 of Moseley’s article). The typical size for a boot partition is 4 MB. You can flash the boot partitions and select the active boot partition from Linux user space.
The following diagram wraps up how to partition the eMMC storage of an iMX8M SoM, how to write the root file system images to the system partitions and the boot loader images to the boot partitions, and how to select between the system and the boot partitions.
This approach makes updating boot partitions atomic. If updating the inactive boot partition fails, there is always the active boot partition that is known to work. However, there is no automatic way to boot into the other boot partition, if booting from the newly updated partition fails.
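The following Python sketch only builds the shell commands for such an update so that you can review them. It does not run anything. It assumes the usual Linux names for the boot partitions (mmcblk0boot0 and mmcblk0boot1), their force_ro attribute in sysfs and the mmc tool from mmc-utils. The function name, the image name, the seek offset and the send_ack value of 1 are assumptions. The right offset and acknowledge setting depend on the SoC and its boot ROM.

```python
def bootloader_update_commands(image, active_boot_part,
                               device="/dev/mmcblk0", seek_kib=0):
    """Return the commands that flash the inactive boot partition and enable it.

    active_boot_part: 1 or 2, the boot partition currently enabled.
    seek_kib: SoC-specific offset of the boot image (assumption: 0).
    """
    if active_boot_part not in (1, 2):
        raise ValueError("active_boot_part must be 1 or 2")
    target = 2 if active_boot_part == 1 else 1
    # Linux numbers the boot partition devices from 0: boot0 and boot1.
    name = f"{device.rsplit('/', 1)[-1]}boot{target - 1}"
    return [
        f"echo 0 > /sys/block/{name}/force_ro",   # make it writable
        f"dd if={image} of=/dev/{name} bs=1K seek={seek_kib} conv=fsync",
        f"echo 1 > /sys/block/{name}/force_ro",   # write-protect again
        # Switch only after the image was written completely.
        f"mmc bootpart enable {target} 1 {device}",
    ]


for cmd in bootloader_update_commands("imx-boot.bin", active_boot_part=1):
    print(cmd)
```

As with the A/B strategy for the system partitions, the switch to the new boot partition is the last step. If writing the image fails, the device keeps booting from the old boot partition.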
If you test the new boot image properly and make developers and internal users install it on their devices before a wider rollout, the probability of bricking the device tends towards zero. Fixing bricked devices becomes easy if you have parallel boot loaders. In short, eMMC boot partitions are the safest way to update boot loaders - especially when backed up with parallel boot loaders.
Offline Updates
The update strategies for OTA updates also apply to offline updates - with the same caveats. The only difference is that offline updates read the system image from a USB drive or an SD card plugged into the device instead of streaming it from a server.
Offline updates are useful, if devices are not connected to the Internet, the Internet connection is intermittent, the bandwidth is not high enough or the prices for mobile Internet connections are too high. These problems often occur with agricultural and construction machines in remote areas. Even industrial machines used indoors need not have an Internet connection, maybe because the building was built before IoT became all the rage.
Don’t assume that your customers have the same seamless mobile Internet connectivity as you have in your place. If the Internet is not available, offline updates will save the day.
Automatic or Interactive Updates
In his must-read article OTA for Embedded Linux Devices: A practical Introduction, Thomas Sarlandie gives a step-by-step description of how an SwUpdate client performs OTA updates with a Memfault server. He implements a systemd service that automatically installs the new image in the inactive partition when an update is available and reboots the device automatically once the installation of the update has finished successfully.
This behaviour is fine to get OTA updates with an SwUpdate client and a Memfault server working for a prototype, but it is most likely not what you want to see on your product. Rebooting the device in the middle of someone’s work will make users angry. Users should decide when to reboot the device.
Even downloading a huge image from a server and installing it in the inactive partition while users continue working with the active system is more often than not a bad decision. The update will increase the CPU load, will use most of the bandwidth of the Internet connection and slow down the I/O with the internal storage. The device becomes less responsive to user interaction. The extra load caused by the update drains the battery of battery-powered devices faster. Users may have to stop working or the installation fails because of an empty battery. Neither case should happen.
A more interactive approach could ask users whether they want to install an available update now, during the night or at any other suitable time. Another option may allow users to defer the update by a day. When the user has deferred the update three times, the installation of the update is forced at a user-selected time. Battery-powered devices should only enable updates if the battery level is high enough or the device is connected to mains.
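One possible shape of such a policy, sketched in Python. The limit of three deferrals comes from the description above. The 50% battery threshold, the action names and the class are my assumptions and will differ from product to product.

```python
from dataclasses import dataclass

MAX_DEFERRALS = 3
MIN_BATTERY_PERCENT = 50      # assumption, tune per product


@dataclass
class UpdateRequest:
    """Tracks how often the user has deferred one available update."""
    deferrals: int = 0


def next_action(request, user_choice, battery_percent, on_mains):
    """Return what the update client should do next.

    user_choice is "install" or "defer".
    """
    if not on_mains and battery_percent < MIN_BATTERY_PERCENT:
        return "wait_for_power"          # don't even offer the update
    if user_choice == "defer":
        if request.deferrals < MAX_DEFERRALS:
            request.deferrals += 1
            return "defer"
        return "force_at_chosen_time"    # user may only pick the time
    return "install"
```

The function keeps the policy in one place, which makes it easy to test and to adapt when product management changes its mind about the number of deferrals.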
If you are sure that nobody works at, say, 2 am, the system can perform the update at that time without any user interaction. When the user comes in the next morning, either the device runs the newly updated system image or it has fallen back to the previous image because of too many boot errors. The user can start working in the morning as if nothing has changed.
There are many approaches with different levels of automation and interaction to update devices. These approaches will differ from product to product. You need to decide early in your product development on your specific approach.
Remember: The single thing that must work 100% reliably on your device is the OTA update. You can then use the OTA update to bootstrap everything else - including your core application. Learn from VW!
Hi,
A very nice article!
Memfault seems not to do the best work in advertising themselves, though: I have to admit that this is the first time I hear about them, despite some quite extensive googling & reading around robust SW updates :).
A comment regarding the single copy + rescue approach. It seems that one additional drawback would be that the rescue system will probably have almost the same security requirements as the main OS. That would mean that the rescue system also has to receive regular security updates, just like the main OS. That would then lead to having one more OS to maintain and increase the risk of a non-atomic, no-rollback update going south in the field, and all this to save a couple of euros on the flash cost.