Best practices for building resilience in remote IoT devices

According to market reports, around 1.2 billion IoT devices were connected to cellular networks in 2019, and that number is expected to reach 4.7 billion by the end of 2030. Most of these devices are connected to the internet for remote deployment applications, such as monitoring the growth of trees in a forest, monitoring stream, river, or lake levels, or tracking moving subjects such as vehicles or containers.

Remotely deployed devices pose unique challenges. In their development life cycle, initial development and testing are done indoors or in controlled environments, and most firmware and hardware issues are resolved in this phase. In the second phase, a set of devices is deployed in the field for pilot testing. Because these devices are deployed remotely, the key challenge in this phase is physically accessing a device to debug issues and apply fixes. This article discusses some recommended practices for IoT device development that better support the piloting period and beyond.

Example use case

Remotely deployed IoT devices could be based on a microprocessor with an embedded operating system or on bare-metal firmware. For this discussion, let us assume a simple battery-powered, microcontroller-based device. It measures temperature using a thermistor and is connected to an LTE-M modem to communicate with the IoT back-end system. The device can be visualized as a simple block diagram, as shown below.

Figure 1: Block diagram of a simple IoT device that sends temperature data periodically via telecommunication network to a remote server

Following is a simple flow diagram summarizing the device operation with bare-metal firmware.

Figure 2: Firmware flow diagram

As shown in Figure 2, when the device powers up, the bootloader executes and passes control to the main firmware. The main firmware initializes the basic hardware and sends the first packet, with the respective data, to the back-end. Afterward, it reads the temperature from the thermistor and publishes the data to the back-end. Then, to conserve power, it sets up a Real Time Clock (RTC) wakeup alarm and goes into sleep mode.

Best practices for building resilience

In this section we’ll discuss best practices for efficiently diagnosing device issues and making the device more resilient in the field.

Organize your data

Initial Packet

When the device executes the main firmware after hardware initialization, it is good practice to send a data packet that includes fixed information such as the device's unique ID, the firmware version, and modem-related information such as the IMEI and the modem firmware version.

Other than this fixed information, reset-related information, such as the reset cause of the microcontroller, should be sent. When the device is deployed, analyzing the back-end data for the initial packet would help identify an unusual reset of the device, which could be a reset due to a watchdog, a brownout reset, or maybe a power-on reset due to loose contacts with the battery module.
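As a minimal sketch of such an initial packet, the C fragment below groups the fixed information and the reset cause into one structure and serializes it. The type names, the reset-cause values, and the JSON wire format are all assumptions made for illustration; on a real device the reset cause would be read from the microcontroller's vendor-specific reset-cause register.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reset causes. Real firmware would derive this from the
 * microcontroller's reset-cause register, whose layout is vendor specific. */
typedef enum {
    RESET_POWER_ON,
    RESET_BROWNOUT,
    RESET_WATCHDOG,
    RESET_SOFTWARE,
    RESET_UNKNOWN
} reset_cause_t;

/* Fixed information plus the reset cause, sent once after hardware init. */
typedef struct {
    const char   *device_id;
    const char   *fw_version;
    const char   *modem_imei;
    const char   *modem_fw_version;
    reset_cause_t reset_cause;
} initial_packet_t;

const char *reset_cause_name(reset_cause_t cause)
{
    switch (cause) {
    case RESET_POWER_ON: return "power_on";
    case RESET_BROWNOUT: return "brownout";
    case RESET_WATCHDOG: return "watchdog";
    case RESET_SOFTWARE: return "software";
    default:             return "unknown";
    }
}

/* Formats the packet as one JSON object (an assumed wire format).
 * Returns the number of characters written, or -1 if buf is too small. */
int initial_packet_format(const initial_packet_t *p, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "{\"id\":\"%s\",\"fw\":\"%s\",\"imei\":\"%s\","
                     "\"modem_fw\":\"%s\",\"reset\":\"%s\"}",
                     p->device_id, p->fw_version, p->modem_imei,
                     p->modem_fw_version, reset_cause_name(p->reset_cause));
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

On the back-end, filtering initial packets by the reset field would then separate watchdog and brownout resets from ordinary power-on resets.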

Health information

It is good practice to share the health status of the devices periodically. To save battery, it is better to append health information to the telemetry data than to send it as a separate packet. This data could be:

Battery-related data, such as Voltage and Charging or Discharging Current
Communication-related data, such as Signal Strength, Connection Type (2G or LTE-M)
The internal temperature of the device, if present
Error counters, such as communication failures with the I2C temperature sensor

Any issues related to battery power, RF connectivity, and others can be identified based on the data present on the back-end.
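The health fields listed above could be carried alongside the temperature reading in a single payload. The following is a sketch only: the field names, units, sign convention for current, and JSON format are assumptions, not a prescribed layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { CONN_2G, CONN_LTE_M } conn_type_t;

/* Health information appended to every telemetry packet. */
typedef struct {
    uint16_t    battery_mv;       /* battery voltage in millivolts */
    int16_t     battery_ma;       /* >0 charging, <0 discharging (assumed) */
    int8_t      rssi_dbm;         /* signal strength */
    conn_type_t conn_type;        /* 2G or LTE-M */
    int8_t      internal_temp_c;  /* internal temperature, if present */
    uint16_t    sensor_errors;    /* e.g. failed sensor reads */
} health_info_t;

typedef struct {
    int32_t       temperature_mc; /* measured temperature, milli-degrees C */
    health_info_t health;
} telemetry_t;

/* Saturating error counter: it stops at the maximum instead of wrapping. */
void health_count_sensor_error(health_info_t *h)
{
    if (h->sensor_errors < UINT16_MAX) {
        h->sensor_errors++;
    }
}

static const char *conn_name(conn_type_t c)
{
    return c == CONN_LTE_M ? "LTE-M" : "2G";
}

/* Serializes telemetry and health data into one packet.
 * Returns the number of characters written, or -1 if buf is too small. */
int telemetry_format(const telemetry_t *t, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "{\"temp_mc\":%ld,\"batt_mv\":%u,\"batt_ma\":%d,"
                     "\"rssi\":%d,\"conn\":\"%s\",\"int_temp\":%d,"
                     "\"sensor_err\":%u}",
                     (long)t->temperature_mc,
                     (unsigned)t->health.battery_mv,
                     (int)t->health.battery_ma,
                     (int)t->health.rssi_dbm,
                     conn_name(t->health.conn_type),
                     (int)t->health.internal_temp_c,
                     (unsigned)t->health.sensor_errors);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Sending one combined packet keeps the modem active for a single transmission per wakeup, which is the battery-saving point made above.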

Overcome device hang-ups

Watchdog and Crash Dump Creation

A watchdog is used in firmware to recover from unintended hang-ups or runaway code. These may be caused by a malfunction of connected hardware or by a firmware bug that was not discovered earlier. Typically, the watchdog is enabled at the very beginning of the firmware; some microcontrollers also allow the watchdog to be enabled at power-up through fuses or user flash settings. The timeout period of the watchdog timer is configurable, and the right value depends on the microcontroller's capability and how fast a reaction is required. For IoT applications, four or eight seconds would typically work. Once the watchdog is enabled and its parameters are set, the firmware clears the watchdog timer in every function or on every loop iteration.
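To illustrate the enable-then-clear pattern without real hardware, the sketch below models a watchdog on the host. All names are hypothetical, and on an actual microcontroller each function would be a write to the vendor's watchdog registers rather than an update to a struct.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a hardware watchdog (an assumption for illustration). */
typedef struct {
    uint32_t timeout_ms;   /* configured timeout period */
    uint32_t elapsed_ms;   /* time since the last clear */
    bool     enabled;
    bool     reset_fired;  /* latched once the timeout elapses */
} wdt_model_t;

void wdt_enable(wdt_model_t *w, uint32_t timeout_ms)
{
    w->timeout_ms  = timeout_ms;
    w->elapsed_ms  = 0;
    w->enabled     = true;
    w->reset_fired = false;
}

/* Clearing the watchdog restarts its countdown. */
void wdt_clear(wdt_model_t *w)
{
    w->elapsed_ms = 0;
}

/* Advances simulated time and fires the reset if nobody cleared in time. */
void wdt_tick(wdt_model_t *w, uint32_t ms)
{
    if (!w->enabled || w->reset_fired) {
        return;
    }
    w->elapsed_ms += ms;
    if (w->elapsed_ms >= w->timeout_ms) {
        w->reset_fired = true;
    }
}

/* Runs `passes` loop iterations that each take `work_ms`.
 * Returns true if the loop finished without a watchdog reset. */
bool run_loop(wdt_model_t *w, unsigned passes, uint32_t work_ms,
              bool clear_each_pass)
{
    for (unsigned i = 0; i < passes && !w->reset_fired; i++) {
        wdt_tick(w, work_ms);      /* the work done in this pass */
        if (clear_each_pass) {
            wdt_clear(w);          /* healthy code clears on every pass */
        }
    }
    return !w->reset_fired;
}
```

With an 8-second timeout, a loop of 1-second passes that clears every pass keeps running, while the same loop without the clear trips the modeled reset on the eighth pass.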

Although the watchdog resets the microcontroller to recover from the bug or issue, the real reason for the reset is not exposed. Thanks to the initial packet, device reboots can be seen in the back-end server data, but that data does not point to the cause of the issue. To overcome this, you can introduce a concept like a crash dump for microcontrollers.

Figure 3: Crash dump generation

Some microcontroller watchdogs provide an early warning interrupt, which is invoked before the actual watchdog reset. For example, depending on the register configuration, the watchdog timeout can be set to 8 seconds and the early warning interrupt to 4 seconds. In such a configuration, if the watchdog is not cleared for 4 seconds, the early warning interrupt handler executes. This handler triggers a Non-Maskable Interrupt (NMI) and stores stack information and trace buffer data in flash. The stored dump can be sent as part of the initial packet so that a remote developer can access the stack and trace buffer data. The NMI serves as the known entry point for restoring the stack data when debugging. The following data can be saved as part of this crash dump:

Stack – Based on the typical firmware flow, this could be a few hundred bytes to a few kilobytes. How much of the stack memory area is copied depends on the planned size of the crash dump payload.
Stack pointer – The address of the current stack pointer at the point of the crash is used during debugging to restore RAM data on the debug system.
Timestamp – If the device is equipped with an RTC, a timestamp helps identify when the crash dump was saved.
Micro trace buffer data – The Micro Trace Buffer (MTB) is a feature available on some Arm microcontrollers. It records information about the most recently executed instructions, including their addresses, in a local RAM buffer. This is a secondary method for pinpointing the exact instructions that were running just before the watchdog reset.
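One possible layout for such a crash dump record is sketched below. The structure, the sizes, and the magic value are assumptions chosen for illustration. The sketch only fills the record in RAM; writing it to flash from the early warning handler is device specific and is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRASH_DUMP_MAGIC   0xC0DEDA7Au  /* arbitrary marker for a valid record */
#define CRASH_STACK_BYTES  256u         /* planned stack portion of the payload */
#define CRASH_MTB_WORDS    16u          /* planned trace portion of the payload */

typedef struct {
    uint32_t magic;                     /* set last, marks a complete record */
    uint32_t timestamp_s;               /* RTC time of the crash, if available */
    uint32_t stack_pointer;             /* SP value at the point of crashing */
    uint32_t stack_len;                 /* valid bytes in stack[] */
    uint8_t  stack[CRASH_STACK_BYTES];  /* copy of the stack memory area */
    uint32_t mtb_len;                   /* valid words in mtb[] */
    uint32_t mtb[CRASH_MTB_WORDS];      /* copy of micro trace buffer data */
} crash_dump_t;

static size_t min_size(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Fills a crash dump from state captured in the early warning handler.
 * Input is truncated to the planned payload sizes. */
void crash_dump_capture(crash_dump_t *d, uint32_t sp_addr,
                        const uint8_t *stack_mem, size_t stack_avail,
                        const uint32_t *mtb, size_t mtb_words,
                        uint32_t now_s)
{
    memset(d, 0, sizeof *d);
    d->timestamp_s   = now_s;
    d->stack_pointer = sp_addr;
    d->stack_len     = (uint32_t)min_size(stack_avail, CRASH_STACK_BYTES);
    memcpy(d->stack, stack_mem, d->stack_len);
    d->mtb_len       = (uint32_t)min_size(mtb_words, CRASH_MTB_WORDS);
    memcpy(d->mtb, mtb, d->mtb_len * sizeof mtb[0]);
    d->magic         = CRASH_DUMP_MAGIC;
}

/* Checked after reboot to decide whether a dump should be sent. */
bool crash_dump_is_valid(const crash_dump_t *d)
{
    return d->magic == CRASH_DUMP_MAGIC &&
           d->stack_len <= CRASH_STACK_BYTES &&
           d->mtb_len <= CRASH_MTB_WORDS;
}
```

Setting the magic value last means a record interrupted halfway through is not mistaken for a complete dump, and an erased flash region (all 0xFF bytes) also fails the validity check.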

When a crash dump is received as part of the initial packet after the reset, the data can be analyzed using open-source tools such as OpenOCD and GDB, together with a microcontroller setup.

Global timeout

There are other tricks you can apply with the watchdog in cases where the device needs to restart because an associated hardware module has malfunctioned. Referring to Figure 2, we typically disable the watchdog just before executing the sleep instruction and enable it again after wakeup. If we know the typical duration of one data transmission cycle (read the temperature and send the data to the back-end) and add some buffer time, we can come up with a value within which the cycle should complete and the device should go to sleep. For example, assume a typical transmission cycle takes 60 seconds; with a buffer, the flow should complete within 5 minutes. If the device does not go to sleep within 5 minutes of wakeup, we can assume the code is running but not progressing through the flow, potentially waiting for some trigger to happen. This could be due to missing timeouts or some other untested condition in which the program stays at the same step while still signaling to the watchdog that the code is working properly.

If the functions for enabling, disabling, and clearing the watchdog timer are wrapped, a global timeout can be introduced that considers the overall flow. When the watchdog is enabled, note the current time. Each time the watchdog is cleared, the wrapper checks whether the program is still within the global timeout. If the timeout has elapsed, stop clearing the watchdog timer, which results in the device rebooting.
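The wrapped watchdog functions could look roughly like the sketch below. The hw_wdt_* functions and platform_now_s() are hypothetical stand-ins for the vendor's watchdog driver and an RTC or tick counter, stubbed here so the logic can run on a host. The 5-minute limit comes from the example above.

```c
#include <stdbool.h>
#include <stdint.h>

#define GLOBAL_TIMEOUT_S (5u * 60u)  /* one wake cycle must finish in 5 minutes */

/* Hypothetical platform hooks, stubbed for host-side experimentation. */
static uint32_t fake_now_s;          /* stands in for an RTC or tick counter */
static unsigned hw_clear_count;      /* counts clears that reached "hardware" */

static uint32_t platform_now_s(void) { return fake_now_s; }
static void hw_wdt_enable(void)      { /* vendor-specific register write */ }
static void hw_wdt_disable(void)     { /* vendor-specific register write */ }
static void hw_wdt_clear(void)       { hw_clear_count++; }

static uint32_t cycle_start_s;       /* time at which the watchdog was enabled */

/* Called after wakeup: notes the start time and enables the watchdog. */
void wdt_enable(void)
{
    cycle_start_s = platform_now_s();
    hw_wdt_enable();
}

/* Called just before the sleep instruction. */
void wdt_disable(void)
{
    hw_wdt_disable();
}

/* Called wherever the firmware would normally clear the watchdog.
 * Returns false once the global timeout has elapsed; from then on the
 * hardware watchdog is no longer cleared and will reset the device. */
bool wdt_clear(void)
{
    /* Unsigned subtraction stays correct if the counter wraps around. */
    uint32_t elapsed_s = platform_now_s() - cycle_start_s;

    if (elapsed_s >= GLOBAL_TIMEOUT_S) {
        return false;
    }
    hw_wdt_clear();
    return true;
}
```

Because every clear goes through the wrapper, a loop that keeps clearing the watchdog while waiting forever on a peripheral is still cut off at the global limit, without adding a timeout to each wait individually.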

Reset individual submodules at startup

Another good practice is to reset any attached sensors and modules at startup. This ensures the associated peripherals are in a known state from the beginning. For example, the LTE modem in the design can be turned off at the start and then taken through its typical switch-on process. This helps revive the device if it was restarted because a connected submodule failed.
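A sketch of this off-then-on sequence for the modem is shown below, using a simulated modem in place of real power-control and status lines. The function names, delays, and poll counts are assumptions; an actual modem's datasheet defines its own power sequencing and timing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated modem standing in for real power-control and status pins. */
typedef struct {
    bool     powered;
    bool     hung;            /* left in a bad state by a previous run */
    unsigned polls_to_ready;  /* status polls needed before it reports ready */
    unsigned power_cycles;    /* number of switch-on sequences performed */
} sim_modem_t;

static void modem_power_off(sim_modem_t *m)
{
    m->powered = false;
    m->hung    = false;       /* removing power clears the stuck state */
}

static void modem_power_on(sim_modem_t *m)
{
    m->powered        = true;
    m->polls_to_ready = 3;    /* assumed boot time, in status polls */
    m->power_cycles++;
}

static bool modem_is_ready(sim_modem_t *m)
{
    if (!m->powered || m->hung) {
        return false;
    }
    if (m->polls_to_ready > 0) {
        m->polls_to_ready--;
        return false;
    }
    return true;
}

/* Placeholder for a real delay; the simulation does not need to wait. */
static void delay_ms(uint32_t ms)
{
    (void)ms;
}

/* Always power the modem off first, then run the normal switch-on
 * process and wait for readiness with a bounded number of polls. */
bool modem_startup_reset(sim_modem_t *m, unsigned max_polls)
{
    modem_power_off(m);
    delay_ms(500);            /* assumed settle time */
    modem_power_on(m);

    for (unsigned i = 0; i < max_polls; i++) {
        if (modem_is_ready(m)) {
            return true;
        }
        delay_ms(1000);       /* assumed poll interval */
    }
    return false;             /* caller decides how to handle a dead modem */
}
```

Bounding the readiness wait matters here as well: an unbounded wait on a dead modem would be exactly the kind of stall the global timeout is meant to catch.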

The Savior in the field

Over-the-air updates

Last but not least, over-the-air updates. An over-the-air update is a feature that lets you push new firmware to the device remotely, securely, and reliably. Once you have identified bugs or found a new way to extend battery life, you need to push the new firmware binary to a device that is miles and miles away from you.
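One small part of the reliability side can be sketched as an integrity check on the downloaded image before it is applied. The header layout and function names below are assumptions. The check uses the standard CRC-32 (reflected polynomial 0xEDB88320), which detects corruption in transit but is not a security measure; authenticating the image, for example with a cryptographic signature, would be a separate step that is not shown.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 as used by Ethernet and zlib. Slow but table-free,
 * which can suit small microcontrollers. */
uint32_t crc32_ieee(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;
}

/* Hypothetical metadata delivered alongside the firmware image. */
typedef struct {
    uint32_t image_len;  /* expected image size in bytes */
    uint32_t image_crc;  /* expected CRC-32 of the image */
} ota_header_t;

/* Returns true only if the received image matches the header. The
 * bootloader would switch to the new image only after this passes. */
bool ota_image_ok(const ota_header_t *hdr, const uint8_t *image,
                  size_t received_len)
{
    if (received_len != hdr->image_len) {
        return false;    /* truncated or oversized download */
    }
    return crc32_ieee(image, received_len) == hdr->image_crc;
}
```

Rejecting a bad image before it is applied matters more for a remote device than for one on the bench, because a device that boots into corrupted firmware may no longer be reachable for another update.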

Conclusion

IoT devices, especially those deployed to remote locations, should have mechanisms to share their health status, to overcome device hang-ups and resume operation, and to update firmware remotely.

Article 9 Feb 2023 6 min read