TLDR

Here’s the code you probably want. Modify it as you see fit.

# Copyright 2022 Connor Rigby
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
defmodule MyFirmware.Validator do
  @moduledoc """
  Validates the currently running firmware as soon as
  the device connects to NervesHub. This is implemented
  via setting a callback in `:heart`.

  Validation is implemented by polling certain functions,
  namely: `NervesHubLink.connected?()`. It is given
  5 minutes to connect. If it does not connect, the `:heart`
  module will reboot the device via `nerves_heart`.

  All the code in this module must be **VERY SAFE** a crash
  will cause the device to reboot.
  """

  use GenServer
  require Logger

  # 5 minutes
  @nerves_hub_timeout_ms 300_000

  # shoudl be started in a supervisor spec
  @doc false
  def start_link(args, opts \\ [name: __MODULE__]) do
    GenServer.start_link(__MODULE__, args, opts)
  end

  @impl GenServer
  def terminate(_, _) do
    :heart.clear_callback()
  end

  @doc """
  This is the `:heart` callback entrypoint 
  """
  def heart(pid \\ __MODULE__) do
    safe_call(pid, :heart)
  end

  def safe_call(pid, call) when is_pid(pid) do
    if Process.alive?(pid) do
      try do
        GenServer.call(pid, call)
      catch
        type, error -> {:error, {type, error}}
      end
    else
      {:error, :not_alive}
    end
  end

  def safe_call(server, call) when is_atom(server) do
    if pid = Process.whereis(server) do
      safe_call(pid, call)
    else
      {:error, :no_pid}
    end
  end

  def safe_call(unknown, _call) do
    {:error, {:unknown, unknown}}
  end

  @impl GenServer
  def init(args) do
    nerves_hub_timeout = Keyword.get(args, :nerves_hub_timeout, @nerves_hub_timeout_ms)
    nerves_hub_timeout_timer = Process.send_after(self(), :nerves_hub_timeout, nerves_hub_timeout)

    # Add other timers here in the same format

    {:ok,
     %{
       timers: %{
         nerves_hub_timeout: nerves_hub_timeout_timer,
       }
     }}
  end

  @impl GenServer
  def handle_call(:heart, _from, state) do
    timers =
      Map.new(state.timers, fn
        {name, :ok} -> {name, :ok}
        {name, timer} when is_reference(timer) -> evaluate_timer(name, timer)
        {name, value} -> {name, value}
      end)

    state = %{state | timers: timers}

    failed =
      Enum.any?(timers, fn
        {_name, true} -> true
        {_name, _result} -> false
      end)

    if failed do
      Logger.error("Heart callback failed. Firmware will revert soon")
      {:reply, :fail, state}
    else
      # all checks passed
      {:reply, :ok, state}
    end
  end

  @impl GenServer
  def handle_info(:initialize_heart, state) do
    :heart.set_callback(__MODULE__, :heart)
    {:noreply, state}
  end

  def handle_info(:nerves_hub_timeout, state) do
    Logger.warn("Timeout connecting to NervesHub. Firmware should not be considered valid")
    {:noreply, %{state | timers: %{state.timers | nerves_hub_timeout: true}}}
  end

  # Timer already expired
  def evaluate_timer(name, true) do
    {name, true}
  end

  def evaluate_timer(:nerves_hub_timeout, timer) do
    try do
      if NervesHubLink.connected?() do
        Process.cancel_timer(timer)
        # this is what we've all been waiting for!
        Nerves.Runtime.validate_firmware()
        {:nerves_hub_timeout, :ok}
      else
        {:nerves_hub_timeout, timer}
      end
    catch
      type, error ->
        Logger.error("Failed to check nerves_hub_timeout: #{inspect({type, error})}")
        {:nerves_hub_timeout, timer}
    end
  end
end

Why, When and How

With Nerves, you get this fancy A/B partition scheme. You can think of it as analogous to blue/green deploys of web applications. How this works internally is subject for another post as it differes per device. In the case of this post, all you will need to know is that if we don’t call a special function, upon the next reboot, the device will revert to it’s previous firmware.

Why have a system to auto revert firmware?

To start out, it may be useful to understand why this setup exists. Imagine if you will, you have a fleet of devices in production. What they do is not important, but if you’re creative, you may pretend they do something cool. If you’re not creative, just assume that a broken firmware means you have to personally go out and fix any device personally. This is your motivation.

The general idea is that if your device is online, and able to download a new update, it’s in a “valid” state. Say the device is on firmware A. It was the first version of the firmware you wrote. It has bugs, but those aren’t important as you can just fix them with an update. Firmware A is good enough to get you connected to a central Firmware Update Server. (say for example Nerves Hub) Since this was the first firmware, it’s automatically considered valid. Now that firmware A is deployed to your fleet of devices, you really don’t want an update to break them. This is where the auto revert system comes in. When you finally get around to fixing those bugs, you can use the Firmware Update Server to dispatch your update to the devices, but you want to be really sure that they are at least as not broken as they started out before the update.

When an update is downloaded, it will be applied to the B partition, and the device will attempt to boot from that partition after the update completes. When it does, there are some conditions that need to be met before considering the new firmware as valid.

When is a firmware considered valid?

The short answer is of course it depends.

The short answer that is probably most useful to you is that if your devices can receive further updates, it’s what i like to call valid enough.

The long answer is as follows:

You ultimately need to decide what makes your firmware valid. The code provided in the above example simply assumes that connecting to NervesHub is what makes it valid. Your use case will probably differ depending on what the device does. For example, some common other checks include connecting to your own networks, APIs etc.

If your device connects to your Firmware Update Server, but doesn’t perform it’s core functionality, maybe that shouldn’t be considered valid.

How to validate a firmware?

Naturally, the answer to this question is it depends yet again. However, the example above is of course already implemented, so that’s how you’re gonna do it. The point here is that this is not the only way to validate a firmware. It’s just one I and at least a couple other production projects work.

The main system we will be working with here is called heart. It’s an underappreciated system in the Erlang Runtime System with almost no documentation. (as is customary for the most useful parts of ERTS)

What you need to know is that there’s a module called :heart that gets started very early in the boot process. Nerves implements a custom process (source) to keep :heart and your devices watchdog in sync. This means that if Erlang (read: your firmware) or the device watchdog becomes unresponsive, the device will reboot. The special part about that, is that if your firmware was not validated, the reboot will revert back to the last valid firmware, protecting you, the developer from having to fix devices manually.

So how do you use it? there are a couple functions you will need to know about. The glue between them is really up to you, but the example at the beginning provides a basic implementation you can use and modify to suit your own needs.

The first useful function is :heart.set_callback/2:

:heart.set_callback(SomeModuleThatKnowsHowToValidateFirmware, :function_to_call)

This callback will be called every HEART_BEAT_TIMEOUT. By default this is once every 60 seconds.

The other useful function you will need is Nerves.Runtime.validate_fw/0:

Nerves.Runtime.validate_fw()

In the above example, we wrap both of these functions up inside a GenServer process. this process will be started during our firmware’s application supervision tree startup. I put it at the very end so that firmware can only be validated if everything else is “up and running” whatever that means for the application. That process schedules some timers that once expired will consider the firmware “invalid”. The whole trick here is that your device will not be connected immediately since the network takes time to come up. The timer essentially says that

upon a reboot, if the device hasn't connected to the firmware update server in
the allowed amount of time (5 minutes in this case), the firmware should be reverted.

The other thing to note here is that any crash, exception, error etc will be considered a failure. (and cause a reboot) This means you should think about how the process interacts and introspects the rest of the system.

Conclusion

Hopefully this at least gets you thinking about how to recover from failure before you end up failing with no escape route.

Deploying firmware to production devices has quite a few things like this that you may not even be considering. Stay tuned for more on deploying your firmware to production