OTA updates are where embedded engineering meets reality: unreliable networks, power loss, flash wear, and security threats. Done well, OTA lets you ship fixes and features for years. Done poorly, it creates fleets of bricked devices and “works on my bench” firmware you can’t safely evolve. This guide focuses on the patterns that keep OTA updates recoverable, verifiable, and repeatable: versioning, rollback, atomic installs, staged rollouts, and the minimum security chain that prevents accidental (or malicious) installs.
Quickstart
If you only implement a few things, implement these. They’re the highest-leverage choices that separate “OTA demo” from “OTA you can trust in the field”.
1) Pick a brick-proof update strategy
Most bricking happens when you overwrite the only bootable firmware and lose power mid-write. Avoid that by making updates non-destructive.
- Best default: A/B slots (two firmware banks) + boot flags
- Keep a minimal bootloader + recovery path separate from app firmware
- Never mark “new firmware” as permanent until it successfully boots
2) Verify every update (always)
Transport security (TLS) is not enough. You want end-to-end assurance that the bytes on flash are exactly what you shipped.
- Require a signed manifest (device verifies signature)
- Hash the payload (SHA-256) and verify after download
- Bind updates to a device model/hardware revision to avoid “wrong target” installs
3) Add rollback + “trial boot”
Assume something will go wrong: corrupted downloads, incompatible config migrations, or unexpected hardware variants. Rollback turns failures into recovery.
- Boot into new firmware as trial (not confirmed)
- Confirm only after health checks pass (e.g., Wi-Fi + sensors + watchdog)
- On crash/reboot loops, automatically revert to last known-good slot
4) Ship like a backend team: staged rollouts
OTA is deployment. Treat it like one: canary, ramp, observe, then widen.
- Release to 1–5% first (canary) and watch failure rates
- Roll forward or halt by server policy (no new firmware needed)
- Keep telemetry small: boot success, version, update reason codes
If your OTA system can’t recover from power loss during install and bad firmware after install, it’s not production-ready—no matter how good the UI looks.
Overview
“OTA Updates for IoT: The Safe Way to Ship Firmware” is really about one thing: making firmware upgrades boring—predictable, reversible, and auditable. The hard part isn’t downloading a file over HTTP. The hard part is the edge cases: dead batteries, flaky LTE, flash wear, partial writes, incompatible config migrations, and attackers trying to install modified firmware.
What you’ll build (mentally)
- A minimal OTA architecture: cloud manifest → device agent → bootloader decision
- A versioning scheme that supports rollback, reproducibility, and fleet targeting
- An atomic install workflow that is resilient to interruptions
- Practical rollout controls: canary, pause, enforce minimum versions, and “blocklist” bad builds
| Pattern | What it solves | Tradeoffs |
|---|---|---|
| A/B slots (dual-bank) | Power-loss safe installs + automatic rollback | Needs extra flash for a second image |
| Single-slot + recovery | Lower flash usage, simpler layout | Harder to make truly interruption-safe; recovery must be solid |
| Delta updates (patches) | Lower bandwidth and faster downloads | More complexity; needs careful validation and fallback plan |
| Signed manifest | Prevents modified/wrong-target firmware installs | Key management is real work (but worth it) |
Design OTA by listing the ways it can fail (power, network, bad image, wrong image, attacker, storage full), then making sure every failure leads to a safe state.
Core concepts
1) OTA architecture (the three actors)
A reliable OTA system has three moving parts. Each has a clear job, and none should “trust” the others blindly:
Update server
- Publishes a manifest describing what to install
- Targets devices by model/region/channel
- Controls rollout (canary, ramp, pause, blocklist)
Device update agent
- Downloads manifest + firmware payload
- Verifies signature and hash
- Writes to the inactive slot (or staging area)
- Records state for resume after reboot
Bootloader
The bootloader is the “judge”. It decides what to boot, and it’s the only component that must remain trustworthy even when application firmware is broken.
- Selects slot (A or B) based on flags and health
- Supports trial boots and rollback
- Optionally enforces secure boot (verify signatures before boot)
2) Atomic updates: “don’t destroy the last good thing”
“Atomic” means the device is always in one of two states: old firmware fully intact or new firmware fully installed. Anything in-between must be recoverable after power loss.
Overwriting the currently running firmware (single-slot) without a hardened recovery path is the easiest way to brick devices. If you can afford the flash, A/B is usually the simplest safe answer.
3) Versioning: device reality beats semantic purity
Firmware versions exist for three practical reasons: targeting, debugging, and rollback/compatibility. You can use semantic versioning for human readability, but devices also need a monotonic number to compare versions safely.
A practical version tuple
- human: 1.8.2 (semantic version)
- build: 2026.01.09+sha.abc123 (traceability)
- monotonic: rollback_index = 10802 (for comparisons + anti-rollback)
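The monotonic element of the tuple is what makes comparisons safe to automate. A minimal sketch (function names are illustrative, not from any particular SDK): the device persists the highest rollback_index it has ever confirmed and only accepts manifests that strictly exceed it.

```c
#include <stdint.h>
#include <stdbool.h>

/* Monotonic anti-rollback check (illustrative). The device persists the
 * highest rollback_index it has ever confirmed; an incoming manifest is
 * accepted only if its index is strictly greater. (Whether "equal" means
 * reinstall-allowed is a policy choice; here it is rejected.) */
bool update_allowed(uint32_t stored_index, uint32_t manifest_index) {
    return manifest_index > stored_index;
}

/* After a successful confirm, advance the stored index (never decrease it). */
uint32_t advance_rollback_index(uint32_t stored_index, uint32_t confirmed_index) {
    return confirmed_index > stored_index ? confirmed_index : stored_index;
}
```

Note the asymmetry: anti-rollback applies to what the device will *install*, not to the bootloader's automatic revert to the still-resident known-good slot.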
4) Rollback: “trial boot” + confirmation
Rollback isn’t “install old firmware manually.” It’s a boot policy: new firmware boots in trial mode; it must confirm itself; otherwise the bootloader reverts automatically.
What “confirm” should mean
- Boot completes within a time budget
- Critical peripherals initialize (radio, storage, sensors)
- Device can reach the server (or passes offline health rules)
- Watchdog remains happy under normal operation
What triggers rollback
- Boot loop or repeated resets
- Explicit “fail” flag from firmware
- No confirmation after N boots / N minutes
- Integrity check fails at boot (hash/signature)
5) Security chain: TLS is not the whole story
You need two layers: transport security (protect data in flight) and update authenticity (prove the firmware is yours). Authenticity is typically done by signing a manifest (or the image) with a private key; the device verifies using a public key stored in a protected location (ideally in ROM, secure element, or the bootloader region).
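One concrete piece of that chain is key pinning with rotation support. A minimal sketch, assuming the manifest carries a key_id as in the example later in this guide (the key_id strings and key bytes here are placeholders; the actual signature verification is done by your crypto library):

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>

/* Pinned public keys, stored in a protected region (ROM, secure element, or
 * bootloader flash). The manifest's key_id selects which key to verify
 * against; unknown IDs are rejected outright. Key bytes are placeholders. */
typedef struct {
    const char *key_id;
    uint8_t     pubkey[32];   /* e.g., an Ed25519 public key */
} pinned_key_t;

static const pinned_key_t PINNED_KEYS[] = {
    { "prod-2026-01", { 0x01 /* ... */ } },
    { "prod-2025-07", { 0x02 /* ... */ } },  /* old key kept during rotation */
};

/* Returns the pinned key for key_id, or NULL (meaning: refuse the update). */
const uint8_t *lookup_pinned_key(const char *key_id) {
    for (size_t i = 0; i < sizeof(PINNED_KEYS) / sizeof(PINNED_KEYS[0]); i++) {
        if (strcmp(PINNED_KEYS[i].key_id, key_id) == 0)
            return PINNED_KEYS[i].pubkey;
    }
    return NULL;
}
```

Keeping the previous key in the table during a rotation window is what lets old devices in the field still verify updates signed before the cutover.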
Step-by-step
This is a practical, production-minded path you can adapt to almost any MCU/SoC and RTOS. The goal is not to copy a “one true OTA implementation”, but to adopt the invariants that make OTA safe.
Step 1 — Define constraints and failure modes
- Power: can the device lose power anytime? battery threshold to allow install?
- Network: LTE/LoRa/Wi-Fi? intermittent connectivity? data caps?
- Storage: enough flash for A/B? external flash available?
- Risk: what does failure cost? (safety, SLA, truck rolls)
- Security: do you require secure boot? key storage approach?
Step 2 — Choose flash layout (A/B is the safest default)
If you can afford it, use a dual-bank layout: keep two application images (slot A and slot B), plus a small bootloader region that never changes (or changes rarely and very carefully).
| Region | Purpose | Notes |
|---|---|---|
| Bootloader | Selects slot, verifies integrity, handles rollback | Keep small, simple, and well-tested |
| Slot A | Firmware image (known-good) | Never overwrite during update |
| Slot B | Firmware image (staging/new) | Write new image here, then switch boot |
| State/flags | Boot counters, confirm flag, rollback reason | Store redundantly (two copies + CRC) if possible |
| Config/data | Device configuration and user data | Keep separate from firmware; version your config schema |
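The "store redundantly (two copies + CRC)" note in the table deserves a sketch, because a corrupted state region is itself a brick risk. One minimal approach (illustrative; struct layout and names are assumptions): alternate writes between two copies, each stamped with a sequence number and CRC, so a power cut mid-write can only lose the newest write, never both.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Redundant state record: two copies, each with a sequence number and CRC.
 * Writes alternate between copies; loads pick the valid copy with the
 * highest sequence number. */
typedef struct {
    uint32_t seq;
    uint8_t  payload[16];  /* boot flags, counters, ... */
    uint32_t crc;          /* CRC over seq + payload */
} state_copy_t;

/* Software CRC-32 (reflected, poly 0xEDB88320). */
static uint32_t crc32_sw(const uint8_t *p, size_t n) {
    uint32_t c = 0xFFFFFFFFu;
    while (n--) {
        c ^= *p++;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ ((c & 1u) ? 0xEDB88320u : 0u);
    }
    return ~c;
}

void state_write(state_copy_t copies[2], const uint8_t payload[16], uint32_t new_seq) {
    state_copy_t *dst = &copies[new_seq & 1];  /* alternate between the two copies */
    dst->seq = new_seq;
    memcpy(dst->payload, payload, 16);
    dst->crc = crc32_sw((const uint8_t *)dst, offsetof(state_copy_t, crc));
}

/* Returns the freshest valid copy, or NULL if both are corrupt. */
const state_copy_t *state_load(const state_copy_t copies[2]) {
    const state_copy_t *best = NULL;
    for (int i = 0; i < 2; i++) {
        uint32_t crc = crc32_sw((const uint8_t *)&copies[i], offsetof(state_copy_t, crc));
        if (crc == copies[i].crc && (!best || copies[i].seq > best->seq))
            best = &copies[i];
    }
    return best;
}
```

On real flash you would map each copy to its own erase sector so erasing one copy never endangers the other.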
Step 3 — Implement boot policy (trial boot + auto-rollback)
The bootloader should be able to answer: “Which slot do I boot now, and when do I revert?”
Below is a minimal slot-selection state machine. It assumes the application calls boot_confirm() only after health checks pass.
/* Minimal A/B boot policy (illustrative). Keep real bootloaders tiny + audited. */
#include <stdint.h>

typedef enum { SLOT_A = 0, SLOT_B = 1 } slot_t;

typedef struct {
    slot_t   active;          // currently preferred slot
    slot_t   pending;         // slot requested for trial boot
    uint8_t  trial_remaining; // allowed trial boots before rollback
    uint32_t active_crc;      // integrity for this struct
} boot_state_t;

/* Platform-specific helpers (flash I/O, CRC, reset) -- provided elsewhere. */
boot_state_t load_boot_state_with_crc(void);
void save_boot_state(boot_state_t s);
int  crc_ok(boot_state_t s);
uint32_t crc32(const void *p, uint32_t len);
void reboot(void);

static boot_state_t st;  /* loaded at startup via load_boot_state_with_crc() */

/* Called by update agent after writing/validating the new image in the inactive slot. */
void request_trial(slot_t new_slot) {
    st.pending = new_slot;
    st.trial_remaining = 2; // e.g., allow 2 attempts
    st.active_crc = crc32(&st, sizeof(st) - sizeof(st.active_crc));
    save_boot_state(st);
    reboot();
}

/* Bootloader entry: decide which slot to boot. */
slot_t select_slot(void) {
    st = load_boot_state_with_crc();
    if (!crc_ok(st)) {
        // State corrupted: fail safe and boot the default known-good slot
        // (choose a policy that fits your device).
        return SLOT_A;
    }
    if (st.trial_remaining > 0) {
        st.trial_remaining--;
        st.active_crc = crc32(&st, sizeof(st) - sizeof(st.active_crc)); // keep CRC valid
        save_boot_state(st);
        return st.pending; // trial boot
    }
    return st.active; // confirmed slot
}

/* Called by application after successful boot + health checks. */
void boot_confirm(slot_t running) {
    st.active = running;
    st.pending = running;
    st.trial_remaining = 0;
    st.active_crc = crc32(&st, sizeof(st) - sizeof(st.active_crc));
    save_boot_state(st);
}

/* Optional: application can explicitly fail to force rollback on next boot. */
void boot_fail(void) {
    st.trial_remaining = 0; // stop trial attempts
    st.active_crc = crc32(&st, sizeof(st) - sizeof(st.active_crc));
    save_boot_state(st);
    reboot();
}
The update writes to the inactive slot. If power fails mid-write, the active slot still boots. If the new firmware boots but misbehaves, it never gets confirmed and the bootloader reverts.
Step 4 — Define a signed manifest (device verifies before install)
The manifest is the “contract” between server and device: what version this is, which hardware it targets, where to download it, and how to verify it. Keep it explicit so the device can refuse unsafe installs.
version: "1.8.2"
build: "2026-01-09+sha.abc123"
product: "unilab-sensor-node"
hw_compat:
- "revA"
- "revB"
channel: "stable"
rollback_index: 10802 # monotonic, used for comparisons/anti-rollback policies
min_bootloader: 3 # refuse if bootloader is too old for this image
payload:
url: "https://updates.example.com/unilab-sensor-node/1.8.2/firmware.bin"
size: 524288
sha256: "2f2c0d2a3b8f0b4a7c1e8a5c0a2d9e6e1b4a8d0d2b7e8c4f9a0b1c2d3e4f5a6b"
slot: "inactive" # A/B devices write to the inactive slot
policy:
canary_percent: 5
install_window_utc: ["01:00", "05:00"]
battery_min_percent: 30
require_mains_power: false
signature:
alg: "ed25519"
key_id: "prod-2026-01"
sig_b64: "BASE64_SIGNATURE_OVER_CANONICAL_MANIFEST"
Two practical tips that prevent nasty surprises:
- Canonicalization: define exactly how the manifest is serialized for signing (field order, whitespace, encoding).
- Key rotation: include a key_id so devices know which public key to use; plan how you'll rotate keys safely.
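The compatibility fields are what let the device "refuse unsafe installs." A minimal sketch of that check, with field names mirroring the manifest above (struct shapes and names are illustrative; adapt to whatever parser you use):

```c
#include <string.h>
#include <stdint.h>
#include <stdbool.h>

/* Parsed compatibility fields from the manifest. */
typedef struct {
    const char  *product;
    const char **hw_compat;     /* NULL-terminated list of supported revisions */
    uint32_t     min_bootloader;
} manifest_compat_t;

/* The device's own identity, baked in at manufacturing. */
typedef struct {
    const char *product;
    const char *hw_rev;
    uint32_t    bootloader_version;
} device_identity_t;

bool manifest_compatible(const manifest_compat_t *m, const device_identity_t *d) {
    if (strcmp(m->product, d->product) != 0)
        return false;                            /* wrong product */
    if (d->bootloader_version < m->min_bootloader)
        return false;                            /* bootloader too old for this image */
    for (const char **rev = m->hw_compat; *rev != NULL; rev++)
        if (strcmp(*rev, d->hw_rev) == 0)
            return true;                         /* hardware revision is listed */
    return false;                                /* hardware revision not listed */
}
```

Run this check before the download, not just before the write: it saves bandwidth and keeps "wrong target" failures out of your install-stage telemetry.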
Step 5 — Implement the device update agent (resume, verify, switch)
The update agent runs in the application (or a privileged service). Its job is to make updates resilient and measurable: resume downloads, verify integrity, write to staging, and request a trial boot.
Agent workflow (high-level)
- Check for update (manifest)
- Validate signature + compatibility
- Download payload in chunks (resume supported)
- Verify hash after download
- Write to inactive slot (or staging)
- Request trial boot + report status
What to persist across reboots
- Current step (download/write/verify)
- Bytes downloaded + chunk hashes (optional)
- Target version + slot
- Last failure reason code
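The persisted items above can be collapsed into one small struct. A minimal sketch (enum and field names are illustrative): save it after every state transition, and on boot consult it to decide where to resume.

```c
#include <stdint.h>

/* Persisted update-agent state (illustrative). Written to flash after every
 * transition so an interrupted update resumes instead of restarting. */
typedef enum {
    OTA_IDLE = 0,
    OTA_DOWNLOADING,
    OTA_WRITTEN,
    OTA_VERIFIED,
    OTA_TRIAL_REQUESTED,
} ota_step_t;

typedef struct {
    ota_step_t step;
    uint32_t   bytes_downloaded;  /* resume offset for an HTTP Range request */
    uint32_t   target_slot;       /* 0 = A, 1 = B */
    uint32_t   last_error;        /* reason code for telemetry */
} ota_state_t;

/* Decide where to resume after an unexpected reboot. */
uint32_t resume_offset(const ota_state_t *st) {
    if (st->step == OTA_DOWNLOADING)
        return st->bytes_downloaded;  /* continue the partial download */
    return 0;                         /* other steps restart their own phase */
}
```

Pairing this with per-chunk hashes lets you also detect a corrupted partial download instead of blindly appending to it.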
Here’s a simple “check + download + verify” flow you can adapt for test rigs or CI devices. The important part isn’t the tooling—it’s that verification happens before you switch slots.
#!/usr/bin/env bash
set -euo pipefail
MANIFEST_URL="https://updates.example.com/unilab-sensor-node/latest/manifest.json"
WORKDIR="/var/lib/ota"
mkdir -p "$WORKDIR"
echo "[ota] Fetching manifest..."
curl -fsSL "$MANIFEST_URL" -o "$WORKDIR/manifest.json"
# Example: verify signature using a pinned public key (details depend on your crypto tooling).
# In production, this should run on-device and reject unsigned/unknown keys.
echo "[ota] Verifying manifest signature..."
python3 - <<'PY'
import json
# Placeholder: in real life you verify an Ed25519/ECDSA signature with a pinned public key.
# This script demonstrates the *shape* of the workflow.
m = json.load(open("/var/lib/ota/manifest.json"))
assert "payload" in m and "sha256" in m["payload"]
print("[ota] manifest has required fields")
PY
PAYLOAD_URL=$(python3 -c 'import json; print(json.load(open("/var/lib/ota/manifest.json"))["payload"]["url"])')
EXPECTED_SHA=$(python3 -c 'import json; print(json.load(open("/var/lib/ota/manifest.json"))["payload"]["sha256"])')
echo "[ota] Downloading payload..."
curl -fSL "$PAYLOAD_URL" -o "$WORKDIR/firmware.bin"
echo "[ota] Verifying SHA-256..."
ACTUAL_SHA=$(python3 -c 'import hashlib; d=open("/var/lib/ota/firmware.bin","rb").read(); print(hashlib.sha256(d).hexdigest())')
if [[ "$ACTUAL_SHA" != "$EXPECTED_SHA" ]]; then
echo "[ota] ERROR: hash mismatch (expected $EXPECTED_SHA, got $ACTUAL_SHA)" >&2
exit 2
fi
echo "[ota] OK: payload verified. Next step: write to inactive slot + request trial boot."
Step 6 — Rollout controls and telemetry (the “fleet safety net”)
OTA problems rarely show up in the first device you test. They show up when you deploy to thousands of devices across networks, temperatures, and hardware tolerances. Rollout control is how you keep a bug from becoming a disaster.
Rollout controls worth having
- Canary percentage + gradual ramp
- Per-hardware targeting (revA vs revB)
- Pause/stop rollout without shipping new firmware
- Blocklist a bad build (server refuses it)
- Install windows (avoid business hours / peak power risk)
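Canary percentages only work if device assignment is deterministic: the same device must land in the same bucket every time, so ramping from 5% to 20% only adds devices. One common sketch (FNV-1a is an arbitrary choice here; any stable hash of a stable device ID works):

```c
#include <stdint.h>
#include <stdbool.h>

/* Deterministic canary bucketing: hash a stable device ID into a 0-99
 * bucket, then compare against the rollout percentage. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;           /* FNV-1a 32-bit offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;                 /* FNV-1a 32-bit prime */
    }
    return h;
}

bool in_rollout(const char *device_id, uint32_t percent) {
    return (fnv1a(device_id) % 100u) < percent;
}
```

Run this server-side so devices never need a firmware change to alter rollout policy; mixing a per-release salt into the hash keeps the same devices from always being the canaries.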
Telemetry (keep it small, keep it useful)
- Current firmware version + rollback index
- Update state: downloaded / written / trial / confirmed
- Boot success and crash loop counters
- Failure reason codes (hash mismatch, low battery, timeout, incompatible)
- Optional: radio stats and download retries
Add an “OTA test matrix” to releases: power-cut during download, power-cut during write, corrupted payload, wrong hardware target, and forced reboot loops. If your system survives those, it will survive real life.
Common mistakes
These are the patterns behind “we bricked a few devices” and “we can’t reproduce what happened.” The fixes are usually architectural, not cosmetic.
Mistake 1 — Overwriting the only bootable image
Single-slot updates without a hardened recovery path are power-loss magnets.
- Fix: use A/B slots, or stage to external flash then swap only after verification.
- Fix: never “commit” the new image until it boots and confirms.
Mistake 2 — Trusting transport instead of authenticity
TLS protects transit, not your update supply chain. Devices still need to verify what they install.
- Fix: sign manifests/images; verify on device with a pinned public key.
- Fix: validate target model + hardware revision before download/write.
Mistake 3 — No resume/state machine
IoT networks drop. Without persistence, devices get stuck in “half updated” purgatory.
- Fix: store update state (step + offsets) and make operations idempotent.
- Fix: chunk downloads and verify after download (and optionally per chunk).
Mistake 4 — Config migrations that aren’t reversible
Firmware rolls back, but config stays “new” and breaks the old firmware.
- Fix: version your config schema and support backward-compatible reads.
- Fix: use a migration journal or “copy-on-write” config slots similar to A/B.
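A minimal sketch of the versioned-schema fix (struct names, fields, and defaults are illustrative): every stored blob starts with a schema version, newer firmware fills defaults when reading older blobs, and the blob is not rewritten in the new format until the new image is confirmed, so a rollback still finds a config the old firmware understands.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Two schema generations of the same config (illustrative). */
typedef struct { uint32_t version; uint32_t report_interval_s; } config_v1_t;
typedef struct { uint32_t version; uint32_t report_interval_s; uint32_t retry_limit; } config_v2_t;

/* v2 firmware reading a stored blob: accept v1 by filling defaults,
 * refuse unknown (including newer) schemas. */
bool load_config_v2(const void *blob, uint32_t blob_version, config_v2_t *out) {
    if (blob_version == 1) {
        const config_v1_t *v1 = (const config_v1_t *)blob;
        out->version = 2;
        out->report_interval_s = v1->report_interval_s;
        out->retry_limit = 3;            /* default for the field v1 lacked */
        return true;
    }
    if (blob_version == 2) {
        memcpy(out, blob, sizeof(*out));
        return true;
    }
    return false;                        /* unknown schema: refuse, don't guess */
}
```

The "copy-on-write config slots" variant applies the same A/B idea: migrate into the inactive config slot and switch pointers only on confirm.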
Mistake 5 — Rolling out to 100% immediately
Fleet-wide failures are rarely “one device only”. They are “one release” problems.
- Fix: canary to a small percentage and watch trial/confirm rates.
- Fix: add a server-side pause and a blocklist for bad builds.
Mistake 6 — Not logging why updates fail
“Update failed” isn’t actionable. You need reason codes to triage at scale.
- Fix: define a small set of error codes (hash mismatch, low battery, incompatible, timeout).
- Fix: report state transitions (downloaded → written → trial → confirmed).
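A minimal sketch of such a reason-code set (the codes and strings here are illustrative, not a standard): keep the enum small and stable across releases so dashboards can aggregate it, and map codes to short strings only at the reporting edge.

```c
#include <string.h>

/* Stable OTA failure reason codes (illustrative). Add new codes at the end;
 * never renumber, or historical telemetry becomes unreadable. */
typedef enum {
    OTA_OK = 0,
    OTA_ERR_HASH_MISMATCH = 1,
    OTA_ERR_SIG_INVALID   = 2,
    OTA_ERR_LOW_BATTERY   = 3,
    OTA_ERR_INCOMPATIBLE  = 4,
    OTA_ERR_TIMEOUT       = 5,
    OTA_ERR_NO_SPACE      = 6,
} ota_reason_t;

const char *ota_reason_str(ota_reason_t r) {
    switch (r) {
        case OTA_OK:                return "ok";
        case OTA_ERR_HASH_MISMATCH: return "hash_mismatch";
        case OTA_ERR_SIG_INVALID:   return "sig_invalid";
        case OTA_ERR_LOW_BATTERY:   return "low_battery";
        case OTA_ERR_INCOMPATIBLE:  return "incompatible";
        case OTA_ERR_TIMEOUT:       return "timeout";
        case OTA_ERR_NO_SPACE:      return "no_space";
        default:                    return "unknown";
    }
}
```

Report the code together with the state the device was in (downloaded, written, trial, confirmed) and you can triage most fleet failures without a serial console.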
Your OTA might be “secure” and “atomic” but still fail due to flash wear or brownouts under RF transmit. If installs fail mostly on older devices, investigate storage health and power margins.
FAQ
Do I really need A/B slots for OTA updates?
Not always—but if you can afford the flash, A/B is the simplest way to make OTA updates power-loss safe. Without A/B, you must build a rock-solid recovery mode (and ensure it can recover from interrupted installs), which is harder than it sounds.
What’s the minimum security I should implement?
At minimum: a signed manifest (or signed image) verified on the device using a pinned public key, plus a hash verification of the downloaded payload. TLS is strongly recommended for transport, but authenticity should not depend on TLS alone.
How do I prevent installing the wrong firmware (wrong model / wrong hardware rev)?
Include explicit compatibility fields in the manifest (product ID, hardware revisions, minimum bootloader version), and make the device refuse anything that doesn’t match. Don’t rely on filenames or directory paths as “targeting”.
What should trigger “confirm” after a trial boot?
Confirm only after the device proves it is healthy: successful boot, critical peripherals initialized, watchdog stable, and (if applicable) network connectivity established. If your device can operate offline, define offline health checks too.
Are delta updates worth it for IoT?
Delta updates can be worth it when bandwidth is expensive (cellular) or updates are frequent, but they increase complexity. If you adopt deltas, keep a “full image” escape hatch and verify the reconstructed image exactly like a normal payload.
How do I test OTA safely before shipping to customers?
Run an OTA failure matrix: power-cut during download, power-cut during write, corrupted payload, wrong target manifest, and forced reboot loops after install. If you can’t reliably recover in the lab, you won’t recover in the field.
How do I handle fleet rollout without babysitting it?
Use staged rollout: canary to a small percentage, automatically ramp when health metrics look good, and pause automatically when failure rates exceed a threshold. The server should be able to blocklist a build without shipping a new one.
Cheatsheet
A scan-fast checklist for brick-proof OTA updates you can tape to your monitor.
Device-side safety checklist
- A/B slots or an equivalent non-destructive staging mechanism
- Trial boot + confirm-on-health + auto-rollback
- Persist update state (resume after reboot)
- Verify signature (manifest/image) + verify payload hash
- Refuse wrong target (product + hardware rev + min bootloader)
- Install gating (battery threshold / install window)
- Reason codes for failures
Server-side rollout checklist
- Channels (dev / beta / stable) and device targeting rules
- Canary percentage + ramp plan
- Pause/stop rollout and blocklist builds
- Expose minimum required version (security fixes)
- Telemetry dashboard: trial rate, confirm rate, rollback reasons
- Artifact immutability (same URL always serves the same bytes)
Quick triage: when updates fail
| Symptom | Likely cause | First fix to try |
|---|---|---|
| Devices reboot loop after update | Bad firmware or incompatible config migration | Require confirmation after health checks; add rollback; make config backward-compatible |
| “Downloaded but can’t install” | Not enough space / wrong slot / write failures | Check flash layout, staging, and write verification; add storage health checks |
| Hash/signature mismatch | Corrupt download or wrong artifact served | Make artifacts immutable; add retry; verify server caching/CDN behavior |
| Only some devices fail (older units) | Flash wear, power margins, hardware variance | Collect reason codes; test power/flash health; adjust install gating (battery/voltage) |
Wrap-up
Safe OTA updates aren’t about fancy infrastructure—they’re about invariants: don’t destroy the last known-good image, verify what you install, treat new firmware as trial until proven healthy, and roll out slowly. If you implement A/B slots with trial/confirm/rollback, add signed manifests and hash verification, and ship with canary rollouts, you’re already ahead of most “first OTA” implementations.
Write down your OTA failure matrix (power-cut, corrupt payload, wrong target, reboot loop) and verify your system recovers from each. That single exercise usually reveals the missing pieces faster than any code review.
Want to go deeper? Pair OTA with the messaging and power fundamentals that make IoT systems reliable: MQTT for fleet communication, BLE/Wi-Fi realities, and power optimization for devices that can’t always stay awake.