Mobile Development · Networking

Mobile Networking: Retries, Timeouts, and Backoff Like a Pro

Stop ‘random’ failures with sane defaults and observability.

Reading time: ~8–12 min
Level: All levels

“Random networking failures” usually aren’t random. Mobile apps live on shaky ground: radio wakeups, captive portals, LTE→Wi-Fi handoffs, overloaded APIs, flaky DNS, and background execution limits. The fix is boring (in a good way): sane timeouts, safe retries, exponential backoff + jitter, and observability so you can prove what’s happening.


Quickstart

If you only have 30 minutes, do these. They reduce the “works on my Wi-Fi” gap and stop storms of retries that drain battery and melt your backend.

1) Add timeouts you can explain

Timeouts are not “aggressive” or “slow” — they are UX decisions. Pick values that match your app’s screens and user patience.

  • Connect timeout: 5–10s (mobile radios need a moment)
  • Read timeout: 15–30s (depends on payload size)
  • Overall request budget: cap per user action (e.g., 25–40s max including retries)
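On Android with OkHttp, all four of these knobs map directly onto the client builder. A minimal sketch, assuming OkHttp 4.x (the `Duration` overloads require Java 8+ / Android API 26+); the specific values are illustrative picks from the ranges above:

```kotlin
import okhttp3.OkHttpClient
import java.time.Duration

// Illustrative values; tune per endpoint category.
val client = OkHttpClient.Builder()
  .connectTimeout(Duration.ofSeconds(8))   // connect: 5–10s
  .readTimeout(Duration.ofSeconds(25))     // read: 15–30s
  .writeTimeout(Duration.ofSeconds(15))    // write: uploads
  .callTimeout(Duration.ofSeconds(35))     // overall budget for the entire call
  .build()
```

`callTimeout` is the one most teams forget: it bounds the whole call, so no combination of slow DNS, slow connect, and slow body reads can exceed your budget.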

2) Retry only what’s safe

Retrying the wrong request creates double charges, duplicate orders, and angry users.

  • Retry idempotent operations by default (GET/HEAD; some PUT/DELETE)
  • Retry network transport failures (timeouts, connection drops) and some server failures (5xx)
  • Do not retry non-idempotent POST unless you use an idempotency key

3) Use exponential backoff + jitter

Backoff prevents retry “thundering herds.” Jitter keeps thousands of devices from retrying in sync.

  • Start small (e.g., 250–500ms), then multiply (×2)
  • Add jitter (randomize delay ±20–50%)
  • Cap the delay (e.g., 8–15s) and cap attempts (e.g., 2–4 total tries)
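The three bullets above can be sketched as one pure function. Names and defaults here are illustrative, not from any particular library:

```kotlin
import kotlin.math.min
import kotlin.random.Random

// Exponential growth, a hard cap, then ±jitter around the capped value.
// attempt is 1-based: this delay is applied before attempt 2, 3, ...
fun backoffDelayMs(
  attempt: Int,
  baseMs: Long = 400,
  maxMs: Long = 10_000,
  jitterRatio: Double = 0.3
): Long {
  val exponential = baseMs * (1L shl (attempt - 1)) // base, 2x, 4x, ...
  val capped = min(exponential, maxMs)
  val jitter = (capped * jitterRatio).toLong()
  return (capped + Random.nextLong(-jitter, jitter + 1)).coerceAtLeast(0L)
}
```

Keeping this as a standalone function makes the schedule trivially unit-testable, independent of any HTTP client.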

4) Treat 429/503 as “slow down” signals

If the server says you’re rate limited or overloaded, respect it. Prefer Retry-After when available.

  • Handle HTTP 429 with backoff (and consider user messaging)
  • Handle HTTP 503 similarly (service unavailable)
  • Prefer server-provided delay (Retry-After header)
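A hedged sketch of "prefer the server's hint": this handles only the delta-seconds form of `Retry-After` (the header can also carry an HTTP-date, which is omitted here for brevity), and caps the result so a misbehaving server can't park your UI for minutes:

```kotlin
// Honor a server-provided delay when present and parseable; otherwise
// fall back to your own backoff schedule. Always cap to protect the UX.
fun retryDelayMs(retryAfterHeader: String?, fallbackMs: Long, capMs: Long = 30_000): Long {
  val serverSeconds = retryAfterHeader?.trim()?.toLongOrNull()
  val serverMs = serverSeconds?.times(1000)
  return (serverMs ?: fallbackMs).coerceIn(0, capMs)
}
```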

5) Log one line per request (with a request ID)

You can’t fix what you can’t see. Make every request debuggable with consistent fields.

  • Method + route (not full URL with tokens)
  • Status / error type (timeout, DNS, TLS, etc.)
  • Total duration + attempt number
  • Server request ID header (if you have one)
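What "one line per request" can look like in practice. The field names below are arbitrary conventions, not a standard; pick whatever your log pipeline parses well:

```kotlin
// One parseable line per request. Route templates only — never full URLs,
// which can leak tokens and query parameters into logs.
fun requestLogLine(
  method: String,
  route: String,          // e.g. "/v1/items/{id}", not the expanded URL
  status: Int?,           // null when no response was received
  errorCategory: String?, // e.g. "timeout", "dns", "tls"
  durationMs: Long,
  attempt: Int,
  requestId: String?
): String = buildString {
  append(method).append(' ').append(route)
  append(" status=").append(status ?: "-")
  append(" error=").append(errorCategory ?: "-")
  append(" dur_ms=").append(durationMs)
  append(" attempt=").append(attempt)
  append(" req_id=").append(requestId ?: "-")
}
```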

6) Add a “retry budget” per action

The fastest way to create a bad day is “retry forever.” Bound retries with a user-action budget.

  • 2–4 total attempts (including the first) for foreground actions
  • Longer schedules only for background sync (with constraints)
  • Cancel retries when the user navigates away

One guiding principle

Retries are a safety net, not a strategy. Your goal is to make one attempt usually succeed (good caching, reasonable payloads, resilient backend), and retries only cover transient failure.

Overview

Mobile networking sits at the intersection of unreliable transport and impatient humans. Unlike server-to-server calls, a mobile request might start on Wi-Fi, roam to cellular, hit a captive portal, get paused by the OS, then resume on a different IP. If your client uses “default” settings, you often get the worst of both worlds: hangs when you should fail fast, and retry storms when you should back off.

What you’ll build in this post

  • A clear timeout policy (connect/read/write + overall budget)
  • A retry decision matrix (what to retry, what never to retry)
  • An exponential backoff + jitter strategy with caps
  • Rules for idempotency (and when to use idempotency keys)
  • Minimal observability so failures become actionable, not mysterious

Sane defaults (start here)

  • Connect timeout: default 5–10s. Increase slightly when users are on slow/captive networks; decrease when actions must feel instant.
  • Read timeout: default 15–30s. Increase for large downloads/streaming; decrease for small JSON APIs.
  • Attempts: default 2–4 total tries. Lower for high-cost endpoints; higher (but spaced out) for background sync with constraints.
  • Backoff base: default 250–500ms. Increase the base and cap when the server is rate limited or overloaded.
  • Jitter: default 20–50%. Use full jitter when many devices fail in sync.
  • Retryable status codes: default 408, 429, 500–504. Only if your backend semantics are safe for that endpoint.

Don’t benchmark on perfect Wi-Fi

A policy that feels “too conservative” on a fiber connection is often exactly right on a subway platform. Always test on throttled/unstable networks before shipping.

Core concepts

To design reliable mobile networking, you need a few terms and a simple mental model. Think of each request as a state machine with a strict budget: you can spend time trying, waiting, and retrying—but you must stop before the UX breaks.

Timeouts

A timeout is your client deciding “this is taking too long.” Different timeout types protect you from different failure modes:

  • Connect: protects against never establishing a connection (TCP/TLS). Typical mobile causes: captive portal, bad DNS, radio waking, packet loss.
  • Read: the connection is established but the response stalls. Typical causes: overloaded server, mid-transfer drops, slow cellular.
  • Write: uploading stalls. Typical causes: large uploads, poor uplink, radio switching.
  • Overall budget: protects against “retry forever” and death by many small waits. Typical causes: recursive retry chains, unbounded backoff.

Retries

A retry is another attempt after a failure. The key question is not “can it succeed on try #2?” but: “Is it safe and useful to try again?”

Idempotency: the safety gate

An operation is idempotent if doing it twice has the same effect as doing it once. Many GET requests are idempotent. Many POST requests are not.

  • Retrying idempotent calls can improve reliability with low risk
  • Retrying non-idempotent calls can create duplicates unless you add an idempotency key
  • “Safe” HTTP methods: GET, HEAD, OPTIONS (typically)

Backoff + jitter

Exponential backoff spaces retries out: wait a bit, then longer, then longer—up to a cap. Jitter randomizes delays so devices don’t all retry together.

Why exponential backoff works

  • Reduces load on a struggling server
  • Stops request storms during outages
  • Buys time for transient issues (roaming, short outages)

Why jitter matters on mobile

  • Many clients share the same failure at the same time (cell tower, ISP, outage)
  • Without jitter, retries synchronize and prolong the outage
  • With jitter, retries spread out and recover faster
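The strongest variant of this idea is “full jitter”: rather than nudging the exponential delay up or down, draw the whole delay uniformly between zero and the capped exponential value. A minimal sketch with illustrative defaults:

```kotlin
import kotlin.math.min
import kotlin.random.Random

// Full jitter: delay is uniform in [0, min(cap, base * 2^(attempt-1))].
// Spreads a fleet of synchronized clients as widely as possible.
fun fullJitterDelayMs(attempt: Int, baseMs: Long = 400, capMs: Long = 10_000): Long {
  val ceiling = min(baseMs * (1L shl (attempt - 1)), capMs)
  return Random.nextLong(0, ceiling + 1)
}
```

The tradeoff: individual devices may retry sooner than plain exponential backoff would allow, but the population as a whole ramps load back up much more smoothly.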

Retry budget and circuit breakers

A retry budget caps how much you’ll “spend” on retries per action or per time window. A circuit breaker temporarily stops calling a failing dependency to avoid wasting battery and making things worse. You don’t need a complicated implementation to get value—just a few guardrails.

The most expensive bug

Unbounded retries can: drain battery, burn user data, overload your API, and create duplicate writes. Always cap attempts and cap total time.
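A few guardrails really is all a client-side breaker needs. This is a minimal sketch (assumed names, injected clock for testability); production implementations usually add an explicit half-open state and per-dependency instances:

```kotlin
// Opens after N consecutive failures; after the cooldown, lets requests
// through again to probe the dependency. Not thread-safe — confine to one
// dispatcher or add synchronization as needed.
class SimpleCircuitBreaker(
  private val failureThreshold: Int = 5,
  private val openForMs: Long = 30_000,
  private val now: () -> Long = System::currentTimeMillis
) {
  private var consecutiveFailures = 0
  private var openedAtMs: Long? = null

  fun allowRequest(): Boolean {
    val opened = openedAtMs ?: return true
    return now() - opened >= openForMs // cooldown elapsed: allow a probe
  }

  fun recordSuccess() {
    consecutiveFailures = 0
    openedAtMs = null
  }

  fun recordFailure() {
    consecutiveFailures++
    if (consecutiveFailures >= failureThreshold) openedAtMs = now()
  }
}
```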

Step-by-step

This is a practical, production-friendly recipe. Even if your app uses a high-level client (Retrofit, Alamofire, URLSession wrappers), the same design applies: separate policy (what should happen) from plumbing (how requests are sent).

Step 1 — Define a networking policy (write it down)

Before touching code, define your policy in plain language. It should answer: how long do we wait, what do we retry, how many times, and how do we observe it.

Policy checklist

  • Per-request timeouts: connect/read/write
  • Overall budget: max time per user action (including retries)
  • Retry rules: which methods/endpoints/errors are retryable
  • Backoff rules: base, multiplier, cap, jitter
  • Cancellation: what cancels a request (screen change, app background)
  • Observability: request IDs, durations, attempt counts, error taxonomy

Step 2 — Classify failures (so “retry” isn’t a guess)

Treat failures in three buckets. This keeps your retry logic simple and safe.

  • Transient transport (timeouts, connection resets, DNS hiccups): retry with backoff (if idempotent)
  • Server overload (429, 503, some 5xx spikes): retry with longer backoff; respect Retry-After
  • Permanent / client (400/401/403/404, validation errors): do not retry; fix the request or auth; show actionable UX

A tiny but powerful improvement

Track and log an error category (timeout, dns, tls, http_5xx, http_429, offline). Your dashboard becomes instantly useful.
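A sketch of that taxonomy as a single function. The JDK exception types below cover common JVM/Android clients; adjust the mapping for your HTTP stack, and note that `UnknownHostException` often also means the device is simply offline:

```kotlin
import java.net.SocketTimeoutException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

// Map a response status and/or thrown error to one coarse category
// for logging and dashboards.
fun errorCategory(status: Int?, error: Throwable?): String = when {
  error is SocketTimeoutException -> "timeout"
  error is UnknownHostException -> "dns"      // frequently "offline" in disguise
  error is SSLException -> "tls"
  status == 429 -> "http_429"
  status != null && status in 500..599 -> "http_5xx"
  error != null -> "transport_other"
  else -> "ok"
}
```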

Step 3 — Implement retries at one layer only

Pick one place to retry: either in your HTTP client layer (interceptor/middleware) or in a repository/service layer. Don’t do both, or you’ll multiply attempts without realizing it.

Client-layer retries (good default)

  • Centralized and consistent
  • Easy to add logging and metrics
  • Great for idempotent requests

Service-layer retries (use when)

  • Different endpoints need different policies
  • You need domain-aware logic (e.g., safe to replay only if token exists)
  • You want to tie retries to UX flows (foreground vs background)

Step 4 — Add exponential backoff + jitter (Android example)

Below is a practical OkHttp interceptor: it retries only when the request is retryable, uses exponential backoff with jitter, respects a max-attempt cap, and avoids retrying non-idempotent requests unless you explicitly allow it.

import okhttp3.Interceptor
import okhttp3.Request
import okhttp3.Response
import java.io.IOException
import kotlin.math.min
import kotlin.random.Random

class RetryBackoffInterceptor(
  private val maxAttempts: Int = 3,              // total tries including the first
  private val baseDelayMs: Long = 400,           // 250–500ms is a good start
  private val maxDelayMs: Long = 8_000,          // cap so users aren't stuck forever
  private val jitterRatio: Double = 0.3          // 0.2–0.5 typical
) : Interceptor {

  override fun intercept(chain: Interceptor.Chain): Response {
    val request = chain.request()

    // Only retry idempotent methods by default.
    // If you need to retry POST, use an idempotency key and explicitly allow it per endpoint.
    if (!isRetryableMethod(request)) {
      return chain.proceed(request)
    }

    var attempt = 1
    var lastException: IOException? = null

    while (attempt <= maxAttempts) {
      try {
        val response = chain.proceed(request)

        if (!shouldRetryResponse(response, attempt)) {
          return response
        }

        // Close response body before retrying to avoid leaks.
        response.close()
      } catch (e: IOException) {
        lastException = e
        if (!shouldRetryException(e, attempt)) break
      }

      if (attempt == maxAttempts) break

      val delay = computeBackoffDelay(attempt)
      // In production, prefer a non-blocking approach when possible.
      Thread.sleep(delay)
      attempt++
    }

    throw lastException ?: IOException("Request failed after $maxAttempts attempts")
  }

  private fun isRetryableMethod(request: Request): Boolean {
    return when (request.method.uppercase()) {
      "GET", "HEAD", "OPTIONS", "PUT", "DELETE" -> true
      else -> false
    }
  }

  private fun shouldRetryResponse(response: Response, attempt: Int): Boolean {
    val code = response.code
    if (attempt >= maxAttempts) return false
    return code == 408 || code == 429 || code in 500..504
  }

  private fun shouldRetryException(e: IOException, attempt: Int): Boolean {
    // Many IOExceptions are transient (timeouts, connection resets). Avoid retry loops by capping attempts.
    return attempt < maxAttempts
  }

  private fun computeBackoffDelay(attempt: Int): Long {
    // attempt=1 is the first try; delay before attempt=2 should be baseDelayMs.
    val exp = 1L shl (attempt - 1) // 1,2,4,...
    val raw = baseDelayMs * exp
    val capped = min(raw, maxDelayMs)
    val jitter = (capped * jitterRatio).toLong()
    val randomized = capped + Random.nextLong(-jitter, jitter + 1)
    return randomized.coerceAtLeast(0L)
  }
}

Don’t blindly retry uploads

Retrying a large upload on cellular can burn data and battery quickly. Prefer resumable uploads, chunking, or background jobs with constraints instead of “retry in a loop.”

Step 5 — Handle idempotency for POST (the “no duplicates” rule)

If you must retry POST (payments, create operations, form submissions), use an idempotency key: a unique token that the server uses to treat repeated requests as the same operation. The client should generate a key per user action, then reuse it for all retries of that action.

Practical rules

  • Key must be stable across retries of the same action (store it in memory for that flow, or persist if needed)
  • Server must enforce idempotency per endpoint (store result by key for a window of time)
  • Only enable POST retries on endpoints that explicitly support it
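A sketch of the client side of these rules: one key per user action, cached so every retry of that action reuses it. The `Idempotency-Key` header name is a common convention, not something your server supports automatically; the server must actually deduplicate by key:

```kotlin
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

// One idempotency key per logical user action, stable across retries.
class IdempotencyKeys {
  private val keysByAction = ConcurrentHashMap<String, String>()

  // Same actionId -> same key, so server-side dedup works across retries.
  fun keyFor(actionId: String): String =
    keysByAction.getOrPut(actionId) { UUID.randomUUID().toString() }

  // Call when the action finishes (success or permanent failure),
  // so a genuinely new attempt gets a fresh key.
  fun clear(actionId: String) {
    keysByAction.remove(actionId)
  }
}
```

Attach the key as a header on every attempt of the action, e.g. `request.header("Idempotency-Key", keys.keyFor(actionId))` in whatever client you use.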

Step 6 — Respect OS constraints and UX (iOS example)

On iOS, you often wrap URLSession calls with a small retry helper. The key is the same: retry only when safe, back off with jitter, and stop after a budget. Below is an async/await-style helper that retries transient failures and a small set of HTTP codes.

import Foundation

enum NetworkRetryError: Error {
  case nonRetryableStatus(Int)
  case invalidResponse
}

struct RetryPolicy {
  let maxAttempts: Int
  let baseDelay: TimeInterval   // seconds
  let maxDelay: TimeInterval
  let jitterRatio: Double
}

func requestWithRetry(
  _ request: URLRequest,
  session: URLSession = .shared,
  policy: RetryPolicy = RetryPolicy(maxAttempts: 3, baseDelay: 0.4, maxDelay: 8.0, jitterRatio: 0.3)
) async throws -> (Data, HTTPURLResponse) {

  func backoffDelay(attempt: Int) -> TimeInterval {
    let exp = pow(2.0, Double(attempt - 1)) // 1,2,4,...
    let raw = policy.baseDelay * exp
    let capped = min(raw, policy.maxDelay)
    let jitter = capped * policy.jitterRatio
    return capped + Double.random(in: -jitter...jitter)
  }

  for attempt in 1...policy.maxAttempts {
    do {
      let (data, response) = try await session.data(for: request)

      guard let http = response as? HTTPURLResponse else {
        throw NetworkRetryError.invalidResponse
      }

      let code = http.statusCode
      let retryable = (code == 408) || (code == 429) || (500...504).contains(code)

      if retryable {
        if attempt == policy.maxAttempts { return (data, http) } // return last response for caller to handle
        try await Task.sleep(nanoseconds: UInt64(max(0.0, backoffDelay(attempt: attempt)) * 1_000_000_000))
        continue
      }

      // Non-retryable: return normally on 2xx; otherwise throw so caller can map to UX/auth flows.
      if (200...299).contains(code) {
        return (data, http)
      } else {
        throw NetworkRetryError.nonRetryableStatus(code)
      }
    } catch {
      // Transport error (timeouts, connection drops, etc.)
      if attempt == policy.maxAttempts { throw error }
      try await Task.sleep(nanoseconds: UInt64(max(0.0, backoffDelay(attempt: attempt)) * 1_000_000_000))
    }
  }

  throw URLError(.unknown) // Unreachable in practice: the final attempt always returns or throws above.
}

Step 7 — Make the policy configurable (so you can tune without rewrites)

Your first defaults won’t be perfect for every endpoint. The trick is to keep a single place where policies live, so you can tune timeouts or retry caps without scattering “magic numbers” across the codebase. Even a small JSON policy file can keep teams aligned.

{
  "networkPolicy": {
    "timeouts": {
      "connectMs": 8000,
      "readMs": 25000,
      "writeMs": 15000,
      "overallBudgetMs": 35000
    },
    "retries": {
      "maxAttempts": 3,
      "retryableMethods": ["GET", "HEAD", "OPTIONS", "PUT", "DELETE"],
      "retryableStatusCodes": [408, 429, 500, 502, 503, 504],
      "retryOnTransportErrors": true,
      "doNotRetryStatusCodes": [400, 401, 403, 404, 422]
    },
    "backoff": {
      "baseDelayMs": 400,
      "multiplier": 2.0,
      "maxDelayMs": 8000,
      "jitterRatio": 0.3
    },
    "observability": {
      "logAttempts": true,
      "logTimeoutCategory": true,
      "includeRequestIdHeader": "X-Request-Id"
    }
  }
}

Foreground vs background

Foreground UX usually wants fewer retries and a tighter budget. Background sync can use longer schedules, but should run only with constraints (network available, charging, unmetered when possible) and be cancelable.

Common mistakes

Most mobile reliability issues come from a handful of repeat offenders. Here are the big ones—and what to do instead.

Mistake 1 — No timeout (or a single giant one)

Requests hang forever, UI looks frozen, and users force-close the app.

  • Fix: set connect/read/write timeouts + an overall request budget.
  • Fix: choose timeouts per endpoint category (small JSON vs large download).

Mistake 2 — Retrying non-idempotent POST

Duplicate orders, duplicate payments, duplicate messages—sometimes without any obvious client bug.

  • Fix: only retry POST with an idempotency key and server support.
  • Fix: treat “create” endpoints as high-risk and minimize attempts.

Mistake 3 — Retry storms (no backoff / no jitter)

When the server is down, clients hammer it, recovery takes longer, and battery drains fast.

  • Fix: exponential backoff + jitter + max cap.
  • Fix: consider a simple circuit breaker during outages.

Mistake 4 — Retrying on the wrong HTTP codes

401/403 won’t magically fix themselves; 404 won’t appear on retry; 422 is a validation issue.

  • Fix: retry only 408/429/503 and select 5xx codes (and only for safe endpoints).
  • Fix: route auth problems into a refresh/login flow instead of “try again.”

Mistake 5 — Not canceling requests when UI changes

You waste network, and “late” responses can overwrite newer state.

  • Fix: cancel in-flight work on screen changes; ignore stale responses with request tokens.
  • Fix: avoid shared mutable state writes without checking “is this still current?”.
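One lightweight way to implement the “request token” fix: bump a generation counter whenever a new request supersedes the old one, and drop any response carrying a stale token. A minimal sketch (assumed names):

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Each new screen/query bumps the generation; late responses from older
// generations are ignored instead of overwriting newer state.
class LatestOnlyGuard {
  private val generation = AtomicLong(0)

  fun newRequestToken(): Long = generation.incrementAndGet()

  fun isCurrent(token: Long): Boolean = token == generation.get()
}
```

A response handler then checks `isCurrent(token)` before touching UI state; combined with actual cancellation of the in-flight call, this closes the “late response overwrites fresh data” race.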

Mistake 6 — Zero observability (“it failed”)

If you can’t tell timeout vs DNS vs 5xx, you can’t prioritize fixes or talk to your backend team.

  • Fix: log duration, attempt count, error category, and a request ID.
  • Fix: sample logs (don’t spam) and redact tokens/PII.

A quick “smoke test” for reliability

Turn on airplane mode mid-request, switch Wi-Fi networks, and test a flaky hotspot. If your app stays responsive and errors are actionable, you’re on the right track.

FAQ

How many retries should a mobile app use?

For foreground user actions, prefer 2–4 total attempts (including the first) with exponential backoff and jitter, and a strict overall time budget. For background sync, you can allow more attempts but space them out and run under OS constraints.

What HTTP status codes are safe to retry?

Common retry candidates are 408 (request timeout), 429 (rate limited), 503 (unavailable), and a narrow set of 5xx codes like 500/502/504. Never retry 4xx validation/auth errors unless you first fix the cause (refresh token, correct payload, etc.).

Should I retry POST requests?

Only if the endpoint is designed for it. Use an idempotency key so the server can safely deduplicate retries. Without that, retries can create duplicate writes and real-world damage.

What’s the difference between connect and read timeouts?

Connect timeout covers establishing the connection (DNS/TCP/TLS). Read timeout covers waiting for the server to send bytes after a connection exists. They fail for different reasons and deserve different default values.

Do retries make my app “more reliable” or just hide problems?

Done right, retries improve reliability for transient failures. Done wrong, they hide real bugs and overload your backend. The difference is: caps, backoff+jitter, idempotency rules, and logging that lets you see what’s happening.

How do I avoid draining battery and user data?

Keep retry caps low in the foreground, avoid retrying large uploads, cancel in-flight work when it’s no longer relevant, and run background retries only with constraints (network available, charging/unmetered when possible).

Cheatsheet

A scan-fast checklist you can copy into your project docs.

Mobile networking reliability checklist

  • Timeouts: set connect (5–10s), read (15–30s), write (10–20s), plus overall budget per action
  • Retry cap: 2–4 total attempts for foreground; never infinite
  • Retry safety: retry idempotent methods by default; POST only with idempotency key + server support
  • Retryable failures: transient transport + 408/429/503 + select 5xx
  • Backoff: exponential (×2), base 250–500ms, cap 8–15s
  • Jitter: randomize delay 20–50% (prefer full jitter if many clients)
  • Cancellation: cancel when screen changes; ignore stale responses
  • Observability: request ID, attempt #, duration, error category, status code
  • Rate limiting: respect Retry-After; treat 429 as “slow down”
  • UX: show actionable errors; provide “Try again” with context; don’t block UI indefinitely

If you can only fix one thing

Add a strict overall time budget and cap retries. It prevents the worst user experience and the worst backend load, even before you perfect your error handling.

Wrap-up

Reliable mobile networking is mostly policy: clear timeouts, safe retries, backoff with jitter, and the discipline to stop. Once you add basic observability, failures become categorized and fixable instead of “random.”

Next actions (pick one)

  • Audit your app: find endpoints without timeouts or with unbounded retries
  • Add a retry budget per user action and cancel retries when the UI moves on
  • Implement idempotency keys for any POST you might retry
  • Add request logging with request IDs and error categories (timeout/dns/tls/http_5xx/http_429)

If you want to go deeper, the related posts below pair well with this topic: offline-first sync, testing strategy, and CI pipelines (where flaky networking tests often surface first).

Quiz

Quick self-check.

1) Which request is safest to retry automatically?
2) What’s the main purpose of adding jitter to backoff?
3) Which HTTP response most strongly suggests “slow down and back off”?
4) What’s the best “stop condition” for retries in a foreground user flow?