AI · Time Series

Time-Series Forecasting Without Pain: Baselines First

Avoid overfitting with strong baselines and honest evaluation.

Reading time: ~8–12 min
Level: All levels

If your forecasting model “looks amazing,” it’s usually because the baseline was weak or the evaluation was dishonest. Start with strong baselines and time-aware backtests—then you can improve with confidence.


Quickstart: the 15-minute forecasting setup

Use this post like a checklist. You’ll get a baseline that’s hard to beat, an evaluation you can trust, and a simple path to “fancier” models only when they actually help.

1) Choose a baseline (pick one)

  • Naive: tomorrow = today (best for random-walk-like series)
  • Seasonal naive: next week’s Monday = last week’s Monday
  • Moving average: forecast = mean of last k points
  • ETS (Exponential Smoothing): trend/seasonality without heavy tuning

2) Evaluate honestly (don’t leak)

  • Split by time (never random split)
  • Use rolling backtests (multiple folds)
  • Report a baseline score first
  • Track at least one scale-free metric (e.g., SMAPE) plus one absolute metric (e.g., MAE)

The “pain-free” rule

Don’t ask “which model is best?” yet. First ask: “Can I beat seasonal naive in a clean backtest?” If you can’t, fancy models will waste your time.

What you’ll walk away with
  • A strong baseline ladder (easy → strong)
  • A backtesting recipe that prevents overfitting
  • Common traps and how to avoid them
  • A cheatsheet you can reuse on every new dataset

Overview: why baselines beat most “smart” models

Time series are deceptive: the past strongly predicts the future, so even flawed methods can look good. That’s why forecasting projects often fail in production—models overfit, evaluation leaks future info, and the simplest method would have performed just as well.

A good baseline answers the right question

A baseline isn’t just “something to beat.” It’s the minimum performance bar you must clearly exceed before deploying anything complex.

  • Naive: shines on random-walk-ish series (stocks, noisy sensors); teaches you whether the series is predictable at all
  • Seasonal naive: shines with weekly/daily seasonality (traffic, sales); teaches you whether you’re beating “repeat last season”
  • Moving average: shines when smoothing noise over short horizons; teaches you how much noise vs. signal you have
  • ETS (exponential smoothing): shines on trend + seasonality without feature engineering; teaches you what classical forecasting can do with almost no effort

The fastest way to avoid overfitting

If your model can’t beat seasonal naive on multiple rolling backtests, your pipeline is usually the problem (split, leakage, features), not the algorithm.

Once baselines are set, improving becomes straightforward: you either add real signal (useful features), reduce noise (smoothing/aggregation), or pick a model that matches the series structure (trend, seasonality, events, holidays, multiple series).

Core concepts: the few things that matter most

1) Forecast horizon

The horizon is how far ahead you predict (next hour, next day, next 30 days). A method that’s great for one-step ahead can fail at longer horizons.

Rule of thumb

Short horizons often work with simple baselines. Long horizons usually need stronger structure: seasonality, trend, known events, or multiple related series.

2) Seasonality and calendar structure

Seasonality means patterns repeat at a fixed period (daily, weekly, yearly). Many real-world datasets have it. If you ignore seasonality, your model can “learn it” in training and still fail when the calendar shifts.

Seasonal naive baseline

If your season is weekly:

y_hat[t] = y[t - 7]

This is absurdly strong for many business series.
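
As a minimal plain-Python sketch of this rule (the series and numbers below are made up for illustration):

```python
def seasonal_naive(history, season):
    # One-step seasonal naive: repeat the value from exactly one season ago.
    # Assumes len(history) >= season.
    return history[-season]

# Two weeks of hypothetical daily sales (Mon..Sun, twice)
sales = [10, 12, 15, 14, 13, 20, 25,
         11, 13, 16, 15, 14, 21, 26]

next_monday = seasonal_naive(sales, season=7)
print(next_monday)  # forecasts next Monday from last Monday -> 11
```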

Common seasons

  • Hourly: 24 (daily cycle)
  • Daily: 7 (weekly pattern)
  • Monthly: 12 (yearly season)
  • Retail: holidays + promotions (non-fixed “events”)

3) Leakage (the silent killer)

Leakage is when your features or preprocessing use information from the future. It can create models that look perfect in validation and collapse in production.

Leakage examples (very common)

  • Random train/test split on a time series
  • Scaling (mean/std) fit on all data, then split
  • Rolling features computed with future points included
  • Using “future-known” fields that aren’t actually known at forecast time
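
As a concrete illustration of the scaling trap above, compare a leaky pipeline with a clean one (a sketch; the helper names are made up):

```python
def split_and_scale_leaky(y, split):
    # WRONG: the mean is computed on the full series, so the training
    # data is centered using information from the future.
    mu = sum(y) / len(y)
    return [v - mu for v in y[:split]], [v - mu for v in y[split:]]

def split_and_scale_clean(y, split):
    # RIGHT: split first, then fit the statistic on the training window only.
    train = y[:split]
    mu = sum(train) / len(train)
    return [v - mu for v in train], [v - mu for v in y[split:]]
```

The same split-first discipline applies to imputation, outlier clipping, and rolling features.
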

If your score is “too good,” assume leakage first

Especially if a complex model suddenly beats the baseline by a huge margin. Clean your splits and re-run the backtest before celebrating.

4) Backtesting (rolling evaluation)

Backtesting means evaluating the model on several time windows, not just one split. This protects you from picking a model that got lucky on a single period.

Rolling-origin sketch

Train on early history → test on the next block → move forward → repeat. Report average + spread (min/max or std).

  • Fold 1: train Jan → Jun, test Jul (first “future” check)
  • Fold 2: train Jan → Jul, test Aug (stability across months)
  • Fold 3: train Jan → Aug, test Sep (robustness under drift)

Step-by-step: a baseline-first forecasting workflow

This workflow scales from “one series in a CSV” to “many products across stores.” The ordering matters: each step prevents a class of mistakes.

Step 1 — Define the job (what are we predicting?)

  • Target: what value are you forecasting? (sales, traffic, load)
  • Granularity: hourly, daily, weekly?
  • Horizon: 1-step, 7 days ahead, 30 days ahead?
  • Business loss: is over-forecasting worse than under-forecasting?

Step 2 — Make a time split that matches reality

Decide how your model will be used, then simulate it. If you forecast daily and retrain weekly, your backtest should reflect that cadence.

Simple setup (good start)

  • Train: first 70–85% of time
  • Validation: next 10–15%
  • Test: final 10–15% (never touch until the end)
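
The simple setup above is just two index cutoffs; a sketch on a plain list, assuming an 80/10/10 split:

```python
def time_split(y, train_frac=0.8, val_frac=0.1):
    # Split by position in time: oldest points train, newest points test.
    n = len(y)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return y[:train_end], y[train_end:val_end], y[val_end:]

train, val, test = time_split(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```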

Better setup (recommended)

  • Rolling backtest with 3–10 folds
  • Report mean + spread of metrics
  • Keep a final “holdout” test period

Step 3 — Implement the baseline ladder

Build baselines from weakest to strongest. Stop when you hit “strong enough to be annoying to beat.” For many datasets, seasonal naive or ETS is that line.

Baseline ladder (copy/paste mental model)

  • Level 0, mean: predict the overall mean (sanity check)
  • Level 1, naive: ŷ[t] = y[t-1] (hard baseline for noisy series)
  • Level 2, seasonal naive: ŷ[t] = y[t-s] (beats most “fancy” models when seasonality exists)
  • Level 3, moving average: mean of the last k points (noise reduction)
  • Level 4, ETS: smoothing + trend + season (strong classical model with minimal tuning)

Step 4 — Pick metrics that match the goal

Don’t overthink metrics, but do pick at least two: one absolute error and one scale-free percentage-like score.

Good defaults

  • MAE: easy to understand
  • RMSE: penalizes big misses
  • SMAPE: stable-ish percentage error

When to avoid MAPE

MAPE explodes near zero. If your target can be 0 or tiny (demand, clicks), prefer SMAPE or MAE.
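
To see the difference concretely, here are minimal implementations of both (MAPE in its common 0–100% form; each SMAPE term is bounded at 200):

```python
def mape(y_true, y_pred):
    # Blows up whenever y_true is near zero.
    return 100 * sum(abs(a - b) / abs(a) for a, b in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    # Symmetric denominator keeps each term bounded (0–200).
    return 100 * sum(2 * abs(a - b) / (abs(a) + abs(b)) for a, b in zip(y_true, y_pred)) / len(y_true)

# One near-zero actual is enough to wreck MAPE:
print(mape([0.1, 100.0], [1.1, 101.0]))   # ~500.5
print(smape([0.1, 100.0], [1.1, 101.0]))  # ~83.8
```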

Step 5 — Minimal baseline code (Python)

This is intentionally small. The goal is not “perfect library code,” it’s a clean reference you can trust.

# Minimal baselines + rolling backtest in plain Python (no dependencies).
# Swap the plain lists for pandas Series if you prefer.

def naive_forecast(y_train, horizon):
    # Predict last observed value for all future steps
    last = y_train[-1]
    return [last] * horizon

def seasonal_naive_forecast(y_train, horizon, season):
    # Predict the value from one season ago; repeats if horizon > season.
    # Assumes len(y_train) >= season.
    out = []
    for h in range(1, horizon + 1):
        out.append(y_train[-season + ((h - 1) % season)])
    return out

def moving_average_forecast(y_train, horizon, k=7):
    k = min(k, len(y_train))
    avg = sum(y_train[-k:]) / k
    return [avg] * horizon

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rolling_backtest(y, start, horizon, step, forecast_fn):
    # y: full series list
    # start: first index where you begin forecasting (e.g. 70% of data)
    scores = []
    t = start
    while t + horizon <= len(y):
        train = y[:t]
        test = y[t:t+horizon]
        pred = forecast_fn(train, horizon)
        scores.append(mae(test, pred))
        t += step
    return scores

# Example usage:
# scores = rolling_backtest(y, start=200, horizon=14, step=7, forecast_fn=lambda tr, h: seasonal_naive_forecast(tr, h, season=7))
# print("MAE mean:", sum(scores)/len(scores), "min/max:", min(scores), max(scores))

One powerful habit

Always log: baseline score, model score, and the % improvement. If the improvement is tiny and unstable across folds, don’t ship complexity.
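
In practice this habit is a few lines per experiment (the MAE values below are made up for illustration):

```python
baseline_scores = [12.0, 11.5, 13.2]  # seasonal naive, MAE per fold (hypothetical)
model_scores = [10.8, 11.9, 12.0]     # candidate model, MAE per fold (hypothetical)

for fold, (b, m) in enumerate(zip(baseline_scores, model_scores), start=1):
    gain = 100 * (b - m) / b
    print(f"fold {fold}: baseline={b:.1f}  model={m:.1f}  improvement={gain:+.1f}%")
```

Fold 2 comes out worse than the baseline, which is exactly the instability signal you want to catch before shipping.
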

Step 6 — Only then: upgrades that usually work

Upgrade the data (often best ROI)

  • Fix missing values + outliers
  • Aggregate noisy series (hourly → daily) if the decision allows
  • Add known future events (holidays, promotions)
  • Use multiple related series (hierarchies, groups)

Upgrade the model (when baselines are beaten)

  • ETS tuning (trend/season type)
  • ARIMA/SARIMA for structured autocorrelation
  • Gradient boosting with lag features (tabularized TS)
  • Deep learning only when you have scale + complexity
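
The “tabularized TS” idea can be sketched as a small lag-feature builder; any gradient-boosting library can then consume the rows (the function name and lag choices here are illustrative):

```python
def make_lag_features(y, lags=(1, 7)):
    # Turn a series into (features, target) rows for a tabular model.
    # Each row only looks backwards, so there is no leakage by construction.
    X, target = [], []
    for i in range(max(lags), len(y)):
        X.append([y[i - lag] for lag in lags])
        target.append(y[i])
    return X, target

X, target = make_lag_features(list(range(10)))
print(X[0], target[0])  # [6, 0] 7
```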

A simple decision gate

Move to a more complex model only if it: (1) beats seasonal naive on most folds, (2) stays better on the final holdout, and (3) remains stable when you retrain later.

Common mistakes (and how to fix them)

These are the exact traps that make forecasting feel “painful.” Fixing them usually improves results more than switching to a new model.

Mistake 1 — Weak baseline (or no baseline)

If your baseline is “predict the mean,” everything looks impressive.

  • Fix: always include seasonal naive (when seasonality exists).
  • Fix: compare against ETS for a strong classical reference.

Mistake 2 — Random split on time series

Random split leaks information because the future becomes “training data.”

  • Fix: split by time (train older, test newer).
  • Fix: use rolling backtests to avoid “one lucky split.”

Mistake 3 — Feature leakage via preprocessing

Scaling, imputation, and rolling features can leak future info.

  • Fix: fit preprocessing on training folds only.
  • Fix: compute rolling features using past-only windows.

Mistake 4 — Chasing a single metric

A model can “win” on one metric and still be worse for the business.

  • Fix: track MAE + a scale-free metric (SMAPE).
  • Fix: check errors on important segments (weekends, holidays, peaks).

The fastest debugging move

Plot residuals (actual − forecast) over time. If errors explode during certain weeks, your model is missing an event/holiday/seasonal structure—add it explicitly.

FAQ

What’s the best baseline for forecasting?

If you have clear seasonality (daily/weekly/yearly), start with seasonal naive. It’s simple, brutally strong, and exposes whether your “smart model” adds real value. If seasonality is weak or unknown, use naive + moving average as quick references.

What is ETS, and why is it such a good baseline?

ETS (Exponential Smoothing) is a family of classical methods that model level, trend, and seasonality with smoothing. It’s popular as a baseline because it often performs surprisingly well with minimal tuning.
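
The simplest member of the family, simple exponential smoothing (level only, no trend or season), fits in a few lines; full ETS implementations, such as statsmodels’ ExponentialSmoothing, add trend and seasonal components on top:

```python
def ses_forecast(y, alpha=0.3, horizon=7):
    # Simple exponential smoothing: the forecast is a smoothed "level"
    # that weights recent observations more heavily (0 < alpha <= 1).
    level = y[0]
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
    return [level] * horizon

print(ses_forecast([10, 10, 10, 10], alpha=0.5, horizon=3))  # [10.0, 10.0, 10.0]
```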

How many backtest folds do I need?

For most projects, 3–10 folds is enough. Use more folds if the series is unstable (lots of drift) or if you need high confidence for deployment. Always keep a final holdout period that you only evaluate once.

Why does my model do well at 1-day ahead but fails at 30-days ahead?

Longer horizons require stronger structure. At 1-day ahead, yesterday is often a great predictor. At 30-days ahead, you need seasonality, trend stability, and known future factors (calendar/events), otherwise forecasts become “guessy.”

Should I use deep learning for time series?

Only if the problem justifies it: lots of data, many related series, complex patterns, or a need to model nonlinear relationships with many features. For many forecasting tasks, strong baselines + classical methods + simple ML are easier to maintain and just as accurate.

Cheatsheet: baseline-first forecasting (copy this)

Baseline ladder

  • Mean (sanity)
  • Naive: ŷ[t]=y[t-1]
  • Seasonal naive: ŷ[t]=y[t-s]
  • Moving average: mean of last k
  • ETS: level + trend + season (strong classical)

Evaluation checklist

  • Split by time (never random)
  • Rolling backtest (3–10 folds)
  • Report mean + spread
  • Use MAE + SMAPE (good defaults)
  • Keep a final holdout test set

Leakage sniff test

  • If score is “shockingly good,” assume leakage first
  • Ensure preprocessing is fit on train folds only
  • Ensure rolling features use past-only windows
  • Ensure “known future” features are truly known at forecast time

Baseline success criteria

A model is worth keeping only if it beats seasonal naive on most folds, stays better on the holdout, and remains stable across retrains. Anything else is likely overfit or too fragile for production.

Wrap-up: make forecasting boring (that’s the win)

The goal isn’t to build the fanciest model—it’s to make accurate forecasts you can trust. Start with strong baselines (especially seasonal naive), evaluate with time-aware backtests, and only then move up the complexity ladder.

Your next step
  • Pick your season length (s) and run seasonal naive.
  • Backtest 5 folds and record MAE + SMAPE.
  • Try ETS and compare.
  • Only if you beat both: add features or move to ML.

Quiz

Quick self-check. This quiz is here to test if you learned the baseline-first mindset.

1) What’s the best first step in a new forecasting project?
2) Seasonal naive forecasting means…
3) Which evaluation approach is most trustworthy for time series?
4) If a complex model suddenly gets a “perfect” score, what should you suspect first?