If your forecasting model “looks amazing,” it’s usually because the baseline was weak or the evaluation was dishonest. Start with strong baselines and time-aware backtests—then you can improve with confidence.
Quickstart: the 15-minute forecasting setup
Use this post like a checklist. You’ll get a baseline that’s hard to beat, an evaluation you can trust, and a simple path to “fancier” models only when they actually help.
1) Choose a baseline (pick one)
- Naive: tomorrow = today (best for random-walk-like series)
- Seasonal naive: next week’s Monday = last week’s Monday
- Moving average: forecast = mean of last k points
- ETS (Exponential Smoothing): trend/seasonality without heavy tuning
2) Evaluate honestly (don’t leak)
- Split by time (never random split)
- Use rolling backtests (multiple folds)
- Report a baseline score first
- Track at least one scale-free metric (e.g. MAPE/SMAPE) + one absolute (MAE)
The “pain-free” rule
Don’t ask “which model is best?” yet. First ask: “Can I beat seasonal naive in a clean backtest?” If you can’t, fancy models will waste your time.
What you'll get from this post:
- A strong baseline ladder (easy → strong)
- A backtesting recipe that prevents overfitting
- Common traps and how to avoid them
- A cheatsheet you can reuse on every new dataset
Overview: why baselines beat most “smart” models
Time series are deceptive: the past strongly predicts the future, so even flawed methods can look good. That’s why forecasting projects often fail in production—models overfit, evaluation leaks future info, and the simplest method would have performed just as well.
A good baseline answers the right question
A baseline isn’t “something to beat.” It’s the minimum performance you must justify before deploying anything complex.
| Baseline | When it shines | What it teaches you |
|---|---|---|
| Naive | Random-walk-ish series (stocks, noisy sensors) | Whether the series is predictable at all |
| Seasonal naive | Weekly/daily seasonality (traffic, sales) | Whether you’re beating “repeat last season” |
| Moving average | Smoothing noise, short horizons | How much noise vs signal you have |
| ETS (Exp. smoothing) | Trend + seasonality without feature engineering | What “classical forecasting” can do with almost no effort |
If your model can’t beat seasonal naive on multiple rolling backtests, your pipeline is usually the problem (split, leakage, features), not the algorithm.
Once baselines are set, improving becomes straightforward: you either add real signal (useful features), reduce noise (smoothing/aggregation), or pick a model that matches the series structure (trend, seasonality, events, holidays, multiple series).
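To make this concrete, here is a small synthetic demo (the series and its weekly pattern are made up for illustration): on a series with weekly seasonality, seasonal naive beats plain naive by a wide margin.

```python
# Synthetic series: a repeating weekly pattern plus small deterministic "noise"
season = [10, 12, 15, 14, 13, 20, 25]          # one week of values
y = [season[t % 7] + (t % 3) for t in range(100)]

train, test = y[:90], y[90:97]                 # forecast one week ahead

naive_pred = [train[-1]] * 7                   # repeat last value
seasonal_pred = [train[-7 + h] for h in range(7)]  # repeat last week

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

print(mae(test, naive_pred), mae(test, seasonal_pred))
```

On this toy series, seasonal naive's MAE is several times lower than plain naive's; that gap is the bar any "smart" model has to clear.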
Core concepts: the few things that matter most
1) Forecast horizon
The horizon is how far ahead you predict (next hour, next day, next 30 days). A method that’s great for one-step ahead can fail at longer horizons.
Rule of thumb
Short horizons often work with simple baselines. Long horizons usually need stronger structure: seasonality, trend, known events, or multiple related series.
2) Seasonality and calendar structure
Seasonality means patterns repeat at a fixed period (daily, weekly, yearly). Many real-world datasets have it. If you ignore seasonality, your model can “learn it” in training and still fail when the calendar shifts.
Seasonal naive baseline
If your season is weekly:
```
y_hat[t] = y[t - 7]
```
This is absurdly strong for many business series.
Common seasons
- Hourly: 24 (daily cycle)
- Daily: 7 (weekly pattern)
- Monthly: 12 (yearly season)
- Retail: holidays + promotions (non-fixed “events”)
3) Leakage (the silent killer)
Leakage is when your features or preprocessing use information from the future. It can create models that look perfect in validation and collapse in production.
Leakage examples (very common)
- Random train/test split on a time series
- Scaling (mean/std) fit on all data, then split
- Rolling features computed with future points included
- Using “future-known” fields that aren’t actually known at forecast time
If a complex model suddenly beats the baseline by a huge margin, suspect leakage first. Clean your splits and re-run the backtest before celebrating.
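The scaling example above is the easiest leak to fix. A minimal sketch: fit the mean/std on the training window only, then apply those frozen statistics to the test window. Fitting on the full series would let a future spike shift the statistics of the past.

```python
# Fit scaling statistics on the training window only; reuse them on test data.
def fit_scaler(train):
    mean = sum(train) / len(train)
    var = sum((v - mean) ** 2 for v in train) / len(train)
    return mean, var ** 0.5 or 1.0   # guard against zero std

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

y = [1, 2, 3, 4, 100]                # the future holds a spike
train, test = y[:4], y[4:]

mean, std = fit_scaler(train)        # the spike never touches the statistics
train_z = transform(train, mean, std)
test_z = transform(test, mean, std)
```

The same discipline applies to imputation and any learned preprocessing: fit on train, apply to test.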
4) Backtesting (rolling evaluation)
Backtesting means evaluating the model on several time windows, not just one split. This protects you from picking a model that got lucky on a single period.
Rolling-origin sketch
Train on early history → test on the next block → move forward → repeat. Report average + spread (min/max or std).
| Fold | Train window | Test window | Goal |
|---|---|---|---|
| 1 | Jan → Jun | Jul | First “future” check |
| 2 | Jan → Jul | Aug | Stability across months |
| 3 | Jan → Aug | Sep | Robustness under drift |
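The fold table above can be generated mechanically. A minimal sketch (index-based, with made-up window sizes): each fold trains on everything before the origin and tests on the next `horizon` points, then the origin moves forward by `step`.

```python
# Rolling-origin fold generation over a series of length n.
def rolling_folds(n, start, horizon, step):
    """Yield (train_end, test_start, test_end) index triples."""
    t = start
    while t + horizon <= n:
        yield (t, t, t + horizon)
        t += step

folds = list(rolling_folds(n=120, start=90, horizon=10, step=10))
# → [(90, 90, 100), (100, 100, 110), (110, 110, 120)]
```

Slice your series with these triples and you get exactly the expanding-window scheme in the table.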
Step-by-step: a baseline-first forecasting workflow
This workflow scales from “one series in a CSV” to “many products across stores.” The ordering matters: each step prevents a class of mistakes.
Step 1 — Define the job (what are we predicting?)
- Target: what value are you forecasting? (sales, traffic, load)
- Granularity: hourly, daily, weekly?
- Horizon: 1-step, 7 days ahead, 30 days ahead?
- Business loss: is over-forecasting worse than under-forecasting?
Step 2 — Make a time split that matches reality
Decide how your model will be used, then simulate it. If you forecast daily and retrain weekly, your backtest should reflect that cadence.
Simple setup (good start)
- Train: first 70–85% of time
- Validation: next 10–15%
- Test: final 10–15% (never touch until the end)
Better setup (recommended)
- Rolling backtest with 3–10 folds
- Report mean + spread of metrics
- Keep a final “holdout” test period
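For the simple setup, the percentage split translates directly into index boundaries. A sketch with an assumed series length of 200 and a 70/15/15 split:

```python
# Turn a 70/15/15 time split into concrete index ranges.
n = 200
train_end = int(n * 0.70)
val_end = int(n * 0.85)

train_idx = range(0, train_end)       # oldest data
val_idx = range(train_end, val_end)   # next block
test_idx = range(val_end, n)          # final block: never touch until the end
```

The only rule that matters: indices must be contiguous in time, with test strictly newest.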
Step 3 — Implement the baseline ladder
Build baselines from weakest to strongest. Stop when you hit “strong enough to be annoying to beat.” For many datasets, seasonal naive or ETS is that line.
Baseline ladder (copy/paste mental model)
| Level | Baseline | Formula / idea | Why it’s useful |
|---|---|---|---|
| 0 | Mean | Predict overall mean | Sanity check |
| 1 | Naive | ŷ[t] = y[t-1] | Hard baseline for noisy series |
| 2 | Seasonal naive | ŷ[t] = y[t-s] | Beats most “fancy” models when seasonality exists |
| 3 | Moving average | Mean of last k values | Noise reduction |
| 4 | ETS | Smoothing + trend + season | Strong classical model with minimal tuning |
Step 4 — Pick metrics that match the goal
Don’t overthink metrics, but do pick at least two: one absolute error and one scale-free percentage-like score.
Good defaults
- MAE: easy to understand
- RMSE: penalizes big misses
- SMAPE: stable-ish percentage error
When to avoid MAPE
MAPE explodes near zero. If your target can be 0 or tiny (demand, clicks), prefer SMAPE or MAE.
Step 5 — Minimal baseline code (Python)
This is intentionally small. The goal is not “perfect library code,” it’s a clean reference you can trust.
```python
# Minimal baselines + rolling backtest (pure Python; swap in pandas Series if you like)

def naive_forecast(y_train, horizon):
    # Predict the last observed value for all future steps
    last = y_train[-1]
    return [last] * horizon

def seasonal_naive_forecast(y_train, horizon, season):
    # Predict the value from one season ago; repeats if horizon > season
    out = []
    for h in range(1, horizon + 1):
        out.append(y_train[-season + ((h - 1) % season)])
    return out

def moving_average_forecast(y_train, horizon, k=7):
    k = min(k, len(y_train))
    avg = sum(y_train[-k:]) / k
    return [avg] * horizon

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rolling_backtest(y, start, horizon, step, forecast_fn):
    # y: full series (list)
    # start: first index where you begin forecasting (e.g. 70% of the data)
    scores = []
    t = start
    while t + horizon <= len(y):
        train = y[:t]
        test = y[t:t + horizon]
        pred = forecast_fn(train, horizon)
        scores.append(mae(test, pred))
        t += step
    return scores

# Example usage:
# scores = rolling_backtest(
#     y, start=200, horizon=14, step=7,
#     forecast_fn=lambda tr, h: seasonal_naive_forecast(tr, h, season=7),
# )
# print("MAE mean:", sum(scores) / len(scores), "min/max:", min(scores), max(scores))
```
Always log: baseline score, model score, and the % improvement. If the improvement is tiny and unstable across folds, don’t ship complexity.
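A tiny sketch of that logging habit (the numbers are made up):

```python
# Log baseline score, model score, and relative improvement on every run.
def improvement_pct(baseline_mae, model_mae):
    return 100 * (baseline_mae - model_mae) / baseline_mae

baseline_mae, model_mae = 12.0, 10.5
print(f"baseline={baseline_mae:.2f} model={model_mae:.2f} "
      f"improvement={improvement_pct(baseline_mae, model_mae):.1f}%")
```

If that percentage swings wildly between folds, treat the "win" as noise, not signal.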
Step 6 — Only then: upgrades that usually work
Upgrade the data (often best ROI)
- Fix missing values + outliers
- Aggregate noisy series (hourly → daily) if the decision allows
- Add known future events (holidays, promotions)
- Use multiple related series (hierarchies, groups)
Upgrade the model (when baselines are beaten)
- ETS tuning (trend/season type)
- ARIMA/SARIMA for structured autocorrelation
- Gradient boosting with lag features (tabularized TS)
- Deep learning only when you have scale + complexity
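The "gradient boosting with lag features" option deserves one concrete sketch: tabularizing a series means turning it into (lag features, target) rows that any tabular model can train on. Pure Python here; in practice you would build the same thing with pandas `shift()`.

```python
# Turn a series into supervised-learning rows using past-only lags.
def make_lag_rows(y, lags=(1, 7)):
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(y)):
        features = [y[t - lag] for lag in lags]   # strictly past values
        rows.append((features, y[t]))             # target is the current value
    return rows

rows = make_lag_rows(list(range(10)), lags=(1, 7))
# first row: features [6, 0] (values at t-1 and t-7), target 7
```

Because every feature is a strictly past value, this construction is leak-free by design.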
A simple decision gate
Move to a more complex model only if it: (1) beats seasonal naive on most folds, (2) stays better on the final holdout, and (3) remains stable when you retrain later.
Common mistakes (and how to fix them)
These are the exact traps that make forecasting feel “painful.” Fixing them usually improves results more than switching to a new model.
Mistake 1 — Weak baseline (or no baseline)
If your baseline is “predict the mean,” everything looks impressive.
- Fix: always include seasonal naive (when seasonality exists).
- Fix: compare against ETS for a strong classical reference.
Mistake 2 — Random split on time series
Random split leaks information because the future becomes “training data.”
- Fix: split by time (train older, test newer).
- Fix: use rolling backtests to avoid “one lucky split.”
Mistake 3 — Feature leakage via preprocessing
Scaling, imputation, and rolling features can leak future info.
- Fix: fit preprocessing on training folds only.
- Fix: compute rolling features using past-only windows.
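The "past-only windows" fix, sketched in a few lines: the window for time t ends at t-1, so no future point can leak into the feature.

```python
# Rolling mean that only looks backward; a centered window would leak.
def past_only_rolling_mean(y, k):
    out = []
    for t in range(len(y)):
        window = y[max(0, t - k):t]   # strictly before t
        out.append(sum(window) / len(window) if window else None)
    return out

feat = past_only_rolling_mean([10, 20, 30, 40], k=2)
# → [None, 10.0, 15.0, 25.0]
```

The `None` at the start is honest: at t=0 there is no past, so there is no feature.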
Mistake 4 — Chasing a single metric
A model can “win” on one metric and still be worse for the business.
- Fix: track MAE + a scale-free metric (SMAPE).
- Fix: check errors on important segments (weekends, holidays, peaks).
Plot residuals (actual − forecast) over time. If errors explode during certain weeks, your model is missing an event/holiday/seasonal structure—add it explicitly.
FAQ
What’s the best baseline for forecasting?
If you have clear seasonality (daily/weekly/yearly), start with seasonal naive. It’s simple, brutally strong, and exposes whether your “smart model” adds real value. If seasonality is weak or unknown, use naive + moving average as quick references.
What is ETS, and why is it such a good baseline?
ETS (Exponential Smoothing) is a family of classical methods that model level, trend, and seasonality with smoothing. It’s popular as a baseline because it often performs surprisingly well with minimal tuning.
How many backtest folds do I need?
For most projects, 3–10 folds is enough. Use more folds if the series is unstable (lots of drift) or if you need high confidence for deployment. Always keep a final holdout period that you only evaluate once.
Why does my model do well at 1-day ahead but fails at 30-days ahead?
Longer horizons require stronger structure. At 1-day ahead, yesterday is often a great predictor. At 30-days ahead, you need seasonality, trend stability, and known future factors (calendar/events), otherwise forecasts become “guessy.”
Should I use deep learning for time series?
Only if the problem justifies it: lots of data, many related series, complex patterns, or a need to model nonlinear relationships with many features. For many forecasting tasks, strong baselines + classical methods + simple ML are easier to maintain and just as accurate.
Cheatsheet: baseline-first forecasting (copy this)
Baseline ladder
- Mean (sanity)
- Naive: ŷ[t] = y[t-1]
- Seasonal naive: ŷ[t] = y[t-s]
- Moving average: mean of last k
- ETS: level + trend + season (strong classical)
Evaluation checklist
- Split by time (never random)
- Rolling backtest (3–10 folds)
- Report mean + spread
- Use MAE + SMAPE (good defaults)
- Keep a final holdout test set
Leakage sniff test
- If score is “shockingly good,” assume leakage first
- Ensure preprocessing is fit on train folds only
- Ensure rolling features use past-only windows
- Ensure “known future” features are truly known at forecast time
A model is worth keeping only if it beats seasonal naive on most folds, stays better on the holdout, and remains stable across retrains. Anything else is likely overfit or too fragile for production.
Wrap-up: make forecasting boring (that’s the win)
The goal isn’t to build the fanciest model—it’s to make accurate forecasts you can trust. Start with strong baselines (especially seasonal naive), evaluate with time-aware backtests, and only then move up the complexity ladder.
- Pick your season length (s) and run seasonal naive.
- Backtest 5 folds and record MAE + SMAPE.
- Try ETS and compare.
- Only if you beat both: add features or move to ML.
Quiz
Quick self-check. This quiz is here to test if you learned the baseline-first mindset.