Can the latest ML models beat our legacy model?
2025-06-01
Spoiler: The shiny new algorithms didn’t run away with it—but the experiment still paid dividends.
Modern data‑science tooling has made it almost effortless to spin up sophisticated time‑series models. Auto‑ARIMA, Prophet, XGBoost, scikit‑learn pipelines: you name it, they all promise plug‑and‑play accuracy. So when our analytics team set out to refresh a 20‑year‑old forecasting workflow, I wondered:
If I spend just one afternoon wiring up SARIMAX, Prophet, and a Gradient‑Boosting Regressor (GBR), can any of them immediately outperform our crusty baselines?
Below is the story of that ultra‑low‑effort experiment: what we tested, how we measured success, and why the simplest method still came out on top.
1. Forecasting problem
Our mandate is deceptively simple: take a partial year of monthly data and predict the full‑year total. The series (we track a few thousand of them) are:
- Highly seasonal (strong peaks in April–June, tapering thereafter)
- Non‑stationary in level but with relatively stable seasonal ratios year over year
- Subject to outlier detection—we flag anything that looks off before it flows into budgeting models
Crucially, management wanted a quick gut‑check: “Can you prove the fancy stuff is better before we invest weeks of tuning?”
2. Baseline techniques
2.1 Cumulative‑ratio estimator ("Percentage method")
If Cₘ is the cumulative total through month m and pₘ is the historical share of the annual total typically accrued by that point, our forecast is simply:

Ŷ = Cₘ / pₘ
Over two decades of data, those monthly ratios are remarkably stable—think accounting accrual calendars and statutory filing deadlines.
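As a minimal sketch, here is how that estimator might look in pandas. The name `ratio_forecast` and its arguments are illustrative, not our production code, and it assumes `history` holds complete calendar years of monthly data on a DatetimeIndex:

```python
import pandas as pd

def ratio_forecast(history: pd.Series, year_to_date: pd.Series) -> float:
    """Percentage method: divide the cumulative year-to-date total by the
    historical share of the annual total accrued by the same month."""
    # p_m: within each complete year, the fraction of the annual total
    # accrued through month m, averaged across all years in `history`.
    shares = history.groupby(history.index.year).transform(
        lambda s: s.cumsum() / s.sum()
    )
    p = shares.groupby(shares.index.month).mean()

    m = year_to_date.index[-1].month   # last observed month
    c_m = year_to_date.sum()           # cumulative total C_m
    return c_m / p.loc[m]              # Ŷ = C_m / p_m
```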
2.2 Annual‑run‑rate (ARR)
We assume activity is uniform for the remaining months:
Ŷ = (12 / m) * Cₘ
Surprisingly, ARR holds up well when a series lacks seasonality; otherwise, it over‑ or under‑shoots spectacularly.
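In code, the ARR baseline is nearly a one‑liner in the same hypothetical style, again assuming a contiguous January‑through‑cut‑off pandas Series:

```python
def arr_forecast(year_to_date: pd.Series) -> float:
    """Annual run rate: extrapolate the year-to-date total uniformly."""
    m = len(year_to_date)       # months observed so far
    c_m = year_to_date.sum()    # cumulative total C_m
    return (12 / m) * c_m       # Ŷ = (12 / m) * C_m
```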
3. Challenger models
| Model | Library | Key assumptions |
|---|---|---|
| SARIMAX | statsmodels | Additive seasonality + optional exogenous regressors (none used) |
| Prophet | prophet | Additive seasonality + piece‑wise linear trend; built‑in holiday effects available (none used) |
| GBR | sklearn | Tree‑based, handles non‑linearity; fed with lags 1 and 12 plus month dummies |
Why SARIMAX? I wanted the setup to support exogenous variables from the outset; that capability is unnecessary for now, but may be useful later.
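Here is a minimal sketch of how the three challengers can be wired up for one series. The (1,1,1)×(1,1,1,12) SARIMAX orders, the default hyper‑parameters, and the recursive GBR loop are illustrative assumptions, not a record of our exact setup:

```python
import pandas as pd
from prophet import Prophet
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_challengers(y: pd.Series, horizon: int = 6) -> dict:
    """Fit the three challengers on one monthly series (DatetimeIndex at
    month starts, truncated at the forecast origin) and return each
    model's forecast for the next `horizon` months."""
    future_idx = pd.date_range(y.index[-1], periods=horizon + 1, freq="MS")[1:]

    # SARIMAX: default-ish orders, no exogenous regressors yet
    sarimax = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
    f_sarimax = pd.Series(sarimax.forecast(steps=horizon).values, index=future_idx)

    # Prophet: expects a two-column frame named ds / y
    m = Prophet()
    m.fit(pd.DataFrame({"ds": y.index, "y": y.values}))
    future = m.make_future_dataframe(periods=horizon, freq="MS")
    f_prophet = pd.Series(m.predict(future)["yhat"].tail(horizon).values, index=future_idx)

    # GBR: lag-1, lag-12 and month-dummy features, forecast recursively
    def features(series: pd.Series) -> pd.DataFrame:
        X = pd.DataFrame({"lag1": series.shift(1), "lag12": series.shift(12)})
        dummies = pd.get_dummies(series.index.month, dtype=float).reindex(
            columns=range(1, 13), fill_value=0.0
        ).add_prefix("m")
        dummies.index = series.index
        return X.join(dummies)

    train = features(y).dropna()                      # lag-12 eats the first year
    gbr = GradientBoostingRegressor().fit(train, y.loc[train.index])
    hist = y.copy()
    for ts in future_idx:
        hist.loc[ts] = float("nan")                   # placeholder for month ts
        hist.loc[ts] = gbr.predict(features(hist).loc[[ts]])[0]
    f_gbr = hist.loc[future_idx]

    return {"SARIMAX": f_sarimax, "Prophet": f_prophet, "GBR": f_gbr}
```

The recursive loop feeds each GBR prediction back in as the next month's lag‑1 feature, which is the usual way to coax multi‑step forecasts out of a one‑step tree model.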
4. Experimental design
- Data window: 2014–2024 monthly observations per series
- Hold‑out year: 2024 (full 12 months available for ground truth)
- Cut‑off month: June 2024 (so each model sees only the first 6 months)
- Metric: Mean Squared Error (MSE) of the annual prediction
- Winner per series: model with lowest MSE; ties broken by simplicity
A full‑year back‑test across ~1,500 series ran end‑to‑end in about 30 minutes, including chart generation.
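A sketch of that back‑test loop, reusing the hypothetical helpers above; `all_series` (mapping series names to full monthly histories through December 2024) and the hard‑coded 2024 dates are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def backtest(all_series: dict, cutoff: str = "2024-06-30") -> tuple:
    """Score every model on every series. The annual prediction is the
    actual Jan-Jun total plus the model's forecast for the rest of 2024."""
    sq_errors = {}   # model -> squared errors across series
    winners = {}     # series -> model with the lowest squared error
    for name, y in all_series.items():
        truth = y.loc["2024"].sum()                 # full-year ground truth
        ytd = y.loc["2024-01-01":cutoff]            # first six months, actuals
        preds = {
            "Cumulative-ratio": ratio_forecast(y.loc[:"2023"], ytd),
            "ARR": arr_forecast(ytd),
        }
        for model, fcst in fit_challengers(y.loc[:cutoff]).items():
            preds[model] = ytd.sum() + fcst.sum()   # complete the year
        sq = {mdl: (p - truth) ** 2 for mdl, p in preds.items()}
        winners[name] = min(sq, key=sq.get)         # ties: first (simplest) wins
        for mdl, e in sq.items():
            sq_errors.setdefault(mdl, []).append(e)
    mse = {mdl: np.mean(v) for mdl, v in sq_errors.items()}
    return winners, mse
```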
5. Results
| Overall rank | Model | Share of wins | MSE rank |
|---|---|---|---|
| 🥇 | Cumulative‑ratio | 26% | 1 |
| 🥈 | GBR | 23% | 2 |
| 🥉 | Prophet | 27% | 4 |
| 4 | ARR | 15% | 3 |
| 5 | SARIMAX | 9% | 5 |
The overall rank is somewhat subjective: Prophet has more raw wins than GBR, but a significantly worse overall MSE.
6. Why did the heuristics win?
- **Stable seasonal ratios.** Our series follow institutional calendars; the share earned by June has barely drifted in 20 years.
- **Low effort ≠ robustness.** SARIMAX needs careful differencing, season‑length tuning, and sometimes exogenous drivers; Prophet shines with customised holiday effects. We gave them none.
- **Feature starvation.** GBR eked out a close second because lag and month‑dummy features encode seasonality directly. Give it richer features (macro factors, cross‑sectional pooling) and it might overtake.
7. Lessons & next steps
- Know thy data. Diagnostics showed near‑constant seasonal shares—a green light for ratio‑based estimation.
- Benchmarks first. A two‑line heuristic can set a surprisingly high bar. New models must earn their deployment.
- Don’t confuse tooling effort with model complexity. Auto‑fit libraries let you train complex models quickly, but they still demand domain insight.
- Where to go from here:
  - Grid‑search SARIMAX seasonal parameters
  - Add holiday / fiscal‑year dummy regressors to Prophet
  - Build a pooled panel model to borrow strength across series (e.g., Bayesian hierarchical or global RNN)
  - Automate feature‑store creation for GBR and test XGBoost/LightGBM
8. Conclusion
Was I disappointed? A little. I wanted the modern algos to obliterate the legacy model. Instead, the experiment reminded me that:
Even simple methods built on strong domain priors can trounce a poorly tuned ML model.
The effort wasn’t wasted, though—now we have a reproducible benchmark suite and a clear path for incremental improvements. Next quarter we’ll revisit with proper feature engineering and see if the fancy stuff can finally claim gold.
Stay tuned.