Evaluation¶
Work in progress — playground deployment
This page documents the evaluation framework currently deployed on the playground. It is a draft and may change before the production rollout — treat figures and examples here as preliminary.
Every Predico forecast goes through three scoring layers, applied in order:
- Per-submission contribution: score one forecast value against one observation.
- Per-timestamp aggregation: combine contributions when several forecasts cover the same target timestamp.
- Daily score: collapse the day's per-timestamp values into a single number.
In every layer, lower is better.
Looking for framework-specific operational details? Jump straight to:
- Intraday Evaluation: slot model, intraday forward-fill, cross-horizon substitution, and full-coverage qualification.
- Day-Ahead & Extended Evaluation: per-target-day scoring and mandatory session-level compliance (submit to every opened session of the target day).
Layer 1 - Per-submission contribution¶
A contribution is the score of one forecast value against one observation. It depends on the variable being forecast.
Q50 (deterministic forecast)¶
For a Q50 forecast f and observation o, the contribution is the squared residual:

$$
(f - o)^2
$$
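As a minimal sketch of the Q50 contribution, the squared residual can be computed as below. The function name `q50_contribution` is hypothetical and chosen only for illustration; it is not part of any Predico API.

```python
def q50_contribution(forecast: float, observation: float) -> float:
    """Squared residual between one Q50 forecast value and one observation.

    Hypothetical helper for illustration only.
    """
    residual = forecast - observation
    return residual * residual


# Over- and under-forecasting by the same amount score the same.
print(q50_contribution(105.0, 100.0))  # 25.0
print(q50_contribution(95.0, 100.0))   # 25.0
```

Because the residual is squared, an error twice as large contributes four times as much.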
Q10 / Q90 (probabilistic forecast)¶
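The Q10 and Q90 quantiles together define an 80% prediction interval. Their joint contribution is the Winkler interval value, with α = 0.2. In the standard Winkler form, each tail penalty is scaled by 2/α = 10:

$$
W(q_{10}, q_{90}, o) =
\begin{cases}
(q_{90} - q_{10}) + \dfrac{2}{\alpha}\,(q_{10} - o), & o < q_{10} \\
(q_{90} - q_{10}), & q_{10} \le o \le q_{90} \\
(q_{90} - q_{10}) + \dfrac{2}{\alpha}\,(o - q_{90}), & o > q_{90}
\end{cases}
$$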
Three regimes:
- q10 ≤ o ≤ q90: the observation falls inside the interval. Contribution = interval width only. Narrow + correct → low score.
- o < q10: the observation falls below the lower bound. Contribution = width plus lower-tail penalty.
- o > q90: the observation falls above the upper bound. Contribution = width plus upper-tail penalty.
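The three regimes can be sketched in a few lines of Python. This assumes the standard Winkler definition, in which each tail penalty is the distance outside the interval multiplied by 2/α; the function name `winkler_contribution` is hypothetical, and the sketch does not cover crossed quantiles (q10 > q90), which this page does not address.

```python
ALPHA = 0.2  # 80% prediction interval, as stated on this page


def winkler_contribution(q10: float, q90: float, observation: float,
                         alpha: float = ALPHA) -> float:
    """Winkler interval value for one (q10, q90) pair and one observation.

    Assumption: standard Winkler form with tail penalty factor 2 / alpha.
    """
    width = q90 - q10
    if observation < q10:
        # Below the lower bound: width plus lower-tail penalty.
        return width + (2.0 / alpha) * (q10 - observation)
    if observation > q90:
        # Above the upper bound: width plus upper-tail penalty.
        return width + (2.0 / alpha) * (observation - q90)
    # Inside the interval: width only.
    return width


inside = winkler_contribution(90.0, 110.0, 100.0)  # width 20
below = winkler_contribution(90.0, 110.0, 85.0)    # 20 + 10 * 5
above = winkler_contribution(90.0, 110.0, 112.0)   # 20 + 10 * 2
```

Under this assumption, missing the interval by one unit costs as much as widening it by ten units, so overly narrow intervals are penalised quickly.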
Layer 2 - Per-timestamp aggregation¶
Multiple submissions may cover the same target timestamp t. Every eligible submission whose forecast window contains t contributes one Layer 1 value.
Layer 2 combines them into a single per-timestamp value with a plain arithmetic mean. Each submission counts the same, regardless of lead time.
The per-timestamp value is the arithmetic mean of the contributions, not of the forecasts themselves
First compute the Layer 1 contribution of each forecast covering the target timestamp, then average those contributions to get the per-timestamp value. Averaging the forecasts first and then computing a single contribution yields a different result.
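A small numeric sketch shows why the order of operations matters. It uses the Q50 squared residual as the Layer 1 contribution; the helper names `squared_residual` and `per_timestamp_value` are hypothetical.

```python
from statistics import fmean


def squared_residual(forecast: float, observation: float) -> float:
    """Layer 1 contribution for a Q50 forecast (illustrative helper)."""
    return (forecast - observation) ** 2


def per_timestamp_value(forecasts: list[float], observation: float) -> float:
    """Layer 2: plain mean of the Layer 1 contributions, equal weights."""
    contributions = [squared_residual(f, observation) for f in forecasts]
    return fmean(contributions)


# Two submissions cover the same target timestamp.
forecasts = [90.0, 110.0]
observation = 100.0

# Score each forecast, then average the contributions.
mean_of_contributions = per_timestamp_value(forecasts, observation)

# Not what Layer 2 does: average the forecasts, then score once.
contribution_of_mean = squared_residual(fmean(forecasts), observation)

print(mean_of_contributions)  # 100.0
print(contribution_of_mean)   # 0.0
```

Here the two forecasts miss by 10 in opposite directions: their average happens to be perfect, but each submission is still scored on its own error.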
For full operational rules: session-level compliance, missed-session handling, and qualification edge cases, see Intraday Evaluation and Day-Ahead & Extended Evaluation.
Layer 3 - Daily score¶
Once each target timestamp t of the day has its Layer 2 value per_t, the day score is one of:
| Variable | Layer 2 output (per_t) | Daily score |
|---|---|---|
| Q50 | mean of squared residuals at t | sqrt(mean_t(per_t)), the root mean squared error (RMSE) |
| Q10 / Q90 | mean Winkler value at t | mean_t(per_t), the mean Winkler interval value (MWI) |
The square root is applied exactly once
For Q50, take the square root once, on the day's mean of per-timestamp values — not per timestamp. Rooting each per_t first and then averaging would, by Jensen's inequality, systematically underestimate the true RMSE and understate the large-error timestamps the metric is meant to penalise.
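The effect of rooting once versus rooting per timestamp can be checked with a short sketch. The function names `daily_rmse` and `daily_mwi` and the sample values are illustrative assumptions, not part of the platform.

```python
from math import sqrt
from statistics import fmean


def daily_rmse(per_t: list[float]) -> float:
    """Q50 daily score: one square root, applied to the day's mean."""
    return sqrt(fmean(per_t))


def daily_mwi(per_t: list[float]) -> float:
    """Q10 / Q90 daily score: plain mean of per-timestamp Winkler values."""
    return fmean(per_t)


# Per-timestamp mean squared residuals: three small errors, one large one.
per_t = [1.0, 1.0, 1.0, 81.0]

correct = daily_rmse(per_t)                    # sqrt(84 / 4) = sqrt(21)
root_first = fmean([sqrt(v) for v in per_t])   # (1 + 1 + 1 + 9) / 4

print(round(correct, 3))  # 4.583
print(root_first)         # 3.0
```

Rooting each value first gives 3.0 instead of about 4.583, which dilutes the single large-error timestamp.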
Observation gaps don't count against you
If the buyer's measurements are incomplete for a target day, the day is skipped for everyone across every horizon. No score, no penalty, no missed-day mark. Buyer-side data hiccups never propagate into your record.
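The skip rule can be sketched as a guard in front of the daily computation. This page does not define exactly when measurements count as incomplete, so the sketch assumes that any missing observation (represented as `None`) skips the whole day; the function name `q50_day_score` and the one-forecast-per-timestamp simplification are also assumptions.

```python
from math import sqrt
from typing import Optional, Sequence


def q50_day_score(forecasts: Sequence[float],
                  observations: Sequence[Optional[float]]) -> Optional[float]:
    """Return the day's RMSE, or None when the day is skipped.

    Assumption: a single missing observation makes the day incomplete.
    """
    if not observations or any(o is None for o in observations):
        return None  # no score, no penalty, no missed-day mark
    squared = [(f - o) ** 2 for f, o in zip(forecasts, observations)]
    return sqrt(sum(squared) / len(squared))


print(q50_day_score([2.0, 2.0], [0.0, 0.0]))   # 2.0
print(q50_day_score([2.0, 2.0], [0.0, None]))  # None
```

Returning `None` rather than a numeric placeholder keeps skipped days out of any later averages.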
Score recalculation¶
Because data providers sometimes revise measurements, scores are recalculated on a rolling schedule:
- Every hour: score the previous day's forecasts.
- Through day 7 of the next month: recompute scores for the current and previous months.
- After day 7: recompute the current month only. Scores for the previous month are locked in, even if data providers later revise their measurements for that month. This is a hard cutoff, with no exceptions.
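The monthly cutoff can be sketched as a function that returns which months are still open for recomputation on a given date. The name `months_to_recompute` is hypothetical, and the sketch assumes that "through day 7" includes day 7 itself and that the date is already in whatever time zone the platform uses, neither of which is stated here.

```python
from datetime import date


def months_to_recompute(today: date) -> list[tuple[int, int]]:
    """Return the (year, month) pairs whose scores may still change.

    Assumption: day 7 is inclusive, so the previous month locks on day 8.
    """
    current = (today.year, today.month)
    if today.day > 7:
        # After day 7 the previous month is locked in.
        return [current]
    if today.month == 1:
        previous = (today.year - 1, 12)
    else:
        previous = (today.year, today.month - 1)
    return [previous, current]


print(months_to_recompute(date(2025, 1, 7)))  # [(2024, 12), (2025, 1)]
print(months_to_recompute(date(2025, 1, 8)))  # [(2025, 1)]
```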