Evaluation¶

Work in progress — playground deployment

This page documents the evaluation framework currently deployed on the playground. It is a draft and may change before the production rollout — treat figures and examples here as preliminary.

Every Predico forecast goes through three scoring layers, applied in order:

Per-submission contribution: score one forecast value against one observation.
Per-timestamp aggregation: combine contributions when several forecasts cover the same target timestamp.
Daily score: collapse the day's per-timestamp values into a single number.

Each layer reduces dimensionality. Layer 1 produces one contribution per (forecast, observation) pair. Layer 2 collapses contributions to one value per target timestamp. Layer 3 collapses the day's timestamp values into a single number: RMSE for Q50, and the Mean Winkler Interval (MWI) for Q10/Q90.

In every layer, lower is better.

Looking for framework-specific operational details? Jump straight to:

Intraday Evaluation: slot model, intraday forward-fill, cross-horizon substitution, and full-coverage qualification.
Day-Ahead & Extended Evaluation: per-target-day scoring and mandatory session-level compliance (submit to every opened session of the target day).

Layer 1 - Per-submission contribution¶

A contribution is the score of one forecast value against one observation. It depends on the variable being forecast.

Q50 (deterministic forecast)¶

For a Q50 forecast f and observation o, the contribution is the squared residual:

contribution = (o − f)²

No square root is taken at this layer. The root that turns the squared residuals into RMSE is applied once, at Layer 3.
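As a minimal sketch of the Q50 contribution described above: the function name `q50_contribution` and the sample values are illustrative only and are not part of any Predico API.

```python
def q50_contribution(observation: float, forecast: float) -> float:
    """Layer 1 contribution for a Q50 forecast: the squared residual.

    Hypothetical helper for illustration; no square root is taken here.
    """
    residual = observation - forecast
    return residual * residual


# A forecast of 12.0 against an observation of 10.0 has residual -2.0.
print(q50_contribution(10.0, 12.0))  # 4.0
```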

Q10 / Q90 (probabilistic forecast)¶

The Q10 and Q90 quantiles together define an 80% prediction interval. Their joint contribution is the Winkler interval value, with α = 0.2:

contribution = (q90 − q10) + (2/α) · max(0, q10 − o) + (2/α) · max(0, o − q90)

Three regimes:

q10 ≤ o ≤ q90 — observation falls inside the interval. Contribution = interval width only. Narrow + correct → low score.
o < q10 — observation falls below the lower bound. Contribution = width plus lower-tail penalty.
o > q90 — observation falls above the upper bound. Contribution = width plus upper-tail penalty.

A narrow interval that brackets the observation scores best. Excursions on either side are penalised in proportion to how far they fall outside the interval: with α = 0.2, each unit of excursion adds 2/α = 10 units to the contribution.
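The three regimes can be checked with a short sketch of the formula above. The function name `winkler_contribution` and the sample interval are assumptions made for illustration; only the formula and α = 0.2 come from this page.

```python
ALPHA = 0.2  # 80% prediction interval, as stated above


def winkler_contribution(q10: float, q90: float, observation: float,
                         alpha: float = ALPHA) -> float:
    """Layer 1 contribution for a Q10/Q90 pair (hypothetical helper)."""
    width = q90 - q10
    below = max(0.0, q10 - observation)  # excursion under the lower bound
    above = max(0.0, observation - q90)  # excursion over the upper bound
    return width + (2.0 / alpha) * below + (2.0 / alpha) * above


# Made-up interval [8, 12] scored against three observations.
inside = winkler_contribution(8.0, 12.0, 10.0)       # width only
below_lower = winkler_contribution(8.0, 12.0, 7.0)   # width + 10 * 1.0
above_upper = winkler_contribution(8.0, 12.0, 13.5)  # width + 10 * 1.5
print(inside, below_lower, above_upper)
```

With these values the inside case scores the width (4), while the two misses add 10 and 15 on top of it.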

Layer 2 - Per-timestamp aggregation¶

Multiple submissions may cover the same target timestamp t. Every eligible submission whose forecast window contains t occupies one slot and contributes one Layer 1 value. Layer 2 combines these values into a single per-timestamp value with a plain arithmetic mean: every slot carries the same weight, regardless of lead time.

The per-timestamp value is the mean of the contributions, not of the forecasts

First compute the Layer 1 contribution of each forecast covering the target timestamp, then average those contributions to get the per-timestamp value. Averaging the forecasts first and then scoring the averaged forecast yields a different result, because the contribution is not linear in the forecast.
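A small sketch with made-up numbers shows how far the two orders of operation can diverge. The helper name `squared_residual` and the forecast values are illustrative assumptions, using the Q50 contribution from Layer 1.

```python
from statistics import fmean


def squared_residual(observation: float, forecast: float) -> float:
    """Q50 contribution from Layer 1 (hypothetical helper)."""
    return (observation - forecast) ** 2


observation = 10.0
forecasts = [8.0, 12.0]  # two submissions covering the same timestamp

# Layer 2 as described: score each forecast, then average the contributions.
contributions = [squared_residual(observation, f) for f in forecasts]
per_t = fmean(contributions)

# Not Layer 2: average the forecasts first, then score the averaged forecast.
wrong = squared_residual(observation, fmean(forecasts))

print(per_t)  # 4.0
print(wrong)  # 0.0
```

Here the two errors cancel when the forecasts are averaged, so the wrong order reports a perfect score even though each submission missed by 2.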

For the full operational rules (session-level compliance, missed-session handling, and qualification edge cases), see Intraday Evaluation and Day-Ahead & Extended Evaluation.

Layer 3 - Daily score¶

Once each target timestamp t of the day has its Layer 2 value per_t, the daily score is computed as follows, depending on the variable:

Variable	Layer 2 output (`per_t`)	Daily score
Q50	mean of squared residuals at `t`	`sqrt(mean_t(per_t))` — root mean squared residuals (RMSE)
Q10 / Q90	mean Winkler value at `t`	`mean_t(per_t)` — MWI

The square root is applied exactly once

For Q50, take the square root once, on the day's mean of per-timestamp values — not per timestamp. Rooting each per_t first and then averaging would, by Jensen's inequality, systematically underestimate the true RMSE and understate the large-error timestamps the metric is meant to penalise.

The day's per-timestamp values are averaged into a single number. For Q50, the square root is taken once, turning mean squared residuals into RMSE. For Q10/Q90, the daily Mean Winkler Interval (MWI) is the arithmetic mean directly.
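The Layer 3 reduction can be sketched as follows. The function names `daily_rmse` and `daily_mwi` and the sample per-timestamp values are assumptions for illustration, not the deployed implementation.

```python
import math
from statistics import fmean


def daily_rmse(per_t: list[float]) -> float:
    """Q50 daily score: one square root, taken on the day's mean."""
    return math.sqrt(fmean(per_t))


def daily_mwi(per_t: list[float]) -> float:
    """Q10/Q90 daily score: plain mean of per-timestamp Winkler values."""
    return fmean(per_t)


# Made-up Layer 2 outputs (mean squared residuals) for four timestamps.
per_t_q50 = [1.0, 9.0, 1.0, 25.0]

rmse = daily_rmse(per_t_q50)                                 # sqrt(36 / 4)
mean_of_roots = fmean([math.sqrt(v) for v in per_t_q50])     # (1 + 3 + 1 + 5) / 4

print(rmse)           # 3.0
print(mean_of_roots)  # 2.5
```

Rooting per timestamp gives 2.5 instead of 3.0 here, which illustrates the underestimate that taking the root only once avoids.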

Observation gaps don't count against you

If the buyer's measurements are incomplete for a target day, the day is skipped for everyone across every horizon. No score, no penalty, no missed-day mark. Buyer-side data hiccups never propagate into your record.

Score recalculation¶

Because data providers sometimes revise measurements, scores are recalculated on a rolling schedule:

Every hour: score the previous day's forecasts.
Through day 7 of each month: recompute scores for the current and previous months, so a month's scores stay open to revision until day 7 of the following month.
After day 7: recompute the current month only. Scores for the previous month are locked in, even if data providers later revise their measurements for that month. This is a hard cutoff with no exceptions.
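The monthly part of the schedule above can be expressed as a small date rule. This is a sketch under stated assumptions: the function name `months_to_recompute` is hypothetical, "through day 7" is read as inclusive of day 7, and time zones are ignored.

```python
from datetime import date

LOCK_DAY = 7  # assumption: day 7 itself still recomputes the previous month


def months_to_recompute(today: date) -> list[tuple[int, int]]:
    """Return the (year, month) pairs whose scores are still recomputed."""
    current = (today.year, today.month)
    if today.day > LOCK_DAY:
        # Previous month is locked in; only the current month stays open.
        return [current]
    if today.month == 1:
        previous = (today.year - 1, 12)  # January wraps to last December
    else:
        previous = (today.year, today.month - 1)
    return [previous, current]


print(months_to_recompute(date(2025, 1, 7)))  # [(2024, 12), (2025, 1)]
print(months_to_recompute(date(2025, 1, 8)))  # [(2025, 1)]
```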