Evaluation

Work in progress — playground deployment

This page documents the evaluation framework currently deployed on the playground. It is a draft and may change before the production rollout — treat figures and examples here as preliminary.

Every Predico forecast goes through three scoring layers, applied in order:

  1. Per-submission contribution: score one forecast value against one observation.
  2. Per-timestamp aggregation: combine contributions when several forecasts cover the same target timestamp.
  3. Daily score: collapse the day's per-timestamp values into a single number.

[Figure: Layer 1 (per-submission contribution) → Layer 2 (per-timestamp aggregation) → Layer 3 (daily score: RMSE / MWI)]
Each layer reduces dimensionality. Layer 1 produces one contribution per (forecast, observation) pair. Layer 2 collapses contributions to one value per target timestamp. Layer 3 collapses the day's timestamp values into a single number: RMSE for Q50, MWI for Q10/Q90.

In every layer, lower is better.

Looking for framework-specific operational details? Jump straight to:

  • Intraday Evaluation: slot model, intraday forward-fill, cross-horizon substitution, and full-coverage qualification.
  • Day-Ahead & Extended Evaluation: per-target-day scoring and mandatory session-level compliance (submit to every opened session of the target day).

Layer 1 - Per-submission contribution

A contribution is the score of one forecast value against one observation. It depends on the variable being forecast.

Q50 (deterministic forecast)

For a Q50 forecast f and observation o, the contribution is the squared residual:

contribution = (o − f)²

[Figure: Q50 squared residual, showing forecast f, observation o, and residual = o − f]
The contribution is the residual squared. The square root that turns this into RMSE is taken at Layer 3, not here.
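
As a minimal sketch of the Q50 contribution, the formula above can be written as a one-line function. The function name and the sample values are illustrative assumptions, not part of the Predico API.

```python
def q50_contribution(forecast: float, observation: float) -> float:
    """Layer 1 contribution for a Q50 forecast: the squared residual (o - f)^2.

    Hypothetical helper for illustration. No square root is taken here;
    per the text, that happens once at Layer 3.
    """
    residual = observation - forecast
    return residual ** 2


# Illustrative values only: forecast 10.0 against observation 12.0.
print(q50_contribution(10.0, 12.0))  # 4.0
```

Because the residual is squared, over-forecasting and under-forecasting by the same amount give the same contribution.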

Q10 / Q90 (probabilistic forecast)

The Q10 and Q90 quantiles together define an 80% prediction interval. Their joint contribution is the Winkler interval value, with α = 0.2:

contribution = (q90 − q10) + (2/α) · max(0, q10 − o) + (2/α) · max(0, o − q90)

Three regimes:

  • q10 ≤ o ≤ q90 — observation falls inside the interval. Contribution = interval width only. Narrow + correct → low score.
  • o < q10 — observation falls below the lower bound. Contribution = width plus lower-tail penalty.
  • o > q90 — observation falls above the upper bound. Contribution = width plus upper-tail penalty.

[Figure: Q10/Q90 Winkler interval value in three panels. o inside [q10, q90]: width only. o below q10: width plus lower-tail penalty. o above q90: width plus upper-tail penalty.]
A narrow interval that brackets the observation scores best. Excursions on either side are penalised in proportion to how far they fall outside the interval.
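
The three regimes can be checked with a short sketch of the Winkler formula above. The function name and the sample interval are assumptions made for illustration; the sketch also assumes q10 ≤ q90, since the text does not say how crossed quantiles are treated.

```python
ALPHA = 0.2  # 80% prediction interval, as stated in the text


def winkler_contribution(q10: float, q90: float, observation: float,
                         alpha: float = ALPHA) -> float:
    """Layer 1 contribution for a Q10/Q90 pair (Winkler interval value).

    Hypothetical helper; assumes q10 <= q90.
    """
    width = q90 - q10
    # At most one of the two penalties is non-zero for a given observation.
    lower_penalty = (2.0 / alpha) * max(0.0, q10 - observation)
    upper_penalty = (2.0 / alpha) * max(0.0, observation - q90)
    return width + lower_penalty + upper_penalty


# Illustrative interval [8, 12], width 4. With alpha = 0.2 the penalty
# multiplier 2/alpha is 10.
inside = winkler_contribution(8.0, 12.0, 10.0)   # width only
below = winkler_contribution(8.0, 12.0, 7.0)     # width + 10 * 1.0
above = winkler_contribution(8.0, 12.0, 13.5)    # width + 10 * 1.5
print(inside, below, above)
```

With α = 0.2, every unit by which the observation misses the interval costs as much as widening the interval by ten units.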

Layer 2 - Per-timestamp aggregation

Multiple submissions may cover the same target timestamp t. Every eligible submission whose forecast window contains t contributes one Layer 1 value. Layer 2 combines them into a single per-timestamp value with a plain arithmetic mean. Each submission counts the same, regardless of lead time.

[Figure: per-timestamp value as an arithmetic mean. N filled slots cover target timestamp t, each with weight w = 1: value at t = (c₁ + c₂ + … + c_N) / N]
Each slot is one submission whose forecast window contains `t`. All slots contribute equally to the per-timestamp value, regardless of lead time.

The per-timestamp value is the arithmetic mean of the contributions, not of the forecasts themselves

One must first calculate the contribution (as per Layer 1) of each forecast covering the target timestamp, then average those contributions to get the per-timestamp value. Averaging forecasts first and then calculating contribution will yield a different result.
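
The difference between the two orders of operation shows up even with two submissions. The sketch below uses the Q50 squared residual and made-up forecast values; the helper name is hypothetical.

```python
from statistics import mean


def per_timestamp_value(contributions: list[float]) -> float:
    """Layer 2: plain arithmetic mean of the Layer 1 contributions at one t.

    Hypothetical helper. Raises statistics.StatisticsError on an empty list.
    """
    return mean(contributions)


# Two illustrative Q50 submissions covering the same target timestamp.
forecasts = [8.0, 12.0]
observation = 10.0

# Correct order: score each forecast first, then average the scores.
contributions = [(observation - f) ** 2 for f in forecasts]  # [4.0, 4.0]
correct = per_timestamp_value(contributions)

# Incorrect order: average the forecasts first, then score the average.
wrong = (observation - mean(forecasts)) ** 2

print(correct, wrong)  # 4.0 0.0
```

Here the two forecasts err in opposite directions, so averaging them first would hide both errors entirely.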

For the full operational rules (session-level compliance, missed-session handling, and qualification edge cases), see Intraday Evaluation and Day-Ahead & Extended Evaluation.

Layer 3 - Daily score

Once each target timestamp t of the day has its Layer 2 value per_t, the day score is one of:

| Variable | Layer 2 output (per_t) | Daily score |
|---|---|---|
| Q50 | mean of squared residuals at t | sqrt(mean_t(per_t)), the root mean squared residual (RMSE) |
| Q10 / Q90 | mean Winkler value at t | mean_t(per_t), the Mean Winkler Interval (MWI) |

The square root is applied exactly once

For Q50, take the square root once, on the day's mean of per-timestamp values — not per timestamp. Rooting each per_t first and then averaging would, by Jensen's inequality, systematically underestimate the true RMSE and understate the large-error timestamps the metric is meant to penalise.
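
A small numeric check makes the gap concrete. The two function names and the sample per_t values below are illustrative assumptions; only the first function follows the rule stated above.

```python
import math


def daily_rmse(per_t: list[float]) -> float:
    """Square root taken once, on the day's mean of per-timestamp values."""
    return math.sqrt(sum(per_t) / len(per_t))


def mean_of_roots_incorrect(per_t: list[float]) -> float:
    """Incorrect variant, for comparison only: root each per_t, then average."""
    return sum(math.sqrt(v) for v in per_t) / len(per_t)


# Two illustrative Layer 2 values: one small-error and one large-error timestamp.
per_t = [1.0, 9.0]

print(daily_rmse(per_t))               # sqrt(5.0), about 2.236
print(mean_of_roots_incorrect(per_t))  # (1.0 + 3.0) / 2 = 2.0
```

The incorrect variant comes out lower because the square root is concave, so it dampens the large-error timestamp before the average is taken.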

[Figure: daily score. The per-timestamp values per_t₁ … per_tₙ are averaged into mean_t(per_t). Q50 then takes the square root to give RMSE; Q10 / Q90 uses the mean as-is to give MWI.]
The day's per-timestamp values are averaged into a single number. For Q50, the square root is taken once, turning mean squared residuals into RMSE. For Q10/Q90, the daily Mean Winkler Interval (MWI) is the arithmetic mean directly.
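
Layers 2 and 3 can be chained in a few lines. This is a sketch under stated assumptions: the input is a hypothetical mapping from target timestamp label to that timestamp's Layer 1 contributions, the variable labels "Q50" and "Q10/Q90" are chosen here for illustration, and eligibility and qualification rules are left out.

```python
import math
from statistics import mean


def daily_score(contributions_by_t: dict[str, list[float]], variable: str) -> float:
    """Collapse Layer 1 contributions into one daily number.

    Hypothetical helper: keys are target-timestamp labels, values are the
    Layer 1 contributions of every submission covering that timestamp.
    """
    # Layer 2: one arithmetic mean per target timestamp.
    per_t = [mean(contribs) for contribs in contributions_by_t.values()]
    # Layer 3: mean across the day, with a single square root for Q50.
    day_mean = mean(per_t)
    if variable == "Q50":
        return math.sqrt(day_mean)   # RMSE
    if variable == "Q10/Q90":
        return day_mean              # MWI
    raise ValueError(f"unknown variable: {variable!r}")


# Illustrative squared residuals for three timestamps: per_t = [8, 0, 4].
q50_day = {"t1": [4.0, 12.0], "t2": [0.0], "t3": [4.0, 4.0]}
print(daily_score(q50_day, "Q50"))          # sqrt(4.0) = 2.0

# Illustrative Winkler values for two timestamps: per_t = [9, 6].
interval_day = {"t1": [4.0, 14.0], "t2": [6.0]}
print(daily_score(interval_day, "Q10/Q90"))  # 7.5
```

Each timestamp carries the same weight in the daily mean, however many submissions covered it.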

Observation gaps don't count against you

If the buyer's measurements are incomplete for a target day, the day is skipped for everyone across every horizon. No score, no penalty, no missed-day mark. Buyer-side data hiccups never propagate into your record.

Score recalculation

Because data providers sometimes revise measurements, scores are recalculated on a rolling schedule:

  • Every hour: score the previous day's forecasts.
  • Through day 7 of the next month: recompute scores for the current and previous months.
  • After day 7: recompute the current month only. Scores for the previous month are locked in, even if data providers later revise their measurements for that month. This is a hard cutoff with no exceptions.
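
The monthly cutoff can be expressed as a small date check. This is a sketch only: the function name is hypothetical, and it assumes "through day 7" includes day 7 itself and that the cutoff follows the calendar date, with no time-zone handling.

```python
from datetime import date


def months_to_recompute(today: date) -> list[tuple[int, int]]:
    """Return the (year, month) pairs whose scores are still recomputed today.

    Hypothetical helper; assumes day 7 is the last day on which the
    previous month is still recomputed.
    """
    current = (today.year, today.month)
    if today.day <= 7:
        # The previous month is still open; handle the January rollover.
        if today.month == 1:
            previous = (today.year - 1, 12)
        else:
            previous = (today.year, today.month - 1)
        return [previous, current]
    # After day 7 the previous month is locked in.
    return [current]


print(months_to_recompute(date(2025, 3, 7)))  # [(2025, 2), (2025, 3)]
print(months_to_recompute(date(2025, 3, 8)))  # [(2025, 3)]
```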