Five ways machine learning goes quietly wrong in a trading firm — and how to stop it.
"Most financial ML research does not fail because the models are wrong. It fails because the process around the models is wrong — invisible leakage, siloed notebooks, and unauditable black boxes." — Marcos Lopez de Prado, Advances in Financial Machine Learning
Imagine a team of researchers, each building their own predictive models in private notebooks. Every Monday a researcher starts fresh — re-engineering features that a colleague built last month, fitting a model with a CV loop copied from Stack Overflow, and generating a metric table that lives only in a Slack message. By Friday, the boulder is back at the bottom of the hill.
This is what Marcos Lopez de Prado calls the Sisyphus Paradigm: brilliant individual effort that produces nothing durable. The script in this project demonstrates five concrete fixes using Skore, a Python MLOps library, against a synthetic equity dataset.
Each section below states the problem in plain English, explains Lopez de Prado's proposed remedy, and shows exactly how Skore implements it.
Standard cross-validation randomly shuffles data before splitting it into train and test sets. For stock prices this is catastrophic: tomorrow's data ends up in today's training set.
The model silently learns from the future. It scores brilliantly in backtests and collapses in live trading. No error message, no warning — just a "time machine" hidden in the evaluation loop.
Replace random k-fold with a Purged K-Fold that respects time:
1. No shuffling. Folds must preserve chronological order. Earlier data trains the model; later data tests it.
2. Purge overlapping labels. If a label is built from a 5-day window, the 5 training samples immediately before the test set must be removed — their labels "bleed" into the test period.
3. Add an embargo gap. Even after the test window ends, block a short buffer from entering the next training fold. Autoregressive models can still "see" recent residuals without this guard.
The script creates a custom EmbargoPurgedTimeSeriesSplit splitter and passes
it directly into CrossValidationReport via the splitter= argument.
Because Skore's evaluation engine accepts any sklearn-compatible splitter, the temporal guardrails are permanently bound to the evaluation — they cannot accidentally be removed by a colleague who clones the code and "just wants a quick result."
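The three rules above can be sketched as a scikit-learn-compatible splitter. This is an illustrative reconstruction, not the script's exact code: the class name matches the article, but the walk-forward layout and the default purge/embargo widths are assumptions.

```python
import numpy as np

class EmbargoPurgedTimeSeriesSplit:
    """Walk-forward K-fold with a purge gap before each test window
    and an embargo blocking the buffer after each earlier test window."""

    def __init__(self, n_splits=5, purge=5, embargo=5):
        self.n_splits = n_splits
        self.purge = purge      # label-overlap window removed before the test set
        self.embargo = embargo  # buffer blocked after each test window

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        # n_splits + 1 chronological blocks: block 0 is the initial train set,
        # blocks 1..n_splits are successive test windows (as in TimeSeriesSplit)
        edges = np.linspace(0, n, self.n_splits + 2, dtype=int)
        for i in range(1, self.n_splits + 1):
            test_start, test_stop = edges[i], edges[i + 1]
            test_idx = np.arange(test_start, test_stop)
            # purge: drop trailing train rows whose labels bleed into the test window
            train_stop = max(0, test_start - self.purge)
            # embargo: block the short buffer right after every earlier test window
            blocked = set()
            for j in range(1, i):
                prev_end = edges[j + 1]
                blocked.update(range(prev_end, min(prev_end + self.embargo, train_stop)))
            train_idx = np.array([t for t in range(train_stop) if t not in blocked])
            yield train_idx, test_idx
```

Because the object exposes `split` and `get_n_splits`, it can be handed to any evaluator that accepts an sklearn-style splitter, which is how the guardrails become impossible to drop silently.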
With standard 5-fold random shuffle, the model appears to predict the market. With purged time-series CV, the score drops to near chance — the correct answer for a synthetic random walk.
The random k-fold ROC-AUC of 0.68 is entirely artificial — the model was trained on rows that chronologically followed the test set. The purged score of 0.53 is honest: barely above chance, as expected for a near-random price process.
ROC-AUC reported as mean across folds.
Every time a researcher wants to evaluate a model, they write the same boilerplate: a
cross-validation loop, then 6–10 separate metric calls (roc_auc_score,
accuracy_score, precision_score, …), then manual assembly into
a results dictionary, then paste into a spreadsheet.
This is not a small inconvenience. It is the main reason researchers test fewer models than they should. The overhead is so high that promising ideas get dropped after one or two evaluations.
Standardise evaluation as a reusable pipeline step, not a collection of one-off scripts. The evaluation framework should:
• Run folds in parallel, not sequentially.
• Cache predictions so metrics can be recomputed without refitting.
• Return a single summary object rather than a bag of floats.
• Allow swapping the model in one place without touching the evaluation code.
The goal: swapping RandomForest for GradientBoosting should cost the researcher exactly one changed line.
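Even without a dedicated tool, most of these requirements can be approximated with scikit-learn's `cross_validate`: parallel folds via `n_jobs`, several metrics in one call, a single results dict, and the model held in one variable. A sketch on synthetic data, not the article's script:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_validate

X, y = make_classification(n_samples=300, n_features=7, random_state=0)

model = RandomForestClassifier(random_state=0)  # swap the model here, nowhere else

results = cross_validate(
    model, X, y,
    cv=TimeSeriesSplit(n_splits=5),                 # chronological folds
    scoring=["roc_auc", "accuracy", "precision"],   # all metrics in one call
    n_jobs=-1,                                      # folds run in parallel
)
print(results["test_roc_auc"].mean())
```

What `cross_validate` does not give you is prediction caching: recomputing a metric means refitting, which is the gap the report-object approach closes.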
CrossValidationReport replaces the entire evaluation loop. Pass in the
model, data, and splitter — Skore runs the folds in parallel, caches all fold predictions,
and exposes every metric as a method on the same object.
Changing the model requires updating one variable. The rest of the code is identical.
Effort to evaluate one new model hypothesis:
| Step | Manual | Skore |
|---|---|---|
| Run CV folds | ~15 lines | 1 line |
| Compute metrics | ~10 lines | 1 line |
| Aggregate across folds | ~8 lines | built-in |
| Swap model | edit 6+ places | 1 variable |
| Total boilerplate | ~33 lines | 4 lines |
The reduction is not cosmetic — fewer lines means faster iteration, fewer copy-paste errors, and more model ideas tested per day.
Researcher A trains a good model in a Jupyter notebook called Untitled7.ipynb on their laptop. The fitted model object lives in memory. When the kernel restarts, it is gone.
Even when researchers remember to save things, each person has their own folder structure, their own naming conventions, and their own version of the dataset. Two researchers working on "the same" strategy silently diverge — and neither knows it until the results conflict in a review meeting.
Treat model artefacts as first-class, versioned objects stored in a shared, typed registry — not as files that happen to be on disk somewhere.
The registry should enforce a schema: only valid, evaluated model reports can be stored. Raw data and transformers should be stored in a complementary data layer (Lopez de Prado recommends pairing this with tools like DVC or joblib), keeping the two concerns separate.
Re-opening the project by name should give any researcher identical access to every evaluated model, with no setup required.
Project is a SQLite-backed typed registry. project.put(key, report)
only accepts EstimatorReport or CrossValidationReport objects —
the schema is enforced by the API, not by convention.
Raw artefacts (feature DataFrames, metadata) are saved with joblib alongside
the project — a clean data-layer / model-layer separation.
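The data-layer half of that separation needs nothing beyond joblib. A minimal sketch — the artefact name mirrors the article's `signals_df.pkl`, everything else is illustrative:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd

# data layer: raw feature frames live beside the project, addressed by name
rng = np.random.default_rng(0)
signals_df = pd.DataFrame({"mom_21d": rng.standard_normal(100),
                           "rvol_20d": rng.standard_normal(100)})

artefact_dir = Path(tempfile.mkdtemp()) / "artefacts"
artefact_dir.mkdir()
joblib.dump(signals_df, artefact_dir / "signals_df.pkl")

# any researcher reloads the identical frame by name — no regeneration
restored = joblib.load(artefact_dir / "signals_df.pkl")
assert restored.equals(signals_df)
```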
Before and after: what a researcher can see about a stored model.
| Question | Without Skore | With Skore |
|---|---|---|
| What data was used? | Unknown | signals_df.pkl, versioned |
| What CV strategy? | Read the notebook | In metadata dict |
| What was the AUC? | Search Slack history | project.summarize() |
| Safe for live trading? | Ask the author | "intended_use" field |
| Time to onboard | 2–3 days | ~10 minutes |
Researcher B joins the team in week 4. They want to build on Researcher A's signal work. But A's notebook is uncommented, on a personal machine, and the dataset A used was generated by a script that no longer runs because a dependency was updated.
B spends three days recreating work that already exists. This is not laziness — it is a structural failure of the artefact system. The boulder rolls back.
Every model artefact must be self-documenting at the point of storage — not in a separate wiki that falls out of date, but bound directly to the artefact itself.
Required provenance: who built it, when, with what data version, with what CV strategy, what it is intended for, and what it must not be used for.
A shared index of all stored models — queryable by metric performance — means any researcher can discover existing work in seconds rather than hours of asking around.
project.summarize() returns a metrics-annotated index of every report in
the project — no notebook required.
A metadata dict saved alongside the project acts as a machine-readable logbook entry. Any researcher can load it and immediately understand the full provenance of the work.
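A logbook entry of this kind is just a plain dict persisted next to the artefacts. A sketch with illustrative values — the field names echo the article's tables, and the metric reuses its reported purged-CV score:

```python
import json
import tempfile
from pathlib import Path

metadata = {
    "author": "researcher_a",                    # who built it (illustrative)
    "data_version": "artefacts/signals_df.pkl",  # which data snapshot
    "cv_strategy": "EmbargoPurgedTimeSeriesSplit(n_splits=5, purge=5, embargo=5)",
    "metrics": {"roc_auc_mean": 0.53},           # the article's purged-CV score
    "intended_use": "research only",
    "not_for": "live trading without compliance sign-off",
}

# persist as a machine-readable logbook entry next to the project
path = Path(tempfile.mkdtemp()) / "model_metadata.json"
path.write_text(json.dumps(metadata, indent=2))

# any researcher reloads it and gets the full provenance in one read
loaded = json.loads(path.read_text())
assert loaded["intended_use"] == "research only"
```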
Without Skore: one number buried in a dict. With Skore: a shared, versioned, queryable project that any team member can open.
| Artefact | Ad-hoc | Skore Project |
|---|---|---|
| CV Report | Lost on restart | cv_report_purged_kfold |
| Prod model | joblib file, unnamed | prod_estimator_report |
| Feature data | Regenerated each time | artefacts/signals_df.pkl |
| Discovery | Ask author directly | project.summarize() |
A model scores well in backtest. The compliance team asks: "What is this model actually relying on? Could it be betting on a spurious relationship?"
The researcher cannot answer. The feature importance array was computed once and forgotten. There is no out-of-sample audit. The model is a black box to anyone who did not write it — including, a month later, the researcher themselves.
A model that cannot explain itself cannot be deployed. In a regulated environment, this is not a soft constraint.
Generate two independent views of feature importance and present both to compliance:
• Mean Decrease in Impurity (MDI) — the model's "self-assessment" of what it relied on, computed at fit time. Fast, but can be biased toward features with many unique values.
• Permutation Importance — an external audit: shuffle each feature and measure how much model performance drops. Model-agnostic and always out-of-sample. Agreement between the two builds confidence; divergence is a warning signal.
Pair these with a ROC curve and confusion matrix for a complete compliance package.
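The two views can be reproduced with plain scikit-learn. A sketch on synthetic data — seven features to mirror the article's signal set, nothing else from the script:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=7, random_state=0)
# chronological holdout: no shuffling, later rows audit the model
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

mdi = clf.feature_importances_        # self-assessment, computed at fit time
perm = permutation_importance(        # external audit on held-out data
    clf, X_te, y_te, n_repeats=10, random_state=0
)

# compare rankings: agreement builds confidence, divergence is a warning
mdi_rank = np.argsort(mdi)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]
print(mdi_rank[:3], perm_rank[:3])
```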
Both importance views are available as one-liners on their respective report objects. Skore aggregates MDI across all 5 CV folds (mean ± std), making it far more robust than the usual single-fit estimate. All four outputs are saved as publication-ready PNG files.
MDI vs Permutation importance rankings — the two methods largely agree, which builds confidence. A divergence would be a warning signal.
| Feature | MDI rank | Perm rank | Agreement |
|---|---|---|---|
| rvol_20d | #1 | #1 | ✓ |
| zscore_20d | #2 | #3 | ~ |
| rsi_14d | #3 | #2 | ~ |
| mom_21d | #4 | #4 | ✓ |
| vol_divergence | #5 | #6 | ~ |
| mom_5d | #6 | #5 | ~ |
| mom_1d | #7 | #7 | ✓ |
Top-3 features match between methods — a green light for the compliance report.
| Use Case | Without Skore | With Skore |
|---|---|---|
| 1 — Look-Ahead Bias | Random k-fold silently creates a time machine; no safeguard against accidental shuffling | Custom splitter injected into CrossValidationReport — temporal guardrails are permanent and enforceable |
| 2 — Alpha Search | 20+ lines of metric boilerplate per experiment; researchers test fewer models because the overhead is high | CrossValidationReport runs folds in parallel; all metrics on demand from one cached object |
| 3 — Siloed Artefacts | Models live in notebook memory; datasets in personal folders with no shared schema | Typed SQLite registry; joblib for raw data; re-opening by name gives any researcher identical access |
| 4 — Collaboration | Researcher B spends days recreating Researcher A's work; no shared index of what exists | project.summarize() gives a queryable metrics index; metadata dict records full provenance |
| 5 — Opaque Risk | Feature importance computed once, never audited out-of-sample; model fails compliance review | MDI (cross-validated) + permutation (out-of-sample) + ROC + confusion matrix in 4 lines |