Five ways machine learning goes quietly wrong in a trading firm — and how to stop it.
"Most financial ML research does not fail because the models are wrong. It fails because the process around the models is wrong — invisible leakage, siloed notebooks, and unauditable black boxes." — Marcos Lopez de Prado, Advances in Financial Machine Learning
Imagine a team of researchers, each building their own predictive models in private notebooks. Every Monday a researcher starts fresh — re-engineering features that a colleague built last month, fitting a model with a CV loop copied from Stack Overflow, and generating a metric table that lives only in a Slack message. By Friday, the boulder is back at the bottom of the hill.
This is what Marcos Lopez de Prado calls the Sisyphus Paradigm: brilliant individual effort that produces nothing durable. The script in this project demonstrates five concrete fixes using Skore, a Python MLOps library, against a synthetic equity dataset.
Each section below states the problem in plain English, explains Lopez de Prado's proposed remedy, and shows exactly how Skore implements it.
Standard cross-validation randomly shuffles data before splitting it into train and test sets. For stock prices this is catastrophic: tomorrow's data ends up in today's training set.
The model silently learns from the future. It scores brilliantly in backtests and collapses in live trading. No error message, no warning — just a "time machine" hidden in the evaluation loop.
Replace random k-fold with a Purged K-Fold that respects time:
1. No shuffling. Folds must preserve chronological order. Earlier data trains the model; later data tests it.
2. Purge overlapping labels. If a label is built from a 5-day window, the 5 training samples immediately before the test set must be removed — their labels "bleed" into the test period.
3. Add an embargo gap. Even after the test window ends, block a short buffer from entering the next training fold. Autoregressive models can still "see" recent residuals without this guard.
The script creates a custom EmbargoPurgedTimeSeriesSplit splitter and passes
it directly into CrossValidationReport via the splitter= argument.
Because Skore's evaluation engine accepts any sklearn-compatible splitter, the temporal guardrails are permanently bound to the evaluation — they cannot accidentally be removed by a colleague who clones the code and "just wants a quick result."
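The three rules above can be sketched as a scikit-learn-compatible splitter. This is an illustrative reconstruction, not the script's exact code: the class name matches the article, but the walk-forward layout and the default purge/embargo widths are assumptions.

```python
import numpy as np

class EmbargoPurgedTimeSeriesSplit:
    """Walk-forward K-fold with a purge gap before each test window
    and an embargo blocking the buffer after each earlier test window."""

    def __init__(self, n_splits=5, purge=5, embargo=5):
        self.n_splits = n_splits
        self.purge = purge      # label-overlap window removed before the test set
        self.embargo = embargo  # buffer blocked after each test window

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        # n_splits + 1 chronological blocks: block 0 is the initial train set,
        # blocks 1..n_splits are successive test windows (as in TimeSeriesSplit)
        edges = np.linspace(0, n, self.n_splits + 2, dtype=int)
        for i in range(1, self.n_splits + 1):
            test_start, test_stop = edges[i], edges[i + 1]
            test_idx = np.arange(test_start, test_stop)
            # purge: drop trailing train rows whose labels bleed into the test window
            train_stop = max(0, test_start - self.purge)
            # embargo: block the short buffer right after every earlier test window
            blocked = set()
            for j in range(1, i):
                prev_end = edges[j + 1]
                blocked.update(range(prev_end, min(prev_end + self.embargo, train_stop)))
            train_idx = np.array([t for t in range(train_stop) if t not in blocked])
            yield train_idx, test_idx
```

Because the object exposes `split` and `get_n_splits`, it can be handed to any evaluator that accepts an sklearn-style splitter, which is how the guardrails become impossible to drop silently.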
With standard 5-fold random shuffle, the model appears to predict the market. With purged time-series CV, the score drops to near chance — the correct answer for a synthetic random walk.
The random k-fold ROC-AUC of 0.68 is entirely artificial — the model was trained on rows that chronologically followed the test set. The purged score of 0.53 is honest: barely above chance, as expected for a near-random price process.
ROC-AUC reported as mean across folds.
Every time a researcher wants to evaluate a model, they write the same boilerplate: a
cross-validation loop, then 6–10 separate metric calls (roc_auc_score,
accuracy_score, precision_score, …), then manual assembly into
a results dictionary, then paste into a spreadsheet.
This is not a small inconvenience. It is the main reason researchers test fewer models than they should. The overhead is so high that promising ideas get dropped after one or two evaluations.
Standardise evaluation as a reusable pipeline step, not a collection of one-off scripts. The evaluation framework should:
• Run folds in parallel, not sequentially.
• Cache predictions so metrics can be recomputed without refitting.
• Return a single summary object rather than a bag of floats.
• Allow swapping the model in one place without touching the evaluation code.
The goal: swapping RandomForest for GradientBoosting should cost the researcher exactly one changed line.
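Even without a dedicated tool, most of these requirements can be approximated with scikit-learn's `cross_validate`: parallel folds via `n_jobs`, several metrics in one call, a single results dict, and the model held in one variable. A sketch on synthetic data, not the article's script:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_validate

X, y = make_classification(n_samples=300, n_features=7, random_state=0)

model = RandomForestClassifier(random_state=0)  # swap the model here, nowhere else

results = cross_validate(
    model, X, y,
    cv=TimeSeriesSplit(n_splits=5),                 # chronological folds
    scoring=["roc_auc", "accuracy", "precision"],   # all metrics in one call
    n_jobs=-1,                                      # folds run in parallel
)
print(results["test_roc_auc"].mean())
```

What `cross_validate` does not give you is prediction caching: recomputing a metric means refitting, which is the gap the report-object approach closes.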
CrossValidationReport replaces the entire evaluation loop. Pass in the
model, data, and splitter — Skore runs the folds in parallel, caches all fold predictions,
and exposes every metric as a method on the same object.
Changing the model requires updating one variable. The rest of the code is identical.
Effort to evaluate one new model hypothesis:
| Step | Manual | Skore |
|---|---|---|
| Run CV folds | ~15 lines | 1 line |
| Compute metrics | ~10 lines | 1 line |
| Aggregate across folds | ~8 lines | built-in |
| Swap model | edit 6+ places | 1 variable |
| Total boilerplate | ~33 lines | 4 lines |
The reduction is not cosmetic — fewer lines means faster iteration, fewer copy-paste errors, and more model ideas tested per day.
Researcher A trains a good model in a Jupyter notebook called Untitled7.ipynb on their laptop. The fitted model object lives in memory. When the kernel restarts, it is gone.
Even when researchers remember to save things, each person has their own folder structure, their own naming conventions, and their own version of the dataset. Two researchers working on "the same" strategy silently diverge — and neither knows it until the results conflict in a review meeting.
Treat model artefacts as first-class, versioned objects stored in a shared, typed registry — not as files that happen to be on disk somewhere.
The registry should enforce a schema: only valid, evaluated model reports can be stored. Raw data and transformers should be stored in a complementary data layer (Lopez de Prado recommends pairing this with tools like DVC or joblib), keeping the two concerns separate.
Re-opening the project by name should give any researcher identical access to every evaluated model, with no setup required.
Project is a SQLite-backed typed registry. project.put(key, report)
only accepts EstimatorReport or CrossValidationReport objects —
the schema is enforced by the API, not by convention.
Raw artefacts (feature DataFrames, metadata) are saved with joblib alongside
the project — a clean data-layer / model-layer separation.
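The data-layer half of that separation needs nothing beyond joblib. A minimal sketch — the artefact name mirrors the article's `signals_df.pkl`, everything else is illustrative:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd

# data layer: raw feature frames live beside the project, addressed by name
rng = np.random.default_rng(0)
signals_df = pd.DataFrame({"mom_21d": rng.standard_normal(100),
                           "rvol_20d": rng.standard_normal(100)})

artefact_dir = Path(tempfile.mkdtemp()) / "artefacts"
artefact_dir.mkdir()
joblib.dump(signals_df, artefact_dir / "signals_df.pkl")

# any researcher reloads the identical frame by name — no regeneration
restored = joblib.load(artefact_dir / "signals_df.pkl")
assert restored.equals(signals_df)
```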
Before and after: what a researcher can see about a stored model.
| Question | Without Skore | With Skore |
|---|---|---|
| What data was used? | Unknown | signals_df.pkl, versioned |
| What CV strategy? | Read the notebook | In metadata dict |
| What was the AUC? | Search Slack history | project.summarize() |
| Safe for live trading? | Ask the author | "intended_use" field |
| Time to onboard | 2–3 days | ~10 minutes |
Researcher B joins the team in week 4. They want to build on Researcher A's signal work. But A's notebook is uncommented, on a personal machine, and the dataset A used was generated by a script that no longer runs because a dependency was updated.
B spends three days recreating work that already exists. This is not laziness — it is a structural failure of the artefact system. The boulder rolls back.
Every model artefact must be self-documenting at the point of storage — not in a separate wiki that falls out of date, but bound directly to the artefact itself.
Required provenance: who built it, when, with what data version, with what CV strategy, what it is intended for, and what it must not be used for.
A shared index of all stored models — queryable by metric performance — means any researcher can discover existing work in seconds rather than hours of asking around.
project.summarize() returns a metrics-annotated index of every report in
the project — no notebook required.
A metadata dict saved alongside the project acts as a machine-readable logbook entry. Any researcher can load it and immediately understand the full provenance of the work.
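A logbook entry of this kind is just a plain dict persisted next to the artefacts. A sketch with illustrative values — the field names echo the article's tables, and the metric reuses its reported purged-CV score:

```python
import json
import tempfile
from pathlib import Path

metadata = {
    "author": "researcher_a",                    # who built it (illustrative)
    "data_version": "artefacts/signals_df.pkl",  # which data snapshot
    "cv_strategy": "EmbargoPurgedTimeSeriesSplit(n_splits=5, purge=5, embargo=5)",
    "metrics": {"roc_auc_mean": 0.53},           # the article's purged-CV score
    "intended_use": "research only",
    "not_for": "live trading without compliance sign-off",
}

# persist as a machine-readable logbook entry next to the project
path = Path(tempfile.mkdtemp()) / "model_metadata.json"
path.write_text(json.dumps(metadata, indent=2))

# any researcher reloads it and gets the full provenance in one read
loaded = json.loads(path.read_text())
assert loaded["intended_use"] == "research only"
```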
Without Skore: one number buried in a dict. With Skore: a shared, versioned, queryable project that any team member can open.
| Artefact | Ad-hoc | Skore Project |
|---|---|---|
| CV Report | Lost on restart | cv_report_purged_kfold |
| Prod model | joblib file, unnamed | prod_estimator_report |
| Feature data | Regenerated each time | artefacts/signals_df.pkl |
| Discovery | Ask author directly | project.summarize() |
A model scores well in backtest. The compliance team asks: "What is this model actually relying on? Could it be betting on a spurious relationship?"
The researcher cannot answer. The feature importance array was computed once and forgotten. There is no out-of-sample audit. The model is a black box to anyone who did not write it — including, a month later, the researcher themselves.
A model that cannot explain itself cannot be deployed. In a regulated environment, this is not a soft constraint.
Generate two independent views of feature importance and present both to compliance:
• Mean Decrease in Impurity (MDI) — the model's "self-assessment" of what it relied on, computed at fit time. Fast, but can be biased toward features with many unique values.
• Permutation Importance — an external audit: shuffle each feature and measure how much model performance drops. Model-agnostic and always out-of-sample. Agreement between the two builds confidence; divergence is a warning signal.
Pair these with a ROC curve and confusion matrix for a complete compliance package.
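The two views can be reproduced with plain scikit-learn. A sketch on synthetic data — seven features to mirror the article's signal set, nothing else from the script:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=7, random_state=0)
# chronological holdout: no shuffling, later rows audit the model
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

mdi = clf.feature_importances_        # self-assessment, computed at fit time
perm = permutation_importance(        # external audit on held-out data
    clf, X_te, y_te, n_repeats=10, random_state=0
)

# compare rankings: agreement builds confidence, divergence is a warning
mdi_rank = np.argsort(mdi)[::-1]
perm_rank = np.argsort(perm.importances_mean)[::-1]
print(mdi_rank[:3], perm_rank[:3])
```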
Both importance views are available as one-liners on their respective report objects. Skore aggregates MDI across all 5 CV folds (mean ± std), making it far more robust than the usual single-fit estimate. All four outputs are saved as publication-ready PNG files.
MDI vs Permutation importance rankings — the two methods largely agree, which builds confidence. A divergence would be a warning signal.
| Feature | MDI rank | Perm rank | Agreement |
|---|---|---|---|
| rvol_20d | #1 | #1 | ✓ |
| zscore_20d | #2 | #3 | ~ |
| rsi_14d | #3 | #2 | ~ |
| mom_21d | #4 | #4 | ✓ |
| vol_divergence | #5 | #6 | ~ |
| mom_5d | #6 | #5 | ~ |
| mom_1d | #7 | #7 | ✓ |
Top-3 features match between methods — a green light for the compliance report.
| Use Case | Without Skore | With Skore |
|---|---|---|
| 1 — Look-Ahead Bias | Random k-fold silently creates a time machine; no safeguard against accidental shuffling | Custom splitter injected into CrossValidationReport — temporal guardrails are permanent and enforceable |
| 2 — Alpha Search | 20+ lines of metric boilerplate per experiment; researchers test fewer models because the overhead is high | CrossValidationReport runs folds in parallel; all metrics on demand from one cached object |
| 3 — Siloed Artefacts | Models live in notebook memory; datasets in personal folders with no shared schema | Typed SQLite registry; joblib for raw data; re-opening by name gives any researcher identical access |
| 4 — Collaboration | Researcher B spends days recreating Researcher A's work; no shared index of what exists | project.summarize() gives a queryable metrics index; metadata dict records full provenance |
| 5 — Opaque Risk | Feature importance computed once, never audited out-of-sample; model fails compliance review | MDI (cross-validated) + permutation (out-of-sample) + ROC + confusion matrix in 4 lines |