Essays·Jun 6, 2026·9 min read

Why every backtest you've ever seen is too optimistic.

Three structural reasons a crypto strategy backtest beats the live version — the universe the backtest got to pick, the costs it quietly skipped, and the fills it never had to fight for. What to look for, and what to demand before you trust a curve.

The Engine Team
Dusk Labs

Every backtest equity curve you've ever seen sloped up. That's not because the strategies were good. It's because the curves that sloped down were thrown away before anyone showed you the chart.

That's the kind sentence. The unkind sentence is that even the curves that survived selection are still wrong — wrong in the same direction, every time, by an amount that's larger than most people think. The gap between a backtest and a live run is not noise around an unbiased estimate. It is a systematic discount, and the discount is roughly the size of your edge.

This post is about why. Three structural reasons a crypto strategy backtest overstates a strategy's real return, in roughly the order they bite. None of them are exotic. All of them are present in almost every backtest screenshot posted on crypto Twitter. The point is not to be cynical about backtesting — backtesting is the only way to evaluate a systematic strategy before you risk money on it — but to be specific about which parts of the output deserve trust and which parts are theater.

If you trade Hyperliquid perpetuals or any other liquid perp venue, you've seen the genre: a smooth blue line climbing from $10k to $300k over twelve months, "30 trades, 71% win rate, max drawdown -4.8%." The line is real in the sense that the math is correct. The line is fake in the sense that it cannot be reproduced. Here is why.

The universe the backtest got to pick

The first and worst problem is survivorship in the asset list.

A backtest needs to choose which markets it trades. The honest version is: at every point in history, give the strategy access only to the perps that were tradeable on that date, with the volume and listing dates that were actually true. The dishonest version, which is what you actually see, is: take today's list of liquid perps and run the strategy on the history of all of them.

These two versions look almost identical. They are not. The dishonest version quietly assumes the strategy could have traded HYPE-PERP in early 2024, when HYPE-PERP did not yet exist. It assumes the strategy could have traded any number of perps that ran 8x and survived, while the equally numerous perps that ran 8x, blew up, and got delisted have been silently dropped from the universe. The universe the backtest got to choose from is, by construction, the set of assets that are around to be chosen now. That is not a random sample of "perps." It is the right tail.

Practical effect on Hyperliquid: this bias is usually worth 10–30 percentage points of annual return on a momentum-flavored strategy. We have measured it. Build a long-momentum strategy on a current list of fifteen perps over the last 18 months, then rebuild it letting only the perps that existed on each date into the universe, and the live-feasible curve will be meaningfully flatter than the curve you originally drew.

You cannot fix this with a "robustness check." You have to fix it by reconstructing the universe at every point in time, which is annoying and which approximately nobody on social media does.
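A minimal sketch of what "reconstructing the universe at every point in time" means in code. The `Listing` record, the `universe_on` helper, and the three symbols are hypothetical and exist only for illustration; a real version would also apply a historical volume floor, which is left out here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Listing:
    symbol: str
    listed: date                      # first day the perp was tradeable
    delisted: Optional[date] = None   # None means it is still trading


def universe_on(day: date, listings: list[Listing]) -> list[str]:
    """Symbols that were actually tradeable on `day`, in sorted order."""
    return sorted(
        l.symbol
        for l in listings
        if l.listed <= day and (l.delisted is None or day < l.delisted)
    )


# Hypothetical listing history, not real venue data.
LISTINGS = [
    Listing("OLD-PERP", date(2023, 1, 10)),
    Listing("NEW-PERP", date(2025, 3, 1)),
    Listing("DEAD-PERP", date(2023, 6, 1), delisted=date(2024, 9, 15)),
]

print(universe_on(date(2025, 6, 1), LISTINGS))  # ['NEW-PERP', 'OLD-PERP']
print(universe_on(date(2024, 1, 1), LISTINGS))  # ['DEAD-PERP', 'OLD-PERP']
```

The backtest loop then calls `universe_on` for each bar's date instead of reading one fixed list. In this toy history, the early-2024 universe contains the perp that later died and excludes the one that had not listed yet, which is exactly the difference a today's-list backtest hides.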

The costs the backtest pretended weren't there

The second problem is that most backtests model fills as if the venue were free, instant, and infinitely deep.

We have read approximately a hundred crypto backtest writeups. Roughly three include a serious treatment of fees, slippage, and funding. The rest implicitly assume:

  • Every entry fills at the mid.
  • Every exit fills at the mid.
  • Taker fees are 0 bps, or "we'll just make every trade a maker."
  • Funding payments are zero or annualize to something tiny.
  • Latency from signal to fill is zero.

Each of these is wrong by some amount, and the amounts add up. On a perp venue like Hyperliquid, a real round trip costs roughly:

  • 3–5 bps taker fee in, 3–5 bps taker fee out (the maker assumption is usually a fantasy on the volatile bars where signals fire).
  • 2–10 bps slippage depending on size and book depth, much worse on a thin perp during a vol spike.
  • 0–4 bps of funding cost over the average hold, more if you're systematically on the paying side of the funding regime.
  • Up to several bps of execution latency cost, because the signal fires on the bar close and your order fills against a price that's already moved.

Pencil in the round trip at 15 bps of fees and slippage on the good days and 40+ bps on the bad days. A strategy that earns 30 bps per trade in a frictionless backtest keeps about 15 bps on a good day, loses 10 bps or more on a bad one, and nets roughly nothing once enough of its trades land on bad days. We have run that experiment more than once and it has gone exactly that way every time.
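The arithmetic behind those figures, as a small sketch. The function names are illustrative, the 3 bps latency figure is an assumption standing in for "several bps", and the slippage, funding, and latency terms are treated as round-trip totals because that is how the list above quotes them.

```python
def round_trip_cost_bps(taker_fee_bps, slippage_bps, funding_bps, latency_bps):
    """Total cost of one round trip, in basis points of notional.

    The taker fee is paid twice (entry and exit). The other three terms are
    treated as round-trip totals.
    """
    return 2 * taker_fee_bps + slippage_bps + funding_bps + latency_bps


def break_even_bad_share(gross_bps, good_cost_bps, bad_cost_bps):
    """Share of trades paying bad-day cost at which the average net edge is zero."""
    return (gross_bps - good_cost_bps) / (bad_cost_bps - good_cost_bps)


# Low and high ends of the quoted ranges (latency of 3 bps is an assumption).
low = round_trip_cost_bps(3, 2, 0, 0)
high = round_trip_cost_bps(5, 10, 4, 3)
print(low, high)  # 8 27

gross = 30  # bps per trade in the frictionless backtest
print(gross - 15)  # 15
print(gross - 40)  # -10
print(break_even_bad_share(gross, 15, 40))  # 0.6
```

With the pencilled figures of 15 and 40 bps, a 30 bps gross edge averages out to zero when 60% of trades pay the bad-day cost. A signal that fires mostly on volatile bars can plausibly clear that share.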

The two failure modes here have a specific shape. The first is the strategy that turns 65% wins on a +60 bps target into 51% wins on a +25 bps net target — same edge, smaller margin, much higher sensitivity to a bad week. The second is the strategy that needed a 0.012% per hour funding edge to be profitable and is now barely breakeven after you charge it the funding it actually had to pay on the wrong side of the book during half its holds. Cost realism kills more strategies in production than alpha decay does.
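To make the funding case concrete, here is a hedged sketch of charging funding hour by hour. It assumes the usual perp convention that a positive rate means longs pay shorts, and it holds notional constant over the hold, which a real backtest would not do (funding is charged on the position's value at each interval). `funding_pnl` is a hypothetical helper, not a venue API.

```python
def funding_pnl(side, notional_usd, hourly_rates):
    """Cumulative funding P&L in USD over a hold.

    side: +1 for long, -1 for short.
    hourly_rates: one funding rate per hour held, as a fraction (0.00012 = 0.012%).
    Assumed convention: positive rate means longs pay and shorts receive.
    """
    return sum(-side * notional_usd * rate for rate in hourly_rates)


one_day = [0.00012] * 24  # 0.012% per hour, held for 24 hours

# A short collects the rate all day; a long pays the same amount.
print(round(funding_pnl(-1, 10_000, one_day), 2))  # 28.8
print(round(funding_pnl(+1, 10_000, one_day), 2))  # -28.8

# Same short, but the regime flips halfway through the hold.
flipped = [0.00012] * 12 + [-0.00012] * 12
net = funding_pnl(-1, 10_000, flipped)  # the funding edge cancels out
```

On a $10,000 position, 0.012% per hour is about 29 bps a day in one direction. A strategy that counted on collecting it and spent half its holds on the paying side collects roughly nothing.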

The fills the backtester wrote, the fills the venue gives you

The third problem is the most subtle and the most expensive: backtests model fills as a function of price, when in real life fills are a function of order flow.

A backtest says: at 09:14:23, the close of the 1-minute bar was $3,847; your strategy generated an entry; you bought 0.42 ETH at $3,847. The venue, on the same input, says: at 09:14:23, the close of the 1-minute bar was $3,847, but you sent a market order with a 12-bp slippage budget, the book had moved 4 bps in the half-second between bar close and order arrival, your market order swept the first two levels and filled at $3,849 average, the maker side of those levels was your own previous resting order that you'd forgotten about, and by the way another agent on the same signal source filled half a second earlier and pushed the book.

None of those frictions exist in the backtest. They are not noise. They are a structural transfer from your wallet to other people's wallets, and the people on the receiving end have whole desks devoted to ensuring the transfer keeps happening.
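A small sketch of the difference between filling at a price and filling against a book. The two ask levels below are hypothetical, chosen so that the 0.42 ETH order from the example sweeps both and averages $3,849; `sweep_asks` is an illustrative helper, not a venue API.

```python
def sweep_asks(asks, qty):
    """Average fill price for a market buy that walks the ask side.

    asks: list of (price, size) tuples, best price first.
    qty:  order size in the same units as `size`.
    """
    remaining = qty
    cost = 0.0
    for price, size in asks:
        take = min(remaining, size)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 1e-12:
        raise ValueError("not enough depth to fill the order")
    return cost / qty


bar_close = 3847.0
# Hypothetical book half a second after the bar close, already moved up.
asks = [(3848.5, 0.21), (3849.5, 0.30)]

avg_fill = sweep_asks(asks, 0.42)
slippage_bps = (avg_fill - bar_close) / bar_close * 1e4
print(round(avg_fill, 2), round(slippage_bps, 1))  # 3849.0 5.2
```

The backtest books this trade at 3,847. The book-walking version pays about 5 bps more on entry alone, and the gap widens with order size and with how far the book has moved by the time the order arrives.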

The mistake is not "your slippage model was 5 bps when it should have been 8." The mistake is treating slippage as a fixed cost rather than as a function of the regime your signal fires in. Signals that fire on vol expansions are precisely the signals whose fills are worst. Signals that fire on quiet bars are the ones where fills are cheapest, but quiet bars rarely produce edge. The very regimes where your strategy thinks it sees the most opportunity are the regimes where the venue charges you the most for executing.

A useful exercise: take your backtest and recompute it with a slippage model that scales as a function of contemporaneous 5-minute realized volatility, with a multiplier of 3x in the top vol quintile. The curve will get noticeably flatter. That model is still optimistic — real venues do worse — but it is closer to the truth than a flat 5 bps charge.
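One way that exercise could look, as a sketch. Only the 3x multiplier for the top quintile comes from the text; the lower rungs of the ladder, the 5 bps base, and the sample volatilities are placeholder assumptions. The quintiles here are ranked over the whole sample for brevity, which is itself a mild look-ahead; a stricter version would rank each bar against a trailing window.

```python
# Slippage multiplier per volatility quintile, calmest first.
# Only the final 3.0 is from the text; the rest are assumptions.
MULTIPLIERS = [1.0, 1.0, 1.5, 2.0, 3.0]


def vol_quintiles(vols):
    """Quintile index (0 = calmest, 4 = most volatile) for each observation."""
    order = sorted(range(len(vols)), key=lambda i: vols[i])
    quintile = [0] * len(vols)
    for rank, i in enumerate(order):
        quintile[i] = min(4, rank * 5 // len(vols))
    return quintile


def scaled_slippage_bps(vols, base_bps=5.0):
    """Per-trade slippage charge: base cost times the quintile multiplier."""
    return [base_bps * MULTIPLIERS[q] for q in vol_quintiles(vols)]


# Hypothetical 5-minute realized vol at each of ten entries (arbitrary units).
vols = [0.8, 0.3, 2.5, 0.5, 1.1, 4.0, 0.4, 1.6, 0.9, 0.6]

charges = scaled_slippage_bps(vols)
print(charges)       # [7.5, 5.0, 15.0, 5.0, 10.0, 15.0, 5.0, 10.0, 7.5, 5.0]
print(sum(charges))  # 85.0
print(5.0 * len(vols))  # 50.0
```

On this toy sample the scaled model charges 85 bps across ten trades where a flat 5 bps model charges 50. If the signal fires disproportionately in the top quintile, the gap is larger still.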

What an honest backtest looks like

Allowing for all three reasons, what should you actually expect from a "good" backtest?

  • A live-feasible universe reconstructed at every point in time, with listings, delistings, and volume floors applied historically rather than retroactively.
  • Fees charged at the venue's actual schedule, with the assumption that a meaningful share of trades will be taker fills, not maker fills.
  • Slippage modeled as a function of size and contemporaneous volatility, not as a flat constant.
  • Funding payments accumulated on each open position, hour by hour, on the side the position is actually on.
  • An execution lag between signal and order — even one or two seconds — to reflect the gap between bar close and order arrival.
  • A walk-forward split, not a single train/test cut, so you can see whether the parameters chosen on the last 12 months still earn money in the 3 months after.
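The walk-forward item in the list above can be sketched as a window generator. This version works in whole-month indices to keep the date arithmetic out of the way; the function name and the choice to step forward by one test window at a time are assumptions.

```python
def walk_forward_windows(n_months, train_months=12, test_months=3):
    """Rolling (train_start, train_end, test_start, test_end) month indices.

    Ranges are half-open: train covers [train_start, train_end) and the test
    window starts exactly where training stops, so no test data leaks into fit.
    """
    windows = []
    start = 0
    while start + train_months + test_months <= n_months:
        train_end = start + train_months
        windows.append((start, train_end, train_end, train_end + test_months))
        start += test_months  # slide forward by one test window
    return windows


for window in walk_forward_windows(24):
    print(window)
# (0, 12, 12, 15)
# (3, 15, 15, 18)
# (6, 18, 18, 21)
# (9, 21, 21, 24)
```

Parameters are refit on each 12-month training range and scored only on the 3 months that follow. Stitching the four test ranges together gives an out-of-sample curve, which is the one worth looking at.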

When you do all of this, the equity curve will sag visibly compared to the original. It will sometimes look bad. That is the point. A backtest that sags under honest costs and still earns money is a strategy worth running. A backtest that looks great until you turn off the assumptions is a story you've been telling yourself.

Our own internal target: we won't deploy a strategy into the marketplace whose live-realistic backtest doesn't keep at least 60% of the headline equity-curve return. Most candidates don't pass.

What Engine does about this

The reason backtests get away with their fictions is that usually no one gets to compare a backtest with the live decision log of the same strategy. The vendor shows you one. The user gets the other. The two never meet.

Engine flips this. The decision log records every read, every rule check, every order, every fill — with the rule that triggered it, the price the agent thought it was getting, and the price it actually got. That log is the same shape as the backtest output. You can lay one on top of the other, by trade, and see exactly where the backtest overstated. Most of the time the gap is fees plus slippage plus the trade the agent decided to skip in live that the backtest cheerfully took.
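A rough sketch of what a by-trade comparison could look like. The record layout (`side`, `fill`) and the trade ids are hypothetical and are not Engine's actual log schema; the point is only the shape of the join.

```python
def compare_fills(backtest, live):
    """Per-trade gap between backtest and live fills.

    backtest, live: dicts mapping trade id -> {"side": "buy" | "sell", "fill": price}.
    Returns (gaps, skipped): gaps maps trade id -> bps by which the live fill
    was worse than the backtest fill (positive = worse), and skipped lists
    trades the backtest took that never appear in the live log.
    """
    gaps, skipped = {}, []
    for trade_id, bt in backtest.items():
        lv = live.get(trade_id)
        if lv is None:
            skipped.append(trade_id)
            continue
        sign = 1 if bt["side"] == "buy" else -1  # paying more on a buy is worse
        gaps[trade_id] = sign * (lv["fill"] - bt["fill"]) / bt["fill"] * 1e4
    return gaps, skipped


# Hypothetical records, for illustration only.
backtest = {
    "t1": {"side": "buy", "fill": 3847.0},
    "t2": {"side": "sell", "fill": 100.0},
    "t3": {"side": "buy", "fill": 50.0},
}
live = {
    "t1": {"side": "buy", "fill": 3849.0},
    "t2": {"side": "sell", "fill": 99.9},
}

gaps, skipped = compare_fills(backtest, live)
print({k: round(v, 1) for k, v in gaps.items()})  # {'t1': 5.2, 't2': 10.0}
print(skipped)  # ['t3']
```

Summing the per-trade gaps and adding back the P&L of the skipped trades gives a first decomposition of where the backtest overstated.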

When the gap is small, the strategy is real. When the gap is large, the backtest was selling you a story. There is no third option, and there is no way to know which one you have without the side-by-side comparison.

If you're evaluating a backtest from anywhere — ours, a friend's, a marketplace listing — ask for the live decision log of the strategy running the same rules. If the answer is "we don't keep one" or "it's proprietary," you have your answer about the backtest too.

The honest curve is the one that survived honest costs. Everything else is a brochure.
