We ran the audit on our own wallet-reputation gate. It was filtering against alpha, not for it.

Point-in-time test of whether 'follow informed wallets' actually works in the Polymarket cheap band. The result: our wave-34 gate produced a 9% hit rate. The wallets it filtered OUT hit 54%. We dropped the gate.

May 14, 2026 · ohh.bet research · 7 min read · Tags: methodology, calibration, alpha, self-audit, verdict

ohh.bet has been pitching two claims as one. They are not the same claim, and they don't deserve the same evidence bar.

Claim 1: structural. Cheap-band BUYs on Polymarket have positive expected value because of the (1−P)/P payoff structure. At a 34% hit rate buying at price 0.30, the math gives 0.34 × 2.333 − 0.66 = +13.3% per trade. That number is arithmetic: it holds for any buyer who realizes that hit rate, whoever they are.
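The arithmetic can be checked with a minimal sketch. The function name `expected_roi` is illustrative, and the calculation assumes a binary market that pays $1 per winning share, with fees and slippage ignored.

```python
def expected_roi(hit_rate: float, price: float) -> float:
    """Expected net return per dollar staked on a binary share bought at `price`.

    A win pays (1 - price) / price per dollar staked; a loss forfeits the stake.
    Fees and slippage are ignored (assumption).
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    payoff = (1.0 - price) / price
    return hit_rate * payoff - (1.0 - hit_rate)


roi = expected_roi(hit_rate=0.34, price=0.30)
print(f"{roi:+.1%}")  # +13.3%
```

The expression simplifies to hit_rate / price − 1, so the break-even hit rate equals the entry price. The structural claim is therefore a claim that realized hit rates in the band exceed the prices paid.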

Claim 2: informed-capital. Beyond the structural advantage, wallets with prior track records hit cheap-band BUYs at meaningfully higher rates than first-timers or known losers. This is the actual edge claim, the one that justifies "ohh.bet is a tracker for informed capital." We built the wave-34 gate (resolved_trades ≥ 10 AND pnl_per_dollar ≥ 0.05) on the theory that this is true.

We ran the audit. Claim 2 is wrong. The gate filtered against alpha, not for it.

Methodology

For every historical $25K+ BUY in the live band (0.20–0.30) on a resolved Polymarket market, we looked up the trigger wallet's reputation as of the moment of the trade — not the current snapshot. Then bucketed:

  • A_proven — wallet had ≥10 resolved trades AND ≥5% PnL/$ at trade time
  • B_unproven — wallet had rep history but fell below the threshold
  • C_no_rep — wallet had no prior resolved trades at all
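The three buckets above can be sketched as a small classifier. The function name `classify_cohort` and the use of `None` for "no PnL history" are assumptions for illustration; the thresholds are the wave-34 gate values stated earlier.

```python
from typing import Optional

MIN_RESOLVED_TRADES = 10    # wave-34 gate: resolved_trades >= 10
MIN_PNL_PER_DOLLAR = 0.05   # wave-34 gate: pnl_per_dollar >= 0.05


def classify_cohort(resolved_trades: int, pnl_per_dollar: Optional[float]) -> str:
    """Bucket a trigger wallet by its reputation as of trade time."""
    # No prior resolved trades at all: nothing to judge the wallet on.
    if resolved_trades == 0 or pnl_per_dollar is None:
        return "C_no_rep"
    # Both conditions must hold to pass the gate.
    if resolved_trades >= MIN_RESOLVED_TRADES and pnl_per_dollar >= MIN_PNL_PER_DOLLAR:
        return "A_proven"
    # Some history, but below the threshold on either dimension.
    return "B_unproven"
```

Note that B_unproven is a mixed bucket under this rule: it holds both thin-sample winners and deep-sample losers.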

Point-in-time matters: a wallet with 50 resolved trades today might have had 0 last September. Snapshot rep would have leaked future information back into the historical classification. We built wallet_rep_events (~80K rows) — a per-event running snapshot of (resolved_trades, pnl_per_dollar) that includes only trades whose outcomes were knowable by that moment. The wallet_reputation_at(addr, ts) lookup gives us point-in-time rep for any historical timestamp.

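The approximation: we treat markets_cache.end_date + 24h as the outcome-knowable time per market. Polymarket settlement typically lags end_date by hours, so 24h was chosen as a conservative buffer. The wave-67b note below replaces it with actual resolution times, which showed the buffer was too short for some markets.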
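A minimal in-memory sketch of the point-in-time idea follows. It is not the production implementation: the real wallet_rep_events schema is not shown in this post, so the `ResolvedTrade` fields, `build_rep_events`, and the extra `events` argument to `wallet_reputation_at` are assumptions. It uses the end_date + 24h approximation for the knowable time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

KNOWABLE_LAG = timedelta(hours=24)  # end_date + 24h approximation

RepEvent = Tuple[datetime, int, float]  # (knowable_at, resolved_trades, pnl_per_dollar)


@dataclass(frozen=True)
class ResolvedTrade:
    wallet: str
    market_end: datetime  # end_date of the traded market
    cost: float           # dollars staked
    pnl: float            # realized profit or loss in dollars


def build_rep_events(trades: List[ResolvedTrade]) -> Dict[str, List[RepEvent]]:
    """Per wallet, a time-sorted running snapshot of reputation."""
    events: Dict[str, List[RepEvent]] = {}
    totals: Dict[str, Tuple[int, float, float]] = {}
    # The lag is constant, so sorting by market_end also sorts by knowable time.
    for t in sorted(trades, key=lambda t: t.market_end):
        count, cost, pnl = totals.get(t.wallet, (0, 0.0, 0.0))
        count, cost, pnl = count + 1, cost + t.cost, pnl + t.pnl
        totals[t.wallet] = (count, cost, pnl)
        ratio = pnl / cost if cost else 0.0
        events.setdefault(t.wallet, []).append((t.market_end + KNOWABLE_LAG, count, ratio))
    return events


def wallet_reputation_at(events: Dict[str, List[RepEvent]], addr: str, ts: datetime) -> Tuple[int, float]:
    """Reputation from outcomes knowable at or before `ts`; (0, 0.0) if none."""
    history = events.get(addr, [])
    idx = bisect_right([e[0] for e in history], ts)
    if idx == 0:
        return (0, 0.0)
    _, count, ratio = history[idx - 1]
    return (count, ratio)
```

The key property is that a lookup never sees an outcome whose knowable time is later than the queried timestamp, which is what keeps future information out of the historical classification.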

The verdict

| Cohort | n | Hits | Hit rate | Net ROI/trade |
|---|---|---|---|---|
| A_proven (passes wave-34 gate) | 11 | 1 | 9.1% | −68% |
| B_unproven (rep, below threshold) | 26 | 14 | 53.8% | +99% |
| C_no_rep (no prior resolved) | 11 | 5 | 45.5% | +76% |

The A_proven cohort — the wallets our entire wave-34 gate was designed to surface — performed 45 percentage points worse on hit rate than the cohort we filtered OUT. The wave-34 gate was filtering AGAINST cheap-band alpha.

Sample size caveat: n=11 across 8 distinct wallets in A_proven. No single wallet dominates (the worst offender has 3 trades; the rest have 1 or 2). The direction is striking: A_proven hits at 9% against a population average around 45–55%. But the Wilson 95% CI on A_proven is [1.6%, 37.7%], which overlaps the interval of every other cohort, so call this directional, not statistically locked.
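For readers who want to reproduce the interval, here is a minimal sketch of the standard Wilson score interval. The function name `wilson_interval` is illustrative, and z = 1.96 is the usual normal quantile for a 95% interval.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion hits / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return (max(0.0, center - half), min(1.0, center + half))


low, high = wilson_interval(hits=1, n=11)
print(f"[{low:.1%}, {high:.1%}]")  # [1.6%, 37.7%]
```

Unlike the naive p ± 1.96·SE interval, Wilson stays inside [0, 1] and behaves sensibly at n this small, which is why it suits cohorts of 2 to 26 trades.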

Wave 62 follow-up: the U-shape

The binary A/B/C split is hiding structure. We re-ran the audit with six rep tiers on the same data (cheap band 0.20–0.30, point-in-time rep):

| Rep tier | PnL/$ range | n | Wallets | Hit % (95% CI) |
|---|---|---|---|---|
| No prior rep | first whale trade | 11 | 10 | 45.5% [21.3, 72.0] |
| Known loser | < 0% | 18 | | 44.4% [24.6, 66.3] |
| Mild winner | 0–5% | 4 | | 75.0% [30.1, 95.4] |
| Good (wave-34 band) | 5–20% | 4 | | 25.0% [4.6, 69.9] |
| High PnL/$ | 20–50% | 2 | | 0.0% [0.0, 65.8] |
| Elite | ≥ 50% | 8 | | 37.5% [13.7, 69.4] |

Numbers tightened by wave-67b: backfilled markets_cache.resolved_at from signal_outcomes (n=1,195 markets), replacing the end_date + 24h approximation for the rows the audit actually reads. mild_winner moved from 86% to 75%; some wallets reclassified into lower tiers because at true trade time they had less rep than the 24h inflation suggested. U-shape persists; the 86% headline was approximate.

Hit rate is non-monotonic in prior PnL/$. Among wallets with a winning record, two tiers hit above the band's highest break-even rate of 30% on point estimate: wallets with modest tracks (0–5%) and elite wallets (≥50%). The 5–50% middle is where the failure lives. The wave-34 gate threshold (≥5%, ≥10 trades) accidentally landed on the worst slice. The finding wasn't "rep is anti-predictive"; it was "the specific threshold band captures regression-to-mean wallets entering an out-of-distribution price range."
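The six-tier split can be sketched as below. The tier boundaries come from the table; treating each range as lower-inclusive and upper-exclusive is an assumption, as are the function name `rep_tier` and every tier label other than mild_winner.

```python
from typing import Optional


def rep_tier(resolved_trades: int, pnl_per_dollar: Optional[float]) -> str:
    """Assign a wallet to one of six reputation tiers by point-in-time PnL/$."""
    if resolved_trades == 0 or pnl_per_dollar is None:
        return "no_prior_rep"
    if pnl_per_dollar < 0.0:
        return "known_loser"
    if pnl_per_dollar < 0.05:
        return "mild_winner"
    if pnl_per_dollar < 0.20:
        return "good"        # the band the wave-34 gate threshold opens onto
    if pnl_per_dollar < 0.50:
        return "high_pnl"
    return "elite"
```

With only 2 to 18 trades per tier, where exactly the boundaries fall can move a trade between tiers and shift a hit rate by many points, so the edge handling is worth pinning down explicitly.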

Split further by sample depth (rep × sample-depth grid on /transparency): mild_winner × 100+ resolved trades hits 80% (4/5 wins). Deep-sample wallets with calibrated PnL records win in the cheap band; mid-sample wallets with comfortable winning records lose.

Does it generalize? Yes. Wider band 0.20–0.40: A_proven n=41, 29.3% hit, −16% ROI vs B_unproven n=93, 41.9% hit, +30% ROI. The binary split fails outside the live band too.

What about wallet concentration? Before the wave-67b backfill (the run behind the 86% figure), mild_winner had 7 trades from 4 wallets, so at least one wallet appears more than once. Hot-streak risk is real. We've added a distinct-wallet column to every cohort row so this stays visible.

The headline (gate filters against alpha) holds. The actionable change is that the threshold itself was wrong, not the premise. A future revisit could test a non-monotonic gate that fires on (PnL/$ < 5%) OR (PnL/$ ≥ 50%, deep sample) — but n is too thin to commit to that today.
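Purely as an illustration of what such a gate could look like, and not something we run today, here is a sketch. The function name is hypothetical, and the "deep sample" cutoff of 100 resolved trades is an assumption borrowed from the sample-depth split above. How to treat wallets with no history at all is left open.

```python
DEEP_SAMPLE_MIN_TRADES = 100  # assumption: reuses the "100+ resolved trades" cut


def non_monotonic_gate(resolved_trades: int, pnl_per_dollar: float) -> bool:
    """Fire on (PnL/$ < 5%) OR (PnL/$ >= 50% with a deep sample)."""
    low_rep = pnl_per_dollar < 0.05
    elite_deep = pnl_per_dollar >= 0.50 and resolved_trades >= DEEP_SAMPLE_MIN_TRADES
    return low_rep or elite_deep
```

The gate deliberately excludes the 5–50% middle, which is the slice the tier table shows performing worst.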

Why this is happening

Two hypotheses worth naming:

1. Proven wallets built their PnL on other bet types. Looking at the A_proven roster: every losing wallet has +20% to +60% historical PnL/$. They're not bad traders. But their edge was in favorites at 0.7+ or specific knowledge-based markets, not cheap-band longshots. When proven wallets dip into the cheap band, they're betting outside their domain. The cheap band attracts them precisely when they have informational disadvantage — they think they know better, the market disagrees, the market is usually right.

2. The (1−P)/P math is the actual edge. B_unproven's +99% net ROI per trade isn't because B_unproven wallets are smart. It's because the cheap-band payoff gives any buyer positive expectancy whenever the realized hit rate exceeds the entry price, and in this sample it did by a wide margin. The structural claim was always the real one. We just framed it as wallet skill.

What we changed

Wave 61, today:

  1. Dropped allowed_gate_reasons:['reputation'] from paper portfolios A and B. They were systematically filtering toward the worst cohort. Now they auto-trade every cheap-band fire regardless of trigger-wallet rep.

  2. Removed the rep gate from the cheap_conviction detector. It now fires on every $25K+ BUY in [0.20, 0.30]. The wallet rep is still captured in detector_context.gateReason as a display tag (so the UI can show "this came from a proven wallet"), but firing is no longer conditional on it.

  3. Reframed /about and /methodology. Claim 2 has been tested and failed at the current sample. Claim 1 (structural cheap-band math) stands as the actual thesis. ohh.bet is no longer pitching as an "informed capital tracker"; it's a cheap-band activity monitor with audit-grade calibration receipts.
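The detector change in item 2 can be sketched as follows. The `Trade` fields, the `rep_tag` name, and the shape of the returned dictionary are assumptions for illustration; only the $25K size floor, the [0.20, 0.30] band, and the detector_context.gateReason tag come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SIZE_USD = 25_000
BAND_LOW, BAND_HIGH = 0.20, 0.30


@dataclass(frozen=True)
class Trade:
    side: str        # "BUY" or "SELL"
    price: float     # entry price of the share
    size_usd: float  # notional size in dollars
    rep_tag: str     # point-in-time rep label for the trigger wallet (hypothetical field)


def cheap_conviction(trade: Trade) -> Optional[dict]:
    """Fire on every $25K+ BUY in the band; wallet rep is a tag, not a condition."""
    if trade.side != "BUY" or trade.size_usd < MIN_SIZE_USD:
        return None
    if not (BAND_LOW <= trade.price <= BAND_HIGH):
        return None
    # Rep is recorded for display only and never blocks the fire.
    return {
        "detector": "cheap_conviction",
        "detector_context": {"gateReason": trade.rep_tag},
    }
```

Keeping the tag while dropping the condition means later audits can still split outcomes by reputation without the detector having pre-filtered the sample.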

What this means for the product

The honest framing: the cheap band has positive math expectancy because of (1−P)/P payoffs. We surface every fire at scale, classify against our methodology, and publish every outcome. Whether a wallet is "informed" or "unproven" tells us nothing about cheap-band edge — possibly the opposite.

The cheap_conviction detector now fires roughly 2-3× more often (no rep filter). Paper portfolios A and B will absorb more trades. The thesis under test is the structural one, with the wallet identity discarded as a filter.

We'll re-run this audit weekly as more cheap-band BUYs accumulate. If at n=100+ the result still shows wallet rep as anti-predictive, the verdict is locked in. If it inverts as sample size grows, we'll publish that too.

Concrete next checkpoints (n thresholds at which we'll re-evaluate):

  • A_proven n ≥ 30 with Wilson upper bound still below baseline point estimate → verdict locked in
  • mild_winner n ≥ 20 with Wilson lower bound above 50% → consider an inverse-gate paper portfolio fired on this tier alone
  • mild_winner × deep-sample n ≥ 10 with Wilson lower bound above 60% → strongest candidate for a "calibrated-deep" gate
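The checkpoints above reduce to two kinds of test on a Wilson bound. A minimal sketch, with hypothetical function names and the baseline hit rate passed in as a parameter, might look like this:

```python
from math import sqrt
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for hits / n (95% by default)."""
    p = hits / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def upper_bound_below(hits: int, n: int, min_n: int, ceiling: float) -> bool:
    """True once n >= min_n and the Wilson upper bound sits below `ceiling`."""
    if n < min_n:
        return False
    return wilson_interval(hits, n)[1] < ceiling


def lower_bound_above(hits: int, n: int, min_n: int, floor: float) -> bool:
    """True once n >= min_n and the Wilson lower bound sits above `floor`."""
    if n < min_n:
        return False
    return wilson_interval(hits, n)[0] > floor


# Checkpoint 1: A_proven, n >= 30, upper bound below the baseline point estimate.
# Today's sample (1 hit in 11) fails on n alone.
print(upper_bound_below(hits=1, n=11, min_n=30, ceiling=0.50))  # False
```

The second and third checkpoints would use `lower_bound_above` with (min_n=20, floor=0.50) and (min_n=10, floor=0.60) respectively.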

The audit infrastructure earned its keep

The point of pre-committing to publish whatever the data showed — laid out in the prior draft of this post — was to bind us to results we wouldn't have wanted to see. The result is exactly that: a 12-month-old framing premise turned out to be backward, and the receipt is here.

This is what calibration in public is supposed to look like. Not "we were right all along," but "we were wrong, here's the number, here's what we changed."


See the live cohort table on /transparency (refreshes weekly), where receipts for every number live. Run your own backtest with any rule combination at /backtest. Methodology details are on /methodology. Spot a problem with our reasoning? Drop a note via the feedback form.
