Statistical significance, actually explained.
P-values without the jargon. When to use Frequentist vs Bayesian. When to stop a test early without lying to yourself. A practical primer written for CRO teams — not statisticians.
- ▸A p-value does not tell you the probability your variant is better. It tells you something much more boring.
- ▸95% confidence is a convention, not a law. Pick a threshold that matches the cost of being wrong.
- ▸Bayesian methods feel more intuitive but require clearer thinking about priors.
- ▸Peeking at tests mid-flight inflates false-positive rates massively. Use sequential tests or pre-commit to sample size.
What it actually means.
A p-value is the probability of seeing your observed result (or something more extreme) IF the null hypothesis were true. It is NOT the probability that your variant is better. This distinction matters because almost everyone gets it wrong — and the wrong interpretation leads to overconfident decisions.
- ▸Correct: 'If the variant had no real effect, I'd see this result 3% of the time.'
- ▸Incorrect: 'There's a 97% chance the variant is better.'
- ▸The 5% threshold (α=0.05) is a convention. Not a law.
One practical framing: a p-value is a measure of surprise. Low p-values mean 'this result would be surprising if nothing were happening'. It doesn't prove anything about your variant — it just tells you the data looks unusual.
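To make the "measure of surprise" framing concrete, here is a minimal stdlib-only sketch of the two-proportion z-test most A/B calculators run under the hood. The conversion counts are hypothetical:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value: the chance of a gap at least this large
    IF the null hypothesis (no real difference) were true."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # shared rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                            # standardized 'surprise'
    return math.erfc(abs(z) / math.sqrt(2))         # two-sided normal tail

# Hypothetical counts: 500/10,000 control vs 560/10,000 variant
p = two_proportion_p_value(500, 10_000, 560, 10_000)
```

With these numbers p lands just above 0.05: the data looks somewhat unusual under the null, but not past the conventional threshold. Note what the function never returns: the probability the variant is better.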
Pick one. Stick with it.
Both Frequentist and Bayesian methods produce reliable results if used correctly. The key failure mode is mixing them — running a Frequentist test and then interpreting the result in Bayesian terms ("there's a 97% chance the variant is better"). Pick one framework, train your team on it, and commit.
- ▸Frequentist: p-values, confidence intervals. Well-understood, rigid stopping rules.
- ▸Bayesian: posterior probabilities, credible intervals. More intuitive, requires thought about priors.
- ▸For teams new to stats: Bayesian methods usually feel more natural.
- ▸For teams running many tests in parallel: Frequentist with sequential correction (SPRT or similar) scales better.
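The Bayesian "posterior probability the variant is better" that feels so natural can be sketched with a Beta-Binomial model and Monte Carlo sampling. The counts are hypothetical, and the uniform Beta(1, 1) priors are an assumption — real priors deserve the thought mentioned above:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1),
                   draws=100_000, seed=42):
    """Monte Carlo estimate of P(variant rate > control rate).

    Beta-Binomial model: each arm's posterior is
    Beta(prior_alpha + conversions, prior_beta + non-conversions).
    """
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 500/10,000 control vs 560/10,000 variant
p_better = prob_b_beats_a(500, 10_000, 560, 10_000)
```

Here `p_better` is a statement about the variant itself — the thing a p-value is not — which is exactly why the two outputs must never be conflated.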
Pre-commit. Always.
The single biggest source of false positives in A/B testing is 'peeking' — checking the test mid-flight and stopping when significance appears. This inflates the true false-positive rate from 5% to 30%+ over multiple peeks. Pre-commit to a sample size before the test runs.
- ▸Derive sample size from MDE (minimum detectable effect), baseline CR, and desired power (usually 80%).
- ▸Use a calculator — don't eyeball. Evan Miller's is the classic.
- ▸If you must peek: use sequential tests (SPRT, always-valid p-values) to prevent the inflation.
- ▸Bayesian methods handle peeking better — but only if you don't misuse them as 'stop when posterior hits 95%'.
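The sample-size derivation in the first bullet can be sketched with the standard normal-approximation formula. A calculator like Evan Miller's may differ slightly at the margins; the 5% baseline CR and 10% relative MDE are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, relative_mde, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)           # smallest lift worth detecting
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Illustrative inputs: 5% baseline conversion rate, 10% relative MDE
n = sample_size_per_variant(0.05, 0.10)
```

The punchline is the denominator: halving the MDE roughly quadruples the required sample — which is why pre-committing to this number, rather than eyeballing it, matters.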
The metrics that prevent Pyrrhic wins.
A test can 'win' on the primary metric but hurt the business. Guardrail metrics prevent this. Every test should monitor at least one guardrail — usually engagement, retention, or support volume — and halt if it moves materially in the wrong direction.
- ▸Engagement (session duration, pageviews per session).
- ▸Retention (7-day, 30-day return rate).
- ▸Support volume (ticket rate per user).
- ▸Any guardrail regression >15% should halt the test for review.
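The 15% halt rule above reduces to a one-line comparison; a minimal sketch, assuming a 'higher is better' guardrail such as retention:

```python
def guardrail_breached(control_value, variant_value, max_regression=0.15):
    """True if a 'higher is better' guardrail (engagement, retention)
    dropped more than max_regression relative to control.
    For 'lower is better' metrics (ticket rate), swap the arguments."""
    if control_value == 0:
        return False                          # nothing to regress from
    relative_change = (variant_value - control_value) / control_value
    return relative_change < -max_regression

# 7-day retention: 40% on control, 32% on the variant -> 20% regression
halt = guardrail_breached(0.40, 0.32)         # True: halt the test for review
```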
When it's fine. When it's fraud.
Stopping tests early is sometimes OK — and sometimes an unconscious form of p-hacking. The distinction is whether you pre-committed to a stopping rule. If you did, early stopping is fine. If you didn't, you're probably picking the moment that matches the answer you want.
- ▸Pre-committed stopping rule (e.g., 'stop if SPRT crosses threshold'): FINE.
- ▸Peeking and stopping when you see 'significance': NOT FINE.
- ▸Stopping for business reasons ('we need to ship'): FINE, but acknowledge the statistical weakness.
- ▸Stopping because the result matches your hypothesis: UNCONSCIOUSLY FRAUDULENT.
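The inflation that peeking causes is easy to verify by simulation: run A/A tests (both arms share the same true rate, so any 'winner' is a false positive), peek after every batch, and stop at the first nominally significant z. The batch sizes below are illustrative; expect the measured rate to land well above the nominal 5%:

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(peeks=8, n_per_peek=400, alpha=0.05,
                                trials=400, base_rate=0.05, seed=7):
    """A/A simulation: stop the moment a peek looks 'significant'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # ~1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            # Both arms draw from the SAME true conversion rate.
            conv_a += sum(rng.random() < base_rate for _ in range(n_per_peek))
            conv_b += sum(rng.random() < base_rate for _ in range(n_per_peek))
            n += n_per_peek
            p_pool = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
            if se == 0:
                continue
            z = (conv_b / n - conv_a / n) / se
            if abs(z) > z_crit:                     # 'significant' -> ship it
                false_positives += 1
                break
    return false_positives / trials

fpr = peeking_false_positive_rate()
```

With a pre-committed rule you would test once, at the final sample size, and the rate would sit back near 5%; add more peeks and the inflation only grows.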
Let the math happen. Read the result.
Flight Deck runs Bayesian stats on every experiment by default. Auto-promotion at 95% posterior. No p-hacking, no peeking, no decision meetings.