Statistical significance, actually explained.
P-values without the jargon. When to use Frequentist vs Bayesian. When to stop a test early without lying to yourself. A practical primer written for CRO teams — not statisticians.
- ▸A p-value does not tell you the probability your variant is better. It tells you something much more boring.
- ▸95% confidence is a convention, not a law. Pick a threshold that matches the cost of being wrong.
- ▸Bayesian methods feel more intuitive but require clearer thinking about priors.
- ▸Peeking at tests mid-flight inflates false-positive rates massively. Use sequential tests or pre-commit to sample size.
What it actually means.
A p-value is the probability of seeing your observed result (or something more extreme) IF the null hypothesis were true. It is NOT the probability that your variant is better. This distinction matters because almost everyone gets it wrong — and the wrong interpretation leads to overconfident decisions.
- ▸Correct: 'If the variant had no real effect, I'd see this result 3% of the time.'
- ▸Incorrect: 'There's a 97% chance the variant is better.'
- ▸The 5% threshold (α=0.05) is a convention. Not a law.
One practical framing: a p-value is a measure of surprise. Low p-values mean 'this result would be surprising if nothing were happening'. It doesn't prove anything about your variant — it just tells you the data looks unusual.
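To make the "measure of surprise" framing concrete, here is a minimal stdlib-only sketch of the two-proportion z-test most A/B calculators run under the hood. The conversion counts are hypothetical:

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value: the chance of a gap at least this large
    IF the null hypothesis (no real difference) were true."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # shared rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se                            # standardized 'surprise'
    return math.erfc(abs(z) / math.sqrt(2))         # two-sided normal tail

# Hypothetical counts: 500/10,000 control vs 560/10,000 variant
p = two_proportion_p_value(500, 10_000, 560, 10_000)
```

With these numbers p lands just above 0.05: the data looks somewhat unusual under the null, but not past the conventional threshold. Note what the function never returns: the probability the variant is better.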
Pick one. Stick with it.
Both Frequentist and Bayesian methods produce reliable results if used correctly. The key failure mode is mixing them — running a Frequentist test and then interpreting the result in Bayesian terms ("there's a 97% chance the variant is better"). Pick one framework, train your team on it, and commit.
- ▸Frequentist: p-values, confidence intervals. Well-understood, rigid stopping rules.
- ▸Bayesian: posterior probabilities, credible intervals. More intuitive, requires thought about priors.
- ▸For teams new to stats: Bayesian methods usually feel more natural.
- ▸For teams running many tests in parallel: Frequentist with sequential correction (SPRT or similar) scales better.
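The Bayesian "posterior probability the variant is better" that feels so natural can be sketched with a Beta-Binomial model and Monte Carlo sampling. The counts are hypothetical, and the uniform Beta(1, 1) priors are an assumption — real priors deserve the thought mentioned above:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1),
                   draws=100_000, seed=42):
    """Monte Carlo estimate of P(variant rate > control rate).

    Beta-Binomial model: each arm's posterior is
    Beta(prior_alpha + conversions, prior_beta + non-conversions).
    """
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 500/10,000 control vs 560/10,000 variant
p_better = prob_b_beats_a(500, 10_000, 560, 10_000)
```

Here `p_better` is a statement about the variant itself — the thing a p-value is not — which is exactly why the two outputs must never be conflated.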
Pre-commit. Always.
The single biggest source of false positives in A/B testing is 'peeking' — checking the test mid-flight and stopping when significance appears. This inflates the true false-positive rate from 5% to 30%+ over multiple peeks. Pre-commit to a sample size before the test runs.
- ▸Derive sample size from MDE (minimum detectable effect), baseline CR, and desired power (usually 80%).
- ▸Use a calculator — don't eyeball. Evan Miller's is the classic.
- ▸If you must peek: use sequential tests (SPRT, always-valid p-values) to prevent the inflation.
- ▸Bayesian methods handle peeking better — but only if you don't misuse them as 'stop when posterior hits 95%'.
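The sample-size derivation in the first bullet can be sketched with the standard normal-approximation formula. A calculator like Evan Miller's may differ slightly at the margins; the 5% baseline CR and 10% relative MDE are illustrative:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cr, relative_mde, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)           # smallest lift worth detecting
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Illustrative inputs: 5% baseline conversion rate, 10% relative MDE
n = sample_size_per_variant(0.05, 0.10)
```

The punchline is the denominator: halving the MDE roughly quadruples the required sample — which is why pre-committing to this number, rather than eyeballing it, matters.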
The metrics that prevent Pyrrhic wins.
A test can 'win' on the primary metric but hurt the business. Guardrail metrics prevent this. Every test should monitor at least one guardrail — usually engagement, retention, or support volume — and halt if it moves materially in the wrong direction.
- ▸Engagement (session duration, pageviews per session).
- ▸Retention (7-day, 30-day return rate).
- ▸Support volume (ticket rate per user).
- ▸Any guardrail regression >15% should halt the test for review.
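The 15% halt rule above reduces to a one-line comparison; a minimal sketch, assuming a 'higher is better' guardrail such as retention:

```python
def guardrail_breached(control_value, variant_value, max_regression=0.15):
    """True if a 'higher is better' guardrail (engagement, retention)
    dropped more than max_regression relative to control.
    For 'lower is better' metrics (ticket rate), swap the arguments."""
    if control_value == 0:
        return False                          # nothing to regress from
    relative_change = (variant_value - control_value) / control_value
    return relative_change < -max_regression

# 7-day retention: 40% on control, 32% on the variant -> 20% regression
halt = guardrail_breached(0.40, 0.32)         # True: halt the test for review
```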
When it's fine. When it's fraud.
Stopping tests early is sometimes OK — and sometimes an unconscious form of p-hacking. The distinction is whether you pre-committed to a stopping rule. If you did, early stopping is fine. If you didn't, you're probably picking the moment that matches the answer you want.
- ▸Pre-committed stopping rule (e.g., 'stop if SPRT crosses threshold'): FINE.
- ▸Peeking and stopping when you see 'significance': NOT FINE.
- ▸Stopping for business reasons ('we need to ship'): FINE, but acknowledge the statistical weakness.
- ▸Stopping because the result matches your hypothesis: UNCONSCIOUSLY FRAUDULENT.
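The inflation that peeking causes is easy to verify by simulation: run A/A tests (both arms share the same true rate, so any 'winner' is a false positive), peek after every batch, and stop at the first nominally significant z. The batch sizes below are illustrative; expect the measured rate to land well above the nominal 5%:

```python
import math
import random
from statistics import NormalDist

def peeking_false_positive_rate(peeks=8, n_per_peek=400, alpha=0.05,
                                trials=400, base_rate=0.05, seed=7):
    """A/A simulation: stop the moment a peek looks 'significant'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # ~1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            # Both arms draw from the SAME true conversion rate.
            conv_a += sum(rng.random() < base_rate for _ in range(n_per_peek))
            conv_b += sum(rng.random() < base_rate for _ in range(n_per_peek))
            n += n_per_peek
            p_pool = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
            if se == 0:
                continue
            z = (conv_b / n - conv_a / n) / se
            if abs(z) > z_crit:                     # 'significant' -> ship it
                false_positives += 1
                break
    return false_positives / trials

fpr = peeking_false_positive_rate()
```

With a pre-committed rule you would test once, at the final sample size, and the rate would sit back near 5%; add more peeks and the inflation only grows.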
Let the math happen. Read the result.
Flight Deck runs Bayesian stats on every experiment by default. Auto-promotion at 95% posterior. No p-hacking, no peeking, no decision meetings.