
Statistical Significance in A/B Testing (Explained)

CRO Audits Team

“Version B won with 95% statistical significance!”

This phrase gets thrown around constantly in CRO. But what does it actually mean? And why does it matter?

This guide explains statistical significance in plain language, why it’s essential for valid A/B testing, and the common mistakes that invalidate results.

What Statistical Significance Actually Means

Statistical significance answers one question: Could this result have occurred by random chance?

When you run an A/B test and B outperforms A, there are two possibilities:

  1. B is genuinely better
  2. The difference is random luck

Statistical significance tells you how confident you can be that it’s #1, not #2.

The 95% Confidence Standard

When we say a result is “statistically significant at 95% confidence,” we mean:

If there were truly NO difference between A and B, there’s only a 5% probability we’d see a difference this large (or larger) by random chance.

It’s NOT saying “95% chance B is better.” It’s saying “95% confident this isn’t just noise.”

An Analogy: Coin Flipping

Imagine testing whether a coin is fair by flipping it 20 times.

Result: 12 heads, 8 tails

Is the coin biased? Probably not—a fair coin produces 12 or more heads out of 20 about 25% of the time. This result is NOT statistically significant.

Different result: 18 heads, 2 tails

Now we’re suspicious. Getting 18+ heads with a fair coin happens less than 0.1% of the time. This IS statistically significant—something’s probably unusual about the coin.

A/B testing works the same way. We’re asking: “Is this difference unusual enough that it’s probably real?”
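
To make the analogy concrete, here is a minimal Python sketch (scipy is an assumed dependency) that computes how often a fair coin produces results at least this extreme:

```python
# A minimal sketch that reproduces the coin-flip probabilities above
# with the binomial distribution.
from scipy.stats import binom

n_flips, p_fair = 20, 0.5

# Probability of 12 or more heads with a fair coin: sf(k) = P(X > k)
print(f"P(>=12 heads): {binom.sf(11, n_flips, p_fair):.2f}")    # ~0.25, unremarkable
# Probability of 18 or more heads with a fair coin
print(f"P(>=18 heads): {binom.sf(17, n_flips, p_fair):.5f}")    # ~0.0002, very unlikely
```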

P-Values: Putting a Number on Chance

P-values quantify statistical significance. They represent the probability of seeing a result at least as extreme as yours if there were truly no difference.

P-value interpretation:

  • p = 0.50: 50% chance of this result with no real difference (not significant)
  • p = 0.10: 10% chance (not significant by standard criteria)
  • p = 0.05: 5% chance (significant at 95% confidence)
  • p = 0.01: 1% chance (significant at 99% confidence)

The threshold:

  • p < 0.05 → Statistically significant (industry standard)
  • p ≥ 0.05 → Not statistically significant
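
To see where a p-value comes from in practice, here is a sketch of a two-sided two-proportion z-test, one common way conversion-rate tests are evaluated. The visitor and conversion counts are made-up illustrations, not output from any particular tool:

```python
# A sketch of a two-sided two-proportion z-test for conversion rates.
# The counts below are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 300, 10_000   # control: 3.0% conversion
conv_b, n_b = 360, 10_000   # variant: 3.6% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under "no difference"
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                         # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.3f}")   # z ~ 2.37, p ~ 0.018 -> significant at 95%
```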

What P-Values Are NOT

P-value is NOT:

  • The probability that B is better than A
  • The probability that your result is correct
  • The size of the improvement

A p-value of 0.01 doesn’t mean “99% chance B is better.” It means “if there were truly no difference, there’s only a 1% chance we’d see a result this extreme.”

Confidence Intervals: The Range of Truth

Confidence intervals provide more information than p-values alone. They show the range where the true effect likely falls.

Example: “Version B improved conversion rate by 15% (95% CI: 8% to 22%)”

This means we’re 95% confident the true improvement is somewhere between 8% and 22%.
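
Here is a sketch of the matching calculation for a 95% confidence interval on the absolute difference in conversion rates, using the normal approximation (the counts are again illustrative assumptions):

```python
# A sketch of a 95% confidence interval for the absolute lift,
# using the normal approximation. Counts are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 300, 10_000   # control: 3.0%
conv_b, n_b = 360, 10_000   # variant: 3.6%

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Unpooled standard error (pooling is only appropriate under the null hypothesis)
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = norm.ppf(0.975)                      # ~1.96 for 95% confidence

low, high = diff - z_crit * se, diff + z_crit * se
print(f"Absolute lift: {diff:.2%} (95% CI: {low:.2%} to {high:.2%})")
# An interval that excludes zero corresponds to significance at 95%.
```

Converting an absolute-difference interval into a relative-lift interval, as quoted in the example above, takes an extra step (for instance, the delta method); most testing tools report it directly.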

Interpreting Confidence Intervals

  • Narrow interval (5% to 7%): Precise estimate, typically the result of a large sample
  • Wide interval (-2% to 25%): Imprecise estimate; more data needed
  • Interval including zero (-5% to 10%): Not significant—the true effect might be negative, zero, or positive

Why Intervals Are Useful

P-values give binary answers (significant or not). Confidence intervals show:

  • Best estimate of the effect
  • Range of plausible values
  • Precision of the estimate

A “significant” result with 95% CI of 0.1% to 30% is very different from one with 95% CI of 14% to 16%.

Statistical Power: Finding Real Effects

Power is the probability of detecting a real effect if one exists.

Standard power: 80%

This means: if B truly is better by the effect size you designed the test to detect (say, a 10% lift), you have an 80% chance of detecting it (and a 20% chance of missing it).

Why Power Matters

Low power means:

  • More false negatives (missing real improvements)
  • Wasted tests that conclude “no difference” when there is one
  • Slower optimization progress

Increasing Power

Power depends on:

  • Sample size: Larger samples = higher power
  • Effect size: Larger effects are easier to detect
  • Variance: Less noisy data = higher power
  • Significance level: Stricter thresholds (99% vs. 95% confidence) = lower power

The main lever is sample size. More traffic = more power.
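
As a sketch of how sample size is usually planned, here is the standard formula for comparing two proportions; the baseline rate and minimum detectable effect below are assumptions:

```python
# A sketch of the standard sample-size formula for comparing two proportions.
# Baseline rate and minimum detectable effect (MDE) are assumptions.
from math import ceil, sqrt
from scipy.stats import norm

baseline = 0.030            # assumed control conversion rate
mde_rel = 0.10              # minimum detectable effect: a 10% relative lift
alpha, power = 0.05, 0.80   # 95% confidence, 80% power

p1 = baseline
p2 = baseline * (1 + mde_rel)

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a two-sided 5% test
z_beta = norm.ppf(power)            # ~0.84 for 80% power

n_per_arm = ((z_alpha + z_beta) ** 2 *
             (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2
print(f"Visitors needed per variant: {ceil(n_per_arm):,}")   # roughly 53,000 here
```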

The Two Types of Errors

Type I Error (False Positive)

Concluding there’s a difference when there isn’t one.

Controlled by: Significance level (typically 5%)
Consequence: Implementing changes that don’t actually work

Type II Error (False Negative)

Concluding there’s no difference when there actually is one.

Controlled by: Statistical power (typically 80%)
Consequence: Missing real improvements

|                             | Real Effect Exists | No Real Effect  |
|-----------------------------|--------------------|-----------------|
| Test says “significant”     | ✅ Correct          | ❌ Type I Error  |
| Test says “not significant” | ❌ Type II Error    | ✅ Correct       |

Both errors have costs. Testing methodology balances between them.

Why People Get This Wrong

Mistake 1: Stopping Tests Early

You check your test after 3 days. B is “winning at 95% significance.” Ship it!

The problem: This isn’t actually 95% confidence.

When you check repeatedly and stop when you see significance, you’re essentially doing multiple comparisons. Each check has a chance of false positive. The cumulative false positive rate can exceed 30%.

Real world: The “significant” result you saw at day 3 might reverse by day 14.

Solution: Calculate sample size before testing. Run to completion regardless of interim results.
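
A small simulation sketch shows why peeking inflates false positives. Both variants share the same true conversion rate, yet stopping at the first “significant” interim check yields far more than 5% false alarms (the parameters below are illustrative assumptions):

```python
# A simulation sketch of the peeking problem: A and B have the SAME true
# rate, yet stopping at the first "significant" peek inflates false positives.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
true_rate = 0.03
checks = [2_000, 4_000, 6_000, 8_000, 10_000]   # visitors per arm at each peek
n_sims = 2_000
z_crit = norm.ppf(0.975)

def z_stat(conv_a, conv_b, n):
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

false_positives = 0
for _ in range(n_sims):
    a = rng.random(checks[-1]) < true_rate   # per-visitor conversions, arm A
    b = rng.random(checks[-1]) < true_rate   # per-visitor conversions, arm B
    for n in checks:
        if abs(z_stat(a[:n].sum(), b[:n].sum(), n)) > z_crit:
            false_positives += 1             # "ship it!" -- but it's pure noise
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")
# Typically 10-15% with five peeks; daily peeking over weeks pushes it far higher.
```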

Mistake 2: Confusing Significance and Importance

A test shows B improves conversion by 0.1% with p = 0.001.

Statistically significant? Yes. Practically important? Maybe not.

The distinction:

  • Statistical significance: Is it real?
  • Practical significance: Does it matter?

With enough data, tiny differences become statistically significant. That doesn’t mean they’re worth implementing.
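
A quick back-of-the-envelope sketch (the rates and traffic are assumptions) shows how a tiny lift becomes statistically significant once the sample is large enough:

```python
# A sketch of how a tiny lift becomes "significant" with enough traffic.
# Rates and sample size are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

p_a, p_b = 0.030, 0.031      # a 0.1 percentage-point absolute lift
n = 700_000                  # visitors per arm

p_pool = (p_a + p_b) / 2
se = sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p_b - p_a) / se
print(f"p = {2 * norm.sf(z):.4f}")   # far below 0.05, so statistically significant
# Whether a 0.1-point lift is worth implementing is a business question,
# not a statistical one.
```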

Mistake 3: Ignoring Power

Your test ran for 2 weeks and showed no significant difference. You conclude the change doesn’t work.

The problem: With a small sample, you might have only 30% power. There’s a 70% chance you’d miss a real 10% improvement.

Solution: Calculate required sample size for desired power BEFORE testing. If you can’t reach it, acknowledge the limitation.
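
Here is a sketch of an after-the-fact power check: given the sample you actually collected, how likely were you to detect the effect you cared about? The rates and counts are assumptions:

```python
# A sketch of an after-the-fact power check using the normal approximation.
# Rates and sample size are illustrative assumptions.
from math import sqrt
from scipy.stats import norm

n_per_arm = 15_000
p1 = 0.030                  # baseline conversion rate
p2 = p1 * 1.10              # the real effect we hoped to detect (+10% relative)
alpha = 0.05

se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
power = norm.cdf(abs(p2 - p1) / se - norm.ppf(1 - alpha / 2))

print(f"Power: {power:.0%}")   # roughly 30% here, so "no difference" means very little
```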

Mistake 4: P-Hacking

You run a test. Not significant overall. But when you look at mobile users only, it’s significant! Ship it for mobile!

The problem: When you test multiple segments, some will show “significance” by chance. This is called p-hacking or multiple comparisons.

The math: Test 20 segments, and even with no real effect, you’ll likely see “significance” in at least one.

Solution: Pre-specify your segments. Apply corrections for multiple comparisons (like Bonferroni). Or treat segment findings as hypotheses for future tests.
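
The arithmetic behind this, as a small sketch (assuming the segment tests are roughly independent):

```python
# A sketch of the multiple-comparisons math: with 20 segments and no real
# effect anywhere, at least one false "win" at p < 0.05 is quite likely
# (assuming roughly independent tests).
n_segments = 20
alpha = 0.05

p_at_least_one_false_win = 1 - (1 - alpha) ** n_segments
print(f"P(at least one false 'win'): {p_at_least_one_false_win:.0%}")   # ~64%

# Bonferroni correction: demand p < alpha / n_segments in each segment
print(f"Bonferroni threshold per segment: {alpha / n_segments:.4f}")    # 0.0025
```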

Mistake 5: Treating 95% as Magic

Why 95%? It’s a convention, not a law of physics.

Sometimes 90% is fine (exploratory tests, low-stakes decisions). Sometimes you need 99% (major business decisions, one-way doors).

Match the confidence level to the decision stakes.

Reading A/B Test Results

When your testing tool reports results, look for:

Essential Metrics

  1. Conversion rates: A vs. B actual performance
  2. Relative difference: B is X% better/worse than A
  3. Statistical significance: Confidence level or p-value
  4. Confidence interval: Range of likely true effect

Example Result

“Version B: 3.5% conversion (vs. 3.0% control)
Relative lift: +16.7%
Statistical significance: 95%
95% confidence interval: +8% to +26%”

Interpretation: B likely improves conversion. The true lift is probably between 8% and 26%, with 16.7% our best estimate. We’re 95% confident this isn’t random noise.

Red Flags

  • No confidence interval reported
  • Significance claimed with very small samples
  • Multiple variants tested without adjustment
  • Test stopped early at first sign of significance

Practical Guidelines

Before Testing

  1. Calculate sample size for your desired power (80%) and MDE (minimum detectable effect)
  2. Determine test duration based on traffic
  3. Pre-specify primary metric and any segments
  4. Commit to running the full test

During Testing

  1. Monitor for technical issues only
  2. Don’t peek at statistical results daily
  3. Don’t stop early even if results look good
  4. Don’t add variants mid-test

After Testing

  1. Check significance at predetermined endpoint
  2. Review confidence intervals for practical importance
  3. Check segments for consistency (flag for future testing if inconsistent)
  4. Document everything including non-significant results

When 95% Isn’t Appropriate

Use Lower Confidence (90%) When:

  • Testing is exploratory
  • Changes are easily reversible
  • Stakes are low
  • You need faster learning cycles

Use Higher Confidence (99%) When:

  • Major business decisions
  • Hard-to-reverse changes
  • High implementation cost
  • Leadership needs extra certainty

Summary

Statistical significance tells you whether an observed difference is likely real or just random chance. Getting it right requires:

  1. Adequate sample size before concluding anything
  2. Proper test duration without early stopping
  3. Understanding what p-values mean (and don’t mean)
  4. Considering practical significance alongside statistical
  5. Pre-specifying your analysis to avoid p-hacking

The math exists to protect you from making decisions based on noise. Respect it.

Ready to Improve Your Conversions?

Get a comprehensive CRO audit with actionable insights you can implement right away.

Request Your Audit — $2,500
