
A/B Test Calculator | Statistical Significance, Sample Size & Bayesian Analysis

Run A/B tests with both frequentist (z-test, chi-square, p-value) and Bayesian (Beta distribution) analyses. Calculate required sample size, minimum detectable effect, test duration, and the probability that variant B truly beats the control.



What Is the A/B Test Calculator?

An A/B test (split test) is a controlled experiment that compares two versions of a page, email, ad, or feature to determine which one drives better outcomes. This calculator supports both the analysis phase (reading your test results) and the planning phase (deciding how long to run the test).

  • Analyze Results tab: Enter raw visitor and conversion counts for control (A) and variant (B) to get statistical significance, Z-score, chi-square, p-value, and Bayesian probability B beats A.
  • Plan Test tab: Enter your baseline conversion rate, minimum detectable effect (MDE), desired statistical power, and daily traffic to calculate required sample size and test duration.
  • Relative uplift: The percentage improvement of variant B over control A. For example, a lift from 3% to 3.6% CVR is a 20% relative uplift (0.6 ÷ 3), or an absolute uplift of 0.6 percentage points.
  • Expected loss: The Bayesian expected loss for choosing B estimates how much conversion rate you would forgo, on average, in the cases where B is actually worse than A. A small expected loss suggests that shipping B carries little downside risk, even if the test has not yet reached 95% significance.
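The difference between relative and absolute uplift is easy to check by hand. The following is a minimal sketch; the function name `uplift` is illustrative and not part of the calculator.

```python
def uplift(rate_a, rate_b):
    """Return (absolute uplift in percentage points, relative uplift in percent)."""
    absolute_pp = (rate_b - rate_a) * 100
    relative_pct = (rate_b - rate_a) / rate_a * 100
    return absolute_pp, relative_pct

# The 3% -> 3.6% case from the bullet above.
abs_pp, rel_pct = uplift(0.03, 0.036)
print(f"absolute: {abs_pp:.1f} pp, relative: {rel_pct:.0f}%")
# prints "absolute: 0.6 pp, relative: 20%"
```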

Formula

This calculator uses two complementary statistical frameworks: frequentist hypothesis testing and Bayesian probability estimation.

1. Z-Test for Two Proportions

z = (p_B − p_A) / √[p̄(1−p̄)(1/n_A + 1/n_B)]

p̄ = (conversions_A + conversions_B) / (n_A + n_B) is the pooled conversion rate. The two-tailed p-value comes from the standard normal CDF, and the result is significant at 95% confidence when p < 0.05.
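The z-test above can be sketched in a few lines of standard-library Python. This is an illustration of the formula, not the calculator's own code; the function name `two_proportion_z_test` is hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value under the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 175 of 5,000 control visitors converted vs 220 of 5,000 variant visitors.
z, p = two_proportion_z_test(175, 5000, 220, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Using `math.erfc` avoids any third-party dependency for the normal tail probability.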

2. Chi-Square Test

X² = Σ(O−E)²/E (2×2 contingency table)

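Compares observed and expected counts in a 2×2 table (converted vs not converted, for A and B). Without a continuity correction, X² equals z² from the pooled two-proportion z-test, so both tests give the same p-value.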
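As a rough illustration, the statistic can be computed directly from the four cell counts. The sketch below assumes no continuity correction (an assumption; the calculator's exact variant is not stated), and the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    observed = [
        [conv_a, n_a - conv_a],  # control: converted, not converted
        [conv_b, n_b - conv_b],  # variant: converted, not converted
    ]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

chi2 = chi_square_2x2(175, 5000, 220, 5000)
# With 1 degree of freedom, the p-value is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, sqrt(chi2) = {math.sqrt(chi2):.2f}, p = {p_value:.4f}")
```

The square root of the statistic can be compared against the absolute z-score from the z-test to confirm the two approaches agree.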

3. Bayesian (Beta Distribution)

A ~ Beta(conversions_A + 1, non-conv_A + 1)
B ~ Beta(conversions_B + 1, non-conv_B + 1)

P(B>A) estimated via 10,000 Monte Carlo samples. Expected loss if choosing B = E[max(p_A − p_B, 0)].
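A minimal Monte Carlo sketch of this estimate, using only the standard library, might look like the following. The function name `bayesian_ab` and the fixed seed are assumptions made for reproducibility; they are not taken from the calculator.

```python
import random

def bayesian_ab(conv_a, n_a, conv_b, n_b, draws=10_000, seed=42):
    """Estimate P(B > A) and the expected loss of choosing B.

    Each arm's posterior is Beta(conversions + 1, non-conversions + 1),
    which corresponds to a flat Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    wins = 0
    loss_sum = 0.0
    for _ in range(draws):
        p_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        p_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        if p_b > p_a:
            wins += 1
        # Loss is incurred only in draws where A is actually better than B.
        loss_sum += max(p_a - p_b, 0.0)
    return wins / draws, loss_sum / draws

prob_b_wins, expected_loss = bayesian_ab(175, 5000, 220, 5000)
print(f"P(B > A) = {prob_b_wins:.3f}, expected loss = {expected_loss:.6f}")
```

Because the estimate is sampled, results vary slightly between seeds; with 10,000 draws the probability is typically stable to about two decimal places.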

4. Sample Size (Planning)

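n = 2 × (z_α/2 + z_β)² × p(1−p) / δ²

n = required sample size per variant, p = baseline conversion rate, δ = MDE × baseline rate (the absolute difference to detect). The factor of 2 reflects that both variants contribute sampling variance. z_α/2 = 1.96 (95% confidence), z_β = 0.842 (80% power) or 1.282 (90% power).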
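The planning calculation can be sketched as below. Note one assumption: this sketch uses the two-group form with a leading factor of 2 (each variant contributes its own sampling variance), which is the form whose output lands close to the per-variant reference table on this page. The function name `sample_size_per_variant` is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_relative, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a two-sided test of two proportions.

    Assumes equal allocation and uses the baseline variance p(1 - p) for both arms.
    """
    delta = baseline * mde_relative               # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * baseline * (1 - baseline) / delta ** 2
    return math.ceil(n)

# 3% baseline CVR, 15% relative MDE, 95% confidence, 80% power.
print(sample_size_per_variant(0.03, 0.15))
```

Rounding up with `math.ceil` is a deliberate choice: rounding down would leave the test slightly underpowered.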

Frequentist vs Bayesian: The frequentist p-value answers "if there were no real difference, how likely is data at least this extreme?" The Bayesian probability directly answers "given this data, what is the probability that B is actually better than A?", which is the question most practitioners want answered.

How to Use

  1. Select the Analyze Results tab to evaluate a completed or running test, or the Plan Test tab to determine the required sample size before launching.

  2. Analyze: enter total visitors and conversion counts for both control (A) and variant (B). Conversions must not exceed visitors.

  3. Review the significance verdict, Z-score, and p-value. Green means the result is statistically significant at 95% confidence (p < 0.05).

  4. Check the relative uplift percentage, which shows the practical impact rather than just statistical significance.

  5. Use the Bayesian probability bar to see the direct probability that B outperforms A given the data.

  6. Plan: enter your current baseline CVR (from Google Analytics or a similar tool), your minimum detectable effect (MDE) as a relative percentage change, and your daily traffic volume.

  7. Select 80% power for standard tests or 90% for high-stakes decisions such as pricing or checkout changes.

  8. Use the required sample size and duration output to schedule your test start and end dates, and always run for at least 2 full weeks.

Example Calculation

Example 1 | Analyze: E-commerce checkout button test

Control visitors: 5,000
Control conversions: 175 (3.5% CVR)
Variant visitors: 5,000
Variant conversions: 220 (4.4% CVR)
Relative uplift: +25.7%
Z-score: 2.31
p-value: 0.021
Significant?: Yes (p < 0.05)
P(B > A): ~99%

The variant wins at the 95% level. With p ≈ 0.021 and roughly a 99% Bayesian probability that B beats A, you can ship the variant with reasonable confidence.

Example 2 | Plan: Landing page headline test

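Baseline CVR: 2.8%
Min detectable effect: 15% relative
Power: 80%
Confidence: 95%
Daily traffic: 800
Sample per variant: ~24,200
Total sample: ~48,400
Test duration: ~61 days

A 15% relative improvement means detecting 2.8% → 3.22% CVR, an absolute difference of 0.42 percentage points. That requires roughly 24,200 visitors per variant, in line with the Sample Size Reference Table (about 21,500 per variant at a 3% baseline and 15% MDE). At 800 daily visitors split across both variants, a clean result takes about 61 days.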

Understanding A/B Testing

Frequentist vs Bayesian: When to Use Which

Criterion | Frequentist (p-value) | Bayesian (Beta distribution)
Question answered | Is the data surprising under H₀? | What is the probability B > A?
Stopping rule | Must set sample size in advance | Flexible (with caution)
Interpretability | Counterintuitive | Intuitive
Business decision | Binary: significant / not | Continuous: probability + expected loss
Small samples | Unreliable | Handles well via conjugate prior

Common A/B Testing Mistakes

  • Peeking: Checking results daily and stopping at significance. This inflates Type I error rate from 5% to 20%+ for long tests.
  • Multiple comparisons: Testing 5 variants at once without a correction such as Bonferroni raises the chance of at least one false positive to about 23% (1 − 0.95⁵), even when there are no real differences.
  • Short tests: Running tests for less than one full business cycle (2 weeks minimum). Day-of-week effects can create spurious results in 3–5 day tests.
  • Low traffic: Testing with fewer visitors than the required sample gives an underpowered test. It misses real effects unless they are very large, and the significant results it does produce are more likely to be false positives or exaggerated.
  • Testing too much simultaneously: Running overlapping experiments without proper controls can create interaction effects that invalidate both tests.
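The multiple-comparisons point can be made concrete with a short calculation. The sketch below assumes the comparisons are independent, which is a simplification; the function names are illustrative.

```python
def family_wise_error_rate(num_comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons

def bonferroni_alpha(num_comparisons, alpha=0.05):
    """Per-comparison significance threshold under a Bonferroni correction."""
    return alpha / num_comparisons

for k in (1, 3, 5, 10):
    print(f"{k} comparisons: FWER = {family_wise_error_rate(k):.3f}, "
          f"Bonferroni threshold = {bonferroni_alpha(k):.4f}")
```

With a Bonferroni correction, each variant must clear the stricter per-comparison threshold, which keeps the overall false positive rate at or below the original alpha.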

What Makes a Good A/B Test Hypothesis

A testable hypothesis should follow the format: "Changing X to Y will increase Z because of reason W." Vague tests ("let's try a different headline") produce results that are hard to generalize. A good hypothesis: "Changing the CTA from 'Submit' to 'Get My Free Quote' will increase form submissions because it communicates value rather than effort."

Sample Size Reference Table

Baseline CVR | MDE 10% | MDE 15% | MDE 20% | MDE 30%
1% | ~147,000 | ~65,000 | ~37,000 | ~16,500
2% | ~73,000 | ~32,500 | ~18,300 | ~8,100
3% | ~48,000 | ~21,500 | ~12,100 | ~5,400
5% | ~28,500 | ~12,700 | ~7,100 | ~3,200
10% | ~13,500 | ~6,000 | ~3,400 | ~1,500

Per-variant sample sizes at 95% confidence, 80% power.

Rule of thumb for test duration: Required days = Total sample size ÷ (Daily traffic × % allocated to test). Always round up and add a full business cycle buffer. Never stop a test on a Monday morning based on weekend data.
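The duration rule of thumb translates directly into code. This is a minimal sketch; the function name `required_days` and the example numbers are illustrative, and the business-cycle buffer is left for you to add on top.

```python
import math

def required_days(total_sample, daily_traffic, allocation=1.0):
    """Days needed to reach the total sample, rounded up.

    allocation is the fraction of daily traffic entered into the test (0-1].
    """
    return math.ceil(total_sample / (daily_traffic * allocation))

# 10,000 total visitors needed, 800 visitors per day.
print(required_days(10_000, 800))        # all traffic in the test
print(required_days(10_000, 800, 0.5))   # only half the traffic in the test
```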

Frequently Asked Questions

What does "statistically significant" actually mean?

It means that if there were truly no difference between A and B, the probability of observing results this extreme or more extreme by random chance is less than 5% (for 95% confidence). It does NOT mean the difference is practically important or that B will always outperform A in production.

What is the Minimum Detectable Effect (MDE) and how do I choose it?

The MDE is the smallest improvement you consider worth detecting. A smaller MDE requires a larger sample (longer test). For low-traffic sites, set MDE at 15–20% relative to make tests feasible. For high-traffic sites, you can detect 5–10% effects. A useful heuristic: the MDE should represent the minimum improvement that would justify keeping the variant permanently.

How is Bayesian probability different from the frequentist p-value?

A frequentist p-value of 0.05 says: "if there were no effect, this data would occur only 5% of the time." A Bayesian probability of 95% says: "given this data, there is a 95% chance B is actually better than A." Most practitioners find the Bayesian interpretation more actionable, though it depends on the prior assumption (this calculator uses a uniform/flat prior via Beta(1,1)).

Can I stop a test early if it reaches significance?

Peeking at results and stopping when significance is reached dramatically inflates false positive rates. If you check results daily with a 5% threshold, your true false positive rate can exceed 20–30%. Solutions include setting a fixed end date before launch, using sequential testing methods, or applying Bayesian decision rules based on expected loss rather than p-value thresholds.
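The effect of peeking can be explored with a small A/A simulation, in which both arms share the same true conversion rate, so every "significant" result is a false positive. The sketch below is illustrative only: the parameters (10 looks, 200 visitors per arm per look, 400 simulated tests) and the function names are assumptions, and exact rates depend on the seed.

```python
import math
import random

def z_significant(conv_a, conv_b, n, z_crit=1.96):
    """Pooled two-proportion z-test with equal arm size n; True if |z| > z_crit."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return False
    return abs((conv_b - conv_a) / n / se) > z_crit

def simulate_peeking(rate=0.05, looks=10, visitors_per_look=200, sims=400, seed=1):
    """Return (false positive rate when stopping at any look, rate at final look only)."""
    rng = random.Random(seed)
    stopped_early = 0
    fixed_horizon = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        any_sig = False
        sig = False
        for _ in range(looks):
            # Both arms draw from the same true rate: there is no real difference.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            sig = z_significant(conv_a, conv_b, n)
            any_sig = any_sig or sig
        stopped_early += any_sig
        fixed_horizon += sig  # significance at the final, pre-planned look
    return stopped_early / sims, fixed_horizon / sims

peeking_rate, fixed_rate = simulate_peeking()
print(f"stop at first significant look: {peeking_rate:.1%}")
print(f"single look at the planned end: {fixed_rate:.1%}")
```

By construction the peeking rate can never be lower than the fixed-horizon rate, because the final look is one of the looks being checked.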

What is statistical power and why does it matter?

Statistical power (1−β) is the probability of correctly detecting a real difference when one exists. At 80% power, you have a 20% chance of missing a true effect (Type II error). Higher power requires larger samples. Use 80% for exploratory tests, 90% for critical decisions like checkout flows or pricing pages.
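To see how power responds to sample size, the normal approximation can be inverted to estimate the power achieved for a given per-variant sample. This is a rough sketch under the same simplifying assumptions as a standard two-group sample size formula (equal allocation, baseline variance for both arms, two-sided test); the function name `approximate_power` is hypothetical.

```python
import math
from statistics import NormalDist

def approximate_power(baseline, mde_relative, n_per_variant, alpha=0.05):
    """Approximate power to detect a relative lift with n visitors per variant."""
    delta = baseline * mde_relative
    se = math.sqrt(2 * baseline * (1 - baseline) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the observed z-score clears the critical value when the effect is real.
    return NormalDist().cdf(delta / se - z_alpha)

for n in (5_000, 10_000, 22_558, 30_000):
    print(f"n = {n:>6}: power = {approximate_power(0.03, 0.15, n):.2f}")
```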
