A/B Test Calculator | Statistical Significance, Sample Size & Bayesian Analysis
Run A/B tests with both frequentist (z-test, chi-square, p-value) and Bayesian (Beta distribution) analyses. Calculate required sample size, minimum detectable effect, test duration, and the probability that variant B truly beats the control.
What Is the A/B Test Calculator?
An A/B test (split test) is a controlled experiment that compares two versions of a page, email, ad, or feature to determine which one drives better outcomes. This calculator supports both the analysis phase (reading your test results) and the planning phase (deciding how long to run the test).
- ▸Analyze Results tab: Enter raw visitor and conversion counts for control (A) and variant (B) to get statistical significance, Z-score, chi-square, p-value, and the Bayesian probability that B beats A.
- ▸Plan Test tab: Enter your baseline conversion rate, minimum detectable effect (MDE), desired statistical power, and daily traffic to calculate required sample size and test duration.
- ▸Relative uplift: The percentage improvement of variant B over control A — e.g., a lift from 3% to 3.6% CVR is a 20% relative uplift, not a 0.6% uplift.
- ▸Expected loss: The Bayesian expected loss of choosing B estimates how much conversion rate you would forgo on average by shipping B, counting only the scenarios where B is actually worse than A. A small expected loss suggests that shipping B carries little downside risk, even before the test reaches 95% significance.
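The distinction between absolute and relative uplift is easy to check by hand. The short sketch below reproduces the 3% to 3.6% example; the function name `uplift` is illustrative and not part of the calculator.

```python
def uplift(rate_a, rate_b):
    """Return (absolute, relative) uplift of variant B over control A."""
    absolute = rate_b - rate_a      # difference in conversion rate
    relative = absolute / rate_a    # difference as a fraction of the control rate
    return absolute, relative

absolute, relative = uplift(0.03, 0.036)
print(f"absolute: {absolute * 100:.1f} percentage points")  # 0.6
print(f"relative: {relative:.0%}")                          # 20%
```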
Formula
This calculator uses two complementary statistical frameworks: frequentist hypothesis testing and Bayesian probability estimation.
z = (p_B − p_A) / √[p̄(1−p̄)(1/n_A + 1/n_B)]
p̄ = (conversions_A + conversions_B) / (n_A + n_B) is the pooled conversion rate. The two-tailed p-value comes from the standard normal CDF, and the result is significant at p < 0.05 (95% confidence).
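As a rough illustration, the z-test above can be written in a few lines of Python using only the standard library. The function name and the visitor counts in the usage lines are hypothetical, and this is a sketch of the formula rather than the calculator's actual code.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 200/10,000 conversions for A, 250/10,000 for B.
z, p_value = two_proportion_z_test(200, 10_000, 250, 10_000)
print(f"z = {z:.2f}, p = {p_value:.3f}, significant: {p_value < 0.05}")
```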
X² = Σ(O−E)²/E (2×2 contingency table)
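Compares observed (O) with expected (E) counts in the 2×2 table of variant × converted/not converted. Without a continuity correction, X² equals z² from the pooled two-proportion test, so both give the same p-value.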
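The relationship between the chi-square statistic and the z-score can be checked numerically. The sketch below computes X² from the 2×2 table (no continuity correction) and compares it with z² from the pooled z-test. Function names and counts are illustrative.

```python
from math import sqrt

def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square for the 2x2 table, without continuity correction."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def pooled_z(conv_a, n_a, conv_b, n_b):
    """z-score of the pooled two-proportion test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se

# Hypothetical counts with different sample sizes per variant.
chi2 = chi_square_2x2(200, 10_000, 150, 6_000)
z = pooled_z(200, 10_000, 150, 6_000)
print(f"X² = {chi2:.4f}, z² = {z * z:.4f}")
```

The two printed values agree up to floating-point rounding.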
A ~ Beta(conversions_A+1, non-conv_A+1); B ~ Beta(conversions_B+1, non-conv_B+1)
P(B>A) estimated via 10,000 Monte Carlo samples. Expected loss if choosing B = E[max(p_A − p_B, 0)].
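A minimal Monte Carlo sketch of this Bayesian step, assuming Python's `random.betavariate` as the sampler. The function name, the fixed seed, and the example counts are assumptions for illustration. The calculator's own implementation may differ.

```python
import random

def bayesian_ab(conv_a, n_a, conv_b, n_b, samples=10_000, seed=0):
    """Estimate P(B > A) and the expected loss of choosing B.

    Posteriors are Beta(conversions + 1, non-conversions + 1),
    i.e. a flat Beta(1, 1) prior updated with the observed counts.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins_b = 0
    loss_sum = 0.0
    for _ in range(samples):
        p_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        p_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        if p_b > p_a:
            wins_b += 1
        loss_sum += max(p_a - p_b, 0.0)  # shortfall only when B is worse
    return wins_b / samples, loss_sum / samples

# Hypothetical counts: 200/10,000 conversions for A, 250/10,000 for B.
prob_b_wins, expected_loss = bayesian_ab(200, 10_000, 250, 10_000)
print(f"P(B > A) = {prob_b_wins:.3f}, expected loss = {expected_loss:.6f}")
```

Because the estimate comes from random draws, it varies slightly with the seed and the number of samples.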
n = (z_α/2 + z_β)² × p(1−p) / δ²
δ = MDE × baseline rate (the absolute difference to detect); p = baseline conversion rate. z_α/2 = 1.96 (95% confidence, two-tailed); z_β = 0.842 (80% power) or 1.282 (90% power).
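The sample-size formula can be sketched as follows. By default the function evaluates the formula exactly as written above. The `two_sample` flag is an addition for illustration: many textbook versions for comparing two groups double the variance term (2 × p(1−p)), which roughly doubles the result. Names and defaults here are assumptions, not the calculator's API.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(baseline, relative_mde, alpha=0.05, power=0.80,
                         two_sample=False):
    """Sample size from n = (z_alpha/2 + z_beta)^2 * p(1-p) / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    delta = relative_mde * baseline                # absolute difference to detect
    variance = baseline * (1 - baseline)
    if two_sample:
        variance *= 2  # assumption: common two-group variant of the formula
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Hypothetical plan: 3% baseline CVR, 20% relative MDE, 80% power.
print(required_sample_size(0.03, 0.20))
print(required_sample_size(0.03, 0.20, two_sample=True))
```

Halving the MDE roughly quadruples the required sample, because δ enters the formula squared.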
How to Use
1. Select the Analyze Results tab to evaluate a completed or running test, or the Plan Test tab to determine the required sample size before launching.
2. Analyze: enter total visitors for both control (A) and variant (B).
3. Analyze: enter conversion counts for A and B. Conversions must not exceed visitors.
4. Review the significance verdict, Z-score, and p-value. Green means the result is statistically significant at 95% confidence (p < 0.05).
5. Check the relative uplift percentage, which measures practical impact rather than statistical significance alone.
6. Use the Bayesian probability bar to read the probability that B outperforms A given the data.
7. Plan: enter your current baseline CVR (from Google Analytics or similar), your minimum detectable effect (MDE) as a relative percentage change, and your daily traffic volume.
8. Plan: select statistical power (80% for standard tests, 90% for high-stakes decisions like pricing or checkout changes) to get the required sample size and test duration.
9. Use the duration output to schedule your test start and end dates, and always run for at least 2 full weeks.
Example Calculation
Example 1 | Analyze: E-commerce checkout button test
Clear winner. At p=0.004 and 99.6% Bayesian probability, you can ship the variant with high confidence.
Example 2 | Plan: Landing page headline test
A 15% relative improvement means detecting 2.8% → 3.22% CVR. At 800 daily visitors, a clean result takes about 11 days.
Understanding A/B Testing
Frequentist vs Bayesian: When to Use Which
| Criterion | Frequentist (p-value) | Bayesian (Beta distribution) |
|---|---|---|
| Question answered | Is the data surprising under H₀? | What is the probability B > A? |
| Stopping rule | Must set sample size in advance | Flexible (with caution) |
| Interpretability | Counterintuitive | Intuitive |
| Business decision | Binary: significant / not | Continuous: probability + expected loss |
| Small samples | Unreliable | Handles well via conjugate prior |
Common A/B Testing Mistakes
- ▸Peeking: Checking results daily and stopping at significance. This inflates Type I error rate from 5% to 20%+ for long tests.
- ▸Multiple comparisons: Testing 5 variants at once without a correction such as Bonferroni raises the chance of at least one false positive to about 23% (1 − 0.95⁵, assuming independent comparisons), even with no real differences.
- ▸Short tests: Running tests for less than one full business cycle (2 weeks minimum). Day-of-week effects can create spurious results in 3–5 day tests.
- ▸Low traffic: Testing with fewer visitors than the required sample gives underpowered tests. They can detect only very large effects, and the significant results they do produce are more likely to be false positives or overestimates.
- ▸Testing too much simultaneously: Running overlapping experiments without proper controls can create interaction effects that invalidate both tests.
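The multiple-comparisons problem is simple to quantify. The sketch below computes the chance of at least one false positive across several comparisons and the Bonferroni-adjusted threshold. It assumes the comparisons are independent, and the function names are illustrative.

```python
def familywise_error_rate(num_comparisons, alpha=0.05):
    """P(at least one false positive), assuming independent comparisons."""
    return 1 - (1 - alpha) ** num_comparisons

def bonferroni_alpha(num_comparisons, alpha=0.05):
    """Per-comparison threshold that keeps the family-wise rate near alpha."""
    return alpha / num_comparisons

for m in (1, 5, 10):
    print(m, round(familywise_error_rate(m), 3), bonferroni_alpha(m))
```

With 5 comparisons, each one would need p < 0.01 under Bonferroni rather than p < 0.05.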
What Makes a Good A/B Test Hypothesis
A testable hypothesis should follow the format: "Changing X to Y will increase Z because of reason W." Vague tests ("let's try a different headline") produce results that are hard to generalize. A good hypothesis: "Changing the CTA from 'Submit' to 'Get My Free Quote' will increase form submissions because it communicates value rather than effort."
Sample Size Reference Table
| Baseline CVR | MDE 10% | MDE 15% | MDE 20% | MDE 30% |
|---|---|---|---|---|
| 1% | ~147,000 | ~65,000 | ~37,000 | ~16,500 |
| 2% | ~73,000 | ~32,500 | ~18,300 | ~8,100 |
| 3% | ~48,000 | ~21,500 | ~12,100 | ~5,400 |
| 5% | ~28,500 | ~12,700 | ~7,100 | ~3,200 |
| 10% | ~13,500 | ~6,000 | ~3,400 | ~1,500 |
Per-variant sample sizes at 95% confidence, 80% power.
Frequently Asked Questions
What does "statistically significant" actually mean?
It means that if there were truly no difference between A and B, the probability of observing results this extreme or more extreme by random chance is less than 5% (for 95% confidence). It does NOT mean the difference is practically important or that B will always outperform A in production.
What is the Minimum Detectable Effect (MDE) and how do I choose it?
The MDE is the smallest improvement you consider worth detecting. A smaller MDE requires a larger sample (longer test). For low-traffic sites, set MDE at 15–20% relative to make tests feasible. For high-traffic sites, you can detect 5–10% effects. A useful heuristic: the MDE should represent the minimum improvement that would justify keeping the variant permanently.
How is Bayesian probability different from the frequentist p-value?
A frequentist p-value of 0.05 says: "if there were no effect, data at least this extreme would occur only 5% of the time." A Bayesian probability of 95% says: "given this data, there is a 95% chance B is actually better than A." Most practitioners find the Bayesian interpretation more actionable, though it depends on the prior assumption (this calculator uses a uniform/flat prior via Beta(1,1)).
Can I stop a test early if it reaches significance?
Peeking at results and stopping when significance is reached dramatically inflates false positive rates. If you check results daily with a 5% threshold, your true false positive rate can exceed 20–30%. Solutions include setting a fixed end date before launch, using sequential testing methods, or applying Bayesian decision rules based on expected loss rather than p-value thresholds.
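The effect of peeking can be illustrated with a small simulation of A/A tests, where both arms share the same true conversion rate and any "significant" result is a false positive. This is a sketch under assumed parameters (14 daily checks, 100 visitors per arm per day, a 10% conversion rate), not a reproduction of the figures quoted above.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(days=14, daily_visitors=100, rate=0.10,
                                alpha=0.05, simulations=400, seed=1):
    """Share of A/A tests declared significant when checked once per day."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(simulations):
        conv_a = conv_b = n = 0  # n = cumulative visitors per arm
        for day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            n += daily_visitors
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(pooled * (1 - pooled) * (2 / n))
            if se > 0 and abs(conv_b - conv_a) / n / se > z_crit:
                false_positives += 1  # stopped early on a spurious "win"
                break
    return false_positives / simulations

print("daily peeking:", peeking_false_positive_rate())
print("single look:  ", peeking_false_positive_rate(days=1, daily_visitors=1400))
```

Both runs collect the same total traffic. The only difference is how often the result is checked.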
What is statistical power and why does it matter?
Statistical power (1−β) is the probability of correctly detecting a real difference when one exists. At 80% power, you have a 20% chance of missing a true effect (Type II error). Higher power requires larger samples. Use 80% for exploratory tests, 90% for critical decisions like checkout flows or pricing pages.