A/B Test Calculator

Statistical Significance, Sample Size & Bayesian Analysis

Run A/B (AB) tests with both frequentist (z-test, chi-square, p-value) and Bayesian (Beta distribution) analyses. Calculate required sample size, minimum detectable effect, test duration, and the probability that variant B truly beats the control.

Overview

What Is the A/B Test Calculator?

An A/B test (split test) is a controlled experiment that compares two versions of a page, email, ad, or feature to determine which one drives better outcomes. This calculator supports both the analysis phase (reading your test results) and the planning phase (deciding how long to run the test).

▸Analyze Results tab: Enter raw visitor and conversion counts for control (A) and variant (B) to get statistical significance, Z-score, chi-square, p-value, and Bayesian probability B beats A.
▸Plan Test tab: Enter your baseline conversion rate, minimum detectable effect (MDE), desired statistical power, and daily traffic to calculate required sample size and test duration.
▸Relative uplift: The percentage improvement of variant B over control A. For example, a lift from 3% to 3.6% CVR is a 20% relative uplift; the 0.6 is the absolute change in percentage points.
▸Expected loss: The Bayesian expected loss of choosing B is the conversion rate you would forgo on average, counting only the scenarios in which B is actually worse than A. A small expected loss suggests that shipping B carries little downside, even if the test has not reached 95% significance.
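
The difference between absolute and relative uplift can be checked with a few lines of Python. This is only an illustrative sketch; the function name `uplift` is made up for this example and is not part of the calculator.

```python
def uplift(rate_a, rate_b):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = (rate_b - rate_a) * 100
    relative_pct = (rate_b - rate_a) / rate_a * 100
    return absolute_pp, relative_pct


# The 3% -> 3.6% case from the bullet above.
absolute_pp, relative_pct = uplift(0.03, 0.036)
print(f"absolute: {absolute_pp:.1f} pp, relative: {relative_pct:.0f}%")
# prints: absolute: 0.6 pp, relative: 20%
```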

How it works

A/B Test Calculator Formula and Method

This calculator uses two complementary statistical frameworks: frequentist hypothesis testing and Bayesian probability estimation.

1. Z-Test for Two Proportions

z = (p_B − p_A) / √[p̄(1−p̄)(1/n_A + 1/n_B)]

p̄ = pooled conversion rate, (conversions_A + conversions_B) / (n_A + n_B). The two-tailed p-value comes from the standard normal CDF. The result is significant at p < 0.05 (95% confidence).
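
A minimal sketch of the pooled z-test in Python, using only the standard library. The function name and the input counts are hypothetical, chosen for illustration; the calculator's own implementation may differ in detail.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-tailed p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200/10,000 conversions for A, 260/10,000 for B.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 95%: {p < 0.05}")
```

Using `math.erfc` avoids a dependency on SciPy and stays accurate for very small p-values.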

2. Chi-Square Test

X² = Σ(O−E)²/E (2×2 contingency table)

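Compares observed and expected counts in the 2×2 table (1 degree of freedom). Without a continuity correction, X² equals z² from the pooled two-proportion z-test above, whatever the group sizes, so both tests give the same p-value.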
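
The chi-square statistic for a 2×2 table can be sketched as follows. The code assumes no continuity (Yates) correction, and the helper names and counts are hypothetical. It also checks numerically that X² matches z² from the pooled z-test on a case with unequal group sizes.

```python
import math


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square (no continuity correction). Returns (X^2, p-value)."""
    observed = [
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def pooled_z(conv_a, n_a, conv_b, n_b):
    """z statistic of the pooled two-proportion test, for comparison."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se


# Hypothetical counts with unequal group sizes.
chi2, p_value = chi_square_2x2(120, 4_000, 210, 6_000)
z = pooled_z(120, 4_000, 210, 6_000)
print(f"X^2 = {chi2:.4f}, z^2 = {z * z:.4f}, p = {p_value:.4f}")
```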

3. Bayesian (Beta Distribution)

A ~ Beta(conversions_A + 1, non-conv_A + 1)
B ~ Beta(conversions_B + 1, non-conv_B + 1)

P(B>A) estimated via 10,000 Monte Carlo samples. Expected loss if choosing B = E[max(p_A − p_B, 0)].
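
The Monte Carlo estimate described above can be sketched with Python's standard `random` module. The function name, the fixed seed, and the input counts are assumptions made for this illustration; the calculator's own sampler is not documented here.

```python
import random


def bayesian_ab(conv_a, n_a, conv_b, n_b, samples=10_000, seed=0):
    """Estimate P(B > A) and the expected loss of choosing B.

    Uses Beta(conversions + 1, non-conversions + 1) posteriors,
    i.e. a flat Beta(1, 1) prior on each conversion rate.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    wins = 0
    loss_sum = 0.0
    for _ in range(samples):
        p_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        p_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        if p_b > p_a:
            wins += 1
        # Loss is incurred only in draws where A is actually better.
        loss_sum += max(p_a - p_b, 0.0)
    return wins / samples, loss_sum / samples


# Hypothetical counts: 200/10,000 conversions for A, 260/10,000 for B.
prob_b_better, expected_loss = bayesian_ab(200, 10_000, 260, 10_000)
print(f"P(B > A) = {prob_b_better:.3f}, expected loss = {expected_loss:.6f}")
```

With 10,000 draws the estimate of P(B > A) carries a sampling error of at most about half a percentage point, so the last digit will vary with the seed.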

4. Sample Size (Planning)

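n = 2 × (z_α/2 + z_β)² × p(1−p) / δ²

n = required sample per variant, p = baseline conversion rate, δ = MDE × baseline rate (the absolute difference to detect). The factor of 2 accounts for sampling noise in both groups. z_α/2 = 1.96 (95% conf), z_β = 0.842 (80% power) or 1.282 (90% power).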
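
A rough sketch of the planning calculation in Python. It assumes the common two-group form, n per variant = 2 × (z_α/2 + z_β)² × p(1−p) / δ², and the function name and example inputs are hypothetical. Exact results depend on which variance approximation a tool uses, so treat the output as a ballpark figure.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    delta = baseline * relative_mde                # absolute difference to detect
    n = 2 * (z_alpha + z_beta) ** 2 * baseline * (1 - baseline) / delta ** 2
    return math.ceil(n)


# Hypothetical plan: 5% baseline CVR, 20% relative MDE, 95% confidence, 80% power.
n = sample_size_per_variant(0.05, 0.20)
print(f"per variant: {n:,}, total: {2 * n:,}")
```

Halving the MDE roughly quadruples the required sample, because δ enters the formula squared.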

Frequentist vs Bayesian: The frequentist p-value answers "if there were no real difference, how likely is data at least this extreme?" The Bayesian probability directly answers "given this data and the prior, what is the probability that B is actually better than A?", which is what most practitioners actually want to know.

Instructions

How to Use

1. Choose Analyze Results to evaluate an existing test or Plan Test to size a new one.
2. Analyze: enter total visitors for both control (A) and variant (B).
3. Analyze: enter conversion counts for A and B (must be ≤ visitor counts).
4. Review the significance verdict, Z-score, p-value, and relative uplift percentage.
5. Check the Bayesian probability bar for the probability that B outperforms A.
6. Plan: enter your current baseline CVR from Google Analytics or similar.
7. Plan: enter your minimum detectable effect (MDE) as a relative percentage change.
8. Plan: select statistical power (80% standard, 90% high-stakes) and enter daily traffic to get required sample size and test duration.

Real example

A/B Test Calculator Example

Example 1 | Analyze: E-commerce checkout button test

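Control visitors: 5,000

Control conversions: 175 (3.5% CVR)

Variant visitors: 5,000

Variant conversions: 220 (4.4% CVR)

Relative uplift: +25.7%

Z-score: 2.31

p-value: 0.021

Significant? Yes (p < 0.05)

P(B > A): ~99%

The variant wins at the 95% level. The pooled rate is 395 / 10,000 = 3.95%, so the standard error is √[0.0395 × 0.9605 × (2/5,000)] ≈ 0.0039 and z = 0.009 / 0.0039 ≈ 2.31. At p ≈ 0.021 and roughly 99% Bayesian probability, you can ship the variant with reasonable confidence, though the evidence is less overwhelming than the 25.7% uplift might suggest.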

Example 2 | Plan: Landing page headline test

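Baseline CVR: 2.8%

Min detectable effect: 15% relative

Power: 80%

Confidence: 95%

Daily traffic: 800

Sample per variant: ~24,200

Total sample: ~48,400

Test duration: ~61 days

A 15% relative improvement means detecting 2.8% → 3.22% CVR, an absolute difference of δ = 0.0042. With the two-group formula, n = 2 × (1.96 + 0.842)² × 0.028 × 0.972 / 0.0042² ≈ 24,200 visitors per variant. At 800 daily visitors split across both variants, a clean result takes about 61 days.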

Guide

Understanding A/B Tests

Frequentist vs Bayesian: When to Use Which

Criterion	Frequentist (p-value)	Bayesian (Beta distribution)
Question answered	Is the data surprising under H₀?	What is the probability B > A?
Stopping rule	Must set sample size in advance	Flexible (with caution)
Interpretability	Counterintuitive	Intuitive
Business decision	Binary: significant / not	Continuous: probability + expected loss
Small samples	Unreliable	Handles well via conjugate prior

Common A/B Testing Mistakes

▸Peeking: Checking results daily and stopping at significance. This inflates Type I error rate from 5% to 20%+ for long tests.
▸Multiple comparisons: Testing 5 variants at once without a correction such as Bonferroni gives roughly a 23% chance of at least one false positive (1 − 0.95⁵) even with no real differences.
▸Short tests: Running tests for less than one full business cycle (2 weeks minimum). Day-of-week effects can create spurious results in 3–5 day tests.
▸Low traffic: Testing with fewer visitors than the required sample gives underpowered tests that detect only very large effects or produce false positives.
▸Testing too much simultaneously: Running overlapping experiments without proper controls can create interaction effects that invalidate both tests.
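
The multiple-comparisons risk in the list above can be quantified with a short Python sketch. It assumes the comparisons are independent, which is a simplification when variants share one control, and the function names are made up for this example.

```python
def familywise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-comparison threshold that keeps the family-wise rate near alpha."""
    return alpha / comparisons


comparisons = 5
uncorrected = familywise_error_rate(0.05, comparisons)
per_test = bonferroni_alpha(0.05, comparisons)
corrected = familywise_error_rate(per_test, comparisons)
print(f"uncorrected: {uncorrected:.1%}")
print(f"Bonferroni threshold per comparison: {per_test:.3f}")
print(f"family-wise rate after correction: {corrected:.1%}")
```

Bonferroni is conservative: it keeps the family-wise rate just under 5%, at the cost of needing stronger evidence (p < 0.01 here) for each individual variant.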

What Makes a Good A/B Test Hypothesis

A testable hypothesis should follow the format: "Changing X to Y will increase Z because of reason W." Vague tests ("let's try a different headline") produce results that are hard to generalize. A good hypothesis: "Changing the CTA from 'Submit' to 'Get My Free Quote' will increase form submissions because it communicates value rather than effort."

Sample Size Reference Table

Baseline CVR	MDE 10%	MDE 15%	MDE 20%	MDE 30%
1%	~147,000	~65,000	~37,000	~16,500
2%	~73,000	~32,500	~18,300	~8,100
3%	~48,000	~21,500	~12,100	~5,400
5%	~28,500	~12,700	~7,100	~3,200
10%	~13,500	~6,000	~3,400	~1,500

Per-variant sample sizes at 95% confidence, 80% power.

Rule of thumb for test duration: Required days = Total sample size ÷ (Daily traffic × % allocated to test). Always round up and add a full business cycle buffer. Never stop a test on a Monday morning based on weekend data.
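
The duration rule of thumb translates directly into code. In this sketch the function name, the `buffer_days` parameter, and the example inputs are hypothetical.

```python
import math


def required_days(total_sample, daily_traffic, allocation=1.0, buffer_days=0):
    """Days needed to collect total_sample visitors, rounded up.

    allocation is the fraction of daily traffic entering the test (0 to 1].
    buffer_days is an optional allowance, e.g. one extra business cycle.
    """
    days = math.ceil(total_sample / (daily_traffic * allocation))
    return days + buffer_days


# Hypothetical plan: 20,000 visitors needed in total, 1,500 visitors a day,
# half of the traffic allocated to the test, plus a one-week buffer.
print(required_days(20_000, 1_500, allocation=0.5))                 # 27
print(required_days(20_000, 1_500, allocation=0.5, buffer_days=7))  # 34
```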

Answers

Frequently Asked Questions

What does "statistically significant" actually mean?

It means that if there were truly no difference between A and B, the probability of observing results this extreme or more extreme by random chance is less than 5% (for 95% confidence). It does NOT mean the difference is practically important or that B will always outperform A in production.

What is the Minimum Detectable Effect (MDE) and how do I choose it?

The MDE is the smallest improvement you consider worth detecting. A smaller MDE requires a larger sample (longer test). For low-traffic sites, set MDE at 15–20% relative to make tests feasible. For high-traffic sites, you can detect 5–10% effects. A useful heuristic: the MDE should represent the minimum improvement that would justify keeping the variant permanently.

How is Bayesian probability different from the frequentist p-value?

A frequentist p-value of 0.05 says: "if there were no effect, data at least this extreme would occur only 5% of the time." A Bayesian probability of 95% says: "given this data, there is a 95% chance B is actually better than A." Most practitioners find the Bayesian interpretation more actionable, though it depends on the prior assumption (this calculator uses a uniform/flat prior via Beta(1,1)).

Can I stop a test early if it reaches significance?

Peeking at results and stopping when significance is reached dramatically inflates false positive rates. If you check results daily with a 5% threshold, your true false positive rate can exceed 20–30%. Solutions include setting a fixed end date before launch, using sequential testing methods, or applying Bayesian decision rules based on expected loss rather than p-value thresholds.
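
The effect of peeking can be illustrated with a small A/A simulation, where both arms share the same true conversion rate so every "significant" result is a false positive. This is a sketch under assumed settings (500 simulated tests, 14 daily looks, 100 visitors per arm per day, 10% conversion rate); the exact rates depend on those choices and on the seed.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (0 if the standard error is 0)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_peeking(runs=500, days=14, daily_visitors=100, rate=0.10, seed=1):
    """Return (false positive rate with daily peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        significant_any_day = False
        significant_today = False
        for _day in range(days):
            n += daily_visitors
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            significant_today = abs(z_stat(conv_a, n, conv_b, n)) > 1.96
            significant_any_day = significant_any_day or significant_today
        peeking_hits += significant_any_day  # would have stopped on some day
        final_hits += significant_today      # significant at the planned end
    return peeking_hits / runs, final_hits / runs


peeking_rate, fixed_rate = simulate_peeking()
print(f"false positives with daily peeking: {peeking_rate:.1%}")
print(f"false positives with a single final look: {fixed_rate:.1%}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher, which is the inflation described above.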

What is statistical power and why does it matter?

Statistical power (1−β) is the probability of correctly detecting a real difference when one exists. At 80% power, you have a 20% chance of missing a true effect (Type II error). Higher power requires larger samples. Use 80% for exploratory tests, 90% for critical decisions like checkout flows or pricing pages.
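
Power for a given sample size can be approximated with the normal distribution. The sketch below is an approximation under stated assumptions: a two-sided test, equal group sizes, and the negligible chance of a significant result in the wrong direction ignored. The function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def approximate_power(p_a, p_b, n_per_variant, alpha=0.05):
    """Approximate probability of detecting a true difference p_b - p_a."""
    se = math.sqrt(
        p_a * (1 - p_a) / n_per_variant + p_b * (1 - p_b) / n_per_variant
    )
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Power is the chance that the observed z exceeds the critical value.
    return NormalDist().cdf(abs(p_b - p_a) / se - z_alpha)


# Hypothetical test: true rates of 5% and 6%, at two sample sizes per variant.
for n in (4_000, 8_000, 16_000):
    print(f"n = {n:>6,}: power = {approximate_power(0.05, 0.06, n):.0%}")
```

Running the loop shows how power climbs with sample size, which is why higher power targets translate into longer tests.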
