Free tool · Real statistics

Is your A/B winner real or luck?

That 'winning' subject line might just be lucky. Run the same z-test a stats package would — and learn how many sends you actually need before testing.

Your results

Works for any binary outcome — replies, clicks or human opens.

Planning a test instead?

Baseline rate: 3%
Lift to detect: +50%

You need about 2,070 sends per variant to reliably detect a +50% lift from a 3% baseline.

No winner yet — keep sending

Confidence is 85.8% — below the 95% bar. The difference you see could still be random noise.

Variant A: 3.0%
Variant B: 4.8%
Relative lift (B vs A): +60.0%
Confidence: 85.8%
Bar to call a winner: 95%

Two-proportion z-test, two-tailed. One honest caveat: peeking at results daily and stopping the moment you cross 95% inflates false positives — decide your sample size first, then test.
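
For readers who want to check the arithmetic, here is a minimal sketch of a pooled two-proportion z-test (two-tailed) using only Python's standard library. The function name and the counts in the usage line are assumptions for illustration: 15 replies out of 500 sends versus 24 out of 500 are hypothetical inputs, chosen because they reproduce the 3.0% and 4.8% rates shown above. The page does not say which variance estimate its calculator uses; the pooled form below is one common choice.

```python
from math import sqrt
from statistics import NormalDist


def ab_confidence(sends_a, hits_a, sends_b, hits_b):
    """Return (z, confidence) for a pooled two-proportion z-test, two-tailed.

    "Confidence" here means 1 minus the two-tailed p-value.
    All names are illustrative, not the tool's own API.
    """
    rate_a = hits_a / sends_a
    rate_b = hits_b / sends_b
    # Pooled rate: the best single estimate if A and B truly do not differ.
    pooled = (hits_a + hits_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return 0.0, 0.0  # no variation at all, so nothing to test
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, 1 - p_value


# Hypothetical counts: 15/500 = 3.0% and 24/500 = 4.8%.
z, confidence = ab_confidence(500, 15, 500, 24)
print(f"z = {z:.2f}, confidence = {confidence:.1%}")  # z = 1.47, confidence = 85.8%
```

With these assumed counts the result lands below the 95% bar, which is why a +60% relative lift can still read as "no winner yet".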

The three ways A/B tests lie to you

Small samples

At cold email volumes, most 'wins' are noise. Confirming a 3% vs 4.5% reply-rate difference takes roughly two thousand sends per variant, far more than most tests ever get.

Peeking

Checking daily and stopping at the first significant readout roughly doubles your false positives. Fix the sample size first; look once.

Bot-inflated opens

Testing on raw opens means testing on Apple's proxy servers and security scanners. Test on replies — or on opens verified as human.

Questions, answered honestly

What does 95% confidence actually mean?

That if there were truly no difference between your variants, a gap this large would show up by random chance less than 5% of the time. It does NOT mean variant B is 95% likely to be better — but as a practical decision bar, 95% keeps you from shipping noise.

Why does my obvious winner say 'not significant'?

Sample size. 5 replies vs 8 replies looks like a 60% lift, but with 100 sends per variant it's statistically indistinguishable from luck. Reply-rate differences are small in absolute terms (2% vs 3%), and small absolute differences need surprisingly large samples. That's what the planner above computes.
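
The page does not state the planner's formula, so the following is only a sketch of how such a sample-size estimate can be computed. It shows two standard approaches: a quick rule of thumb, n ≈ 16 · p(1 − p) / δ² (roughly 5% two-sided significance and 80% power, with the variance taken at the baseline rate), and the fuller normal-approximation formula. The function names, the 80% power default, and the choice of variance term are assumptions. The rule of thumb happens to reproduce the "about 2,070 sends per variant" quoted for a +50% lift from a 3% baseline, but that match is an observation, not confirmation of what the tool does.

```python
from math import ceil
from statistics import NormalDist


def sends_rule_of_thumb(baseline, relative_lift):
    """Sends per variant via the 16 * p * (1 - p) / delta^2 shortcut."""
    delta = baseline * relative_lift  # absolute difference to detect
    return ceil(16 * baseline * (1 - baseline) / delta ** 2)


def sends_normal_approx(baseline, relative_lift, alpha=0.05, power=0.80):
    """Sends per variant via the two-sided normal-approximation formula."""
    target = baseline * (1 + relative_lift)
    delta = target - baseline
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = baseline * (1 - baseline) + target * (1 - target)
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


print(sends_rule_of_thumb(0.03, 0.50))  # 2070
print(sends_normal_approx(0.03, 0.50))  # somewhat higher, roughly 2,500
```

Either way, the required sample grows with the inverse square of the absolute difference: halving the gap you want to detect roughly quadruples the sends you need.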

Which metric should I test on — opens, clicks or replies?

Replies, almost always. Opens are polluted by bots and Apple's proxy (often 30–60% of 'opens' aren't human), and clicks are rare in cold email. Replies are the metric that pays you and the one bots and security scanners can't fake. The catch: replies are rarest, so they need the biggest samples.

What's wrong with checking results every day and stopping at 95%?

It's called peeking, and it roughly doubles your false-positive rate: across many looks, random noise will cross the 95% line at least once even with no real difference. The honest workflow is the one the planner supports — decide the sample size first, run until you hit it, then look.
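
The effect of peeking is easy to see in a small simulation. The sketch below runs many A/A tests, where both variants share the same true reply rate, so every "significant" result is a false positive. It compares stopping at the first daily look that crosses 95% against looking once at the end. Every setting here (3% reply rate, 100 sends per variant per day, 10 days, 1,000 trials, the seed) is an assumption chosen for illustration, and the exact rates will shift with those settings.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(sends_a, hits_a, sends_b, hits_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (hits_a + hits_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return 1.0  # no variation yet, so nothing to distinguish
    z = (hits_b / sends_b - hits_a / sends_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(rate=0.03, sends_per_day=100, days=10, trials=1000, seed=1):
    """Return (false-positive rate with daily looks, rate with one final look)."""
    rng = random.Random(seed)
    crossed_any_day = 0  # trials that hit 95% on at least one daily look
    crossed_at_end = 0   # trials that hit 95% on the final look only
    for _ in range(trials):
        hits_a = hits_b = 0
        crossed = False
        significant = False
        for day in range(1, days + 1):
            # Both variants draw replies from the same true rate (an A/A test).
            hits_a += sum(rng.random() < rate for _ in range(sends_per_day))
            hits_b += sum(rng.random() < rate for _ in range(sends_per_day))
            sends = day * sends_per_day
            significant = p_value(sends, hits_a, sends, hits_b) < 0.05
            crossed = crossed or significant
        crossed_any_day += crossed
        crossed_at_end += significant  # result of the last day's look
    return crossed_any_day / trials, crossed_at_end / trials


peek_rate, fixed_rate = simulate_peeking()
print(f"daily looks: {peek_rate:.1%}   single look: {fixed_rate:.1%}")
```

Because the final look is one of the daily looks, the peeking rate can never be lower than the look-once rate; the simulation shows how much higher it climbs under the assumed settings.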

How does Norbelys decide A/B winners?

The engine reserves a configurable percentage of sends for testing and splits it across variants, requires a minimum sample per variant, and picks winners on replies (or human-verified opens). It then routes the remaining volume to the winner automatically. This calculator runs the same kind of math on numbers you enter by hand.

Let the engine run the test for you.

Norbelys runs A/B tests on live campaigns with minimum samples enforced and winners picked on real replies — then shifts the remaining volume automatically. Statistics included, spreadsheet not required.

Start sending

From $29/mo · Cancel anytime