Free tool · Real statistics
Is your A/B winner real or luck?
That 'winning' subject line might just be lucky. Run the same z-test a stats package would — and learn how many sends you actually need before testing.
No winner yet — keep sending
Confidence is 85.8% — below the 95% bar. The difference you see could still be random noise.
Variant A: 3.0%
Variant B: 4.8%
- Relative lift (B vs A): +60.0%
- Confidence: 85.8%
- Bar to call a winner: 95%
Two-proportion z-test, two-tailed. One honest caveat: peeking at results daily and stopping the moment you cross 95% inflates false positives — decide your sample size first, then test.
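The readout above can be reproduced with a short script. This is a minimal sketch of a pooled, two-tailed two-proportion z-test, not the calculator's actual source. The page does not show the send counts, so the example assumes 500 sends per variant with 15 and 24 replies, which yields the 3.0% and 4.8% rates shown; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_confidence(replies_a, sends_a, replies_b, sends_b):
    """Return (z, confidence) for a pooled, two-tailed two-proportion z-test."""
    rate_a = replies_a / sends_a
    rate_b = replies_b / sends_b
    # Under the null hypothesis both variants share one reply rate.
    pooled = (replies_a + replies_b) / (sends_a + sends_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        # No replies at all (or all replies): nothing to distinguish.
        return 0.0, 0.0
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, 1 - p_value


# Assumed counts: 15/500 = 3.0% and 24/500 = 4.8% (sample sizes are not given on the page).
z, confidence = two_proportion_confidence(15, 500, 24, 500)
print(f"z = {z:.2f}, confidence = {confidence:.1%}")  # z = 1.47, confidence = 85.8%
```

With those assumed counts the script lands on the same 85.8% as the readout, short of the 95% bar.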
The three ways A/B tests lie to you
Small samples
At cold email volumes, most 'wins' are noise. Reliably detecting a 3% vs 4% reply-rate difference takes roughly 5,300 sends per variant (two-tailed 95% bar, 80% power), far more than most tests ever get.
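For readers who want to check a sample-size figure themselves, here is a rough sketch of the standard normal-approximation formula for comparing two proportions. The exact answer depends on the significance level and the power you pick; the page does not state the planner's settings, so the two-tailed 95% bar and 80% power below are assumptions, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sends_per_variant(rate_a, rate_b, alpha=0.05, power=0.80):
    """Approximate sends needed per variant to detect rate_a vs rate_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_power = NormalDist().inv_cdf(power)          # chance of catching a real effect
    mean_rate = (rate_a + rate_b) / 2
    # Spread of the difference under "no difference" and under the real effect.
    null_term = z_alpha * sqrt(2 * mean_rate * (1 - mean_rate))
    alt_term = z_power * sqrt(rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    return ceil((null_term + alt_term) ** 2 / (rate_b - rate_a) ** 2)


needed = sends_per_variant(0.03, 0.04)
print(f"3% vs 4% reply rate: about {needed:,} sends per variant")
```

Under these assumptions the 3% vs 4% case comes out at roughly 5,300 sends per variant. Lowering the power shrinks the number, at the price of missing real winners more often.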
Peeking
Checking daily and stopping at the first significant readout can more than double your false-positive rate. Fix the sample size first; look once.
Bot-inflated opens
Testing on raw opens means testing on Apple's proxy servers and security scanners. Test on replies — or on opens verified as human.
Questions, answered honestly
What does 95% confidence actually mean?
That if there were truly no difference between your variants, a gap this large would show up by random chance less than 5% of the time. It does NOT mean variant B is 95% likely to be better — but as a practical decision bar, 95% keeps you from shipping noise.
Why does my obvious winner say 'not significant'?
Sample size. 5 replies vs 8 replies looks like a 60% lift, but with 100 sends per variant it's statistically indistinguishable from luck. Reply-rate differences are small in absolute terms (2% vs 3%), and small absolute differences need surprisingly large samples — that's what the planner on the left computes.
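Another way to see the 5 vs 8 example is to put an interval around the difference. The sketch below uses a simple Wald confidence interval, which is a swapped-in illustration rather than the z-test this calculator reports, and it is only approximate at counts this small. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_interval(replies_a, sends_a, replies_b, sends_b, level=0.95):
    """Wald confidence interval for (reply rate of B) minus (reply rate of A)."""
    rate_a = replies_a / sends_a
    rate_b = replies_b / sends_b
    # Unpooled standard error: each variant keeps its own observed rate.
    se = sqrt(rate_a * (1 - rate_a) / sends_a + rate_b * (1 - rate_b) / sends_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se


low, high = difference_interval(5, 100, 8, 100)
print(f"B minus A: {low:+.1%} to {high:+.1%}")  # B minus A: -3.8% to +9.8%
```

The interval spans zero, so the same data is compatible with B being nearly four points worse or nearly ten points better than A.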
Which metric should I test on — opens, clicks or replies?
Replies, almost always. Opens are polluted by bots and Apple's proxy (often 30–60% of 'opens' aren't human), and clicks are rare in cold email. Replies are the metric that pays you and the one spam filters can't fake. The catch: replies are rarest, so they need the biggest samples.
What's wrong with checking results every day and stopping at 95%?
It's called peeking, and it can more than double your false-positive rate: across many looks, random noise is far more likely to cross the 95% line at least once, even when there is no real difference. The honest workflow is the one the planner supports: decide the sample size first, run until you hit it, then look.
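The effect is easy to simulate. The sketch below runs A/A tests, where both variants share the same true reply rate, and compares stopping at the first 95% readout against looking once at the end. Every parameter (3% reply rate, 10 looks of 100 sends each, 1,000 trials, the seed) is an assumption chosen for illustration.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-tailed 95% bar


def z_score(replies_a, replies_b, sends):
    """Pooled two-proportion z statistic for two equal-sized variants."""
    pooled = (replies_a + replies_b) / (2 * sends)
    se = sqrt(pooled * (1 - pooled) * 2 / sends)
    return 0.0 if se == 0 else (replies_b - replies_a) / sends / se


def simulate(reply_rate=0.03, sends_per_look=100, looks=10, trials=1000, seed=1):
    """Return (peeking, single-look) false-positive rates for A/A tests."""
    rng = random.Random(seed)
    peeking_hits = single_look_hits = 0
    for _ in range(trials):
        replies_a = replies_b = 0
        crossed_any = False
        for look in range(1, looks + 1):
            # Each look adds a fresh batch of sends to both variants.
            replies_a += sum(rng.random() < reply_rate for _ in range(sends_per_look))
            replies_b += sum(rng.random() < reply_rate for _ in range(sends_per_look))
            significant = abs(z_score(replies_a, replies_b, look * sends_per_look)) > Z_CRIT
            crossed_any = crossed_any or significant
        peeking_hits += crossed_any      # would have stopped at some look
        single_look_hits += significant  # only the final, pre-planned look counts
    return peeking_hits / trials, single_look_hits / trials


peeking_rate, single_look_rate = simulate()
print(f"stop at first 95% readout: {peeking_rate:.1%} false positives")
print(f"look once at the end:      {single_look_rate:.1%} false positives")
```

Because there is no real difference in any trial, every "winner" here is a false positive. The gap between the two printed rates is the cost of peeking.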
How does Norbelys decide A/B winners?
The engine reserves a configurable percentage of sends for the test and splits it across variants, requires a minimum sample per variant, and picks winners on replies (or human-verified opens). It then routes the remaining volume to the winner automatically. This calculator runs the same kind of math, by hand.
Let the engine run the test for you.
Norbelys runs A/B tests on live campaigns with minimum samples enforced and winners picked on real replies — then shifts the remaining volume automatically. Statistics included, spreadsheet not required.
Start sending
From $29/mo · Cancel anytime