A/B testing is the practice of comparing two variants of a marketing element (an email subject line, a landing page headline, a button color, an ad creative) by showing each version to a portion of the audience and measuring which one performs better against a defined goal. The methodology has roots in agricultural experiments from the 1920s and has been a digital-marketing staple for over twenty years. Done well, A/B testing gives marketing teams an evidence-based way to improve campaigns and a discipline that protects against the natural human bias toward confirming what you already believe. Done badly, A/B testing produces noisy results that get interpreted as signal and lead teams astray.
This post walks through what A/B testing actually is, how a proper test is structured, the statistical concepts that matter (without the jargon), the most common pitfalls, and what to test (and not test) for marketing teams just getting started.
What A/B testing actually is
In its simplest form, A/B testing splits an audience into two roughly equal groups: group A sees the current version (the "control"), group B sees the modified version (the "variant"). Both groups have the same opportunity to take some desired action (click, sign up, buy, scroll, whatever the test is measuring). After enough people have been exposed and acted, the team compares the conversion rates of the two groups and decides whether the variant outperformed the control by a meaningful margin.
The discipline matters because the alternative (changing things and hoping they work) doesn’t produce reliable learning. With proper testing, you know whether the change helped, hurt, or made no measurable difference. Without it, every change is a guess defended by post-hoc rationalization.
A/B testing is most useful in contexts with enough traffic and conversion volume to produce statistically meaningful results in reasonable time. High-traffic web pages, large email lists, and active advertising campaigns all generally fit. Low-traffic pages and small lists often don’t produce enough data to test meaningfully, regardless of the testing tool.
How a proper A/B test is structured
A well-designed A/B test has six parts that need to be in place before the test starts running.
A specific hypothesis. "Changing the call-to-action button from gray to green will increase clicks by at least 10%" is a hypothesis. "Let’s try a green button and see what happens" is not. The hypothesis names the change, the expected direction of effect, and the rough magnitude.
A defined metric. What are you measuring? Click-through rate? Conversion rate? Revenue per visitor? Engagement time? Different metrics can give contradictory answers about the same change. Pick the metric that actually matters for the business outcome before the test starts.
A required sample size. Statistical significance requires enough observations to distinguish real effects from random noise. Calculate the required sample size before the test starts (online calculators are widely available); a test that ends before it reaches the required size cannot produce a trustworthy result.
A single variable change between control and variant. If you change the headline AND the button AND the image at the same time, you can’t tell which change drove the difference. Test one thing at a time, or use multivariate testing methods that explicitly handle multiple simultaneous changes.
Random and balanced audience assignment. Each visitor should have an equal chance of seeing the control or the variant, and the assignment should be independent of any factor that might affect the outcome. Most testing platforms handle this automatically; rolling your own with manual cohort assignment is error-prone.
A pre-defined success threshold. "If the variant beats the control by at least X% with at least 95% statistical confidence, we’ll ship the variant." Defining the success threshold in advance protects against the temptation to keep running the test until it produces the answer you wanted.
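For teams curious about what "random and balanced assignment" looks like under the hood, here is a minimal sketch of one common approach: hashing a stable visitor ID so the same person always lands in the same group. The function name, the experiment label, and the 50/50 split are illustrative assumptions, not the method of any particular testing platform.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variant_share: float = 0.5) -> str:
    """Return "control" or "variant"; the same inputs always give the same answer."""
    # Hash the experiment name together with the user ID so that a user's
    # group in one experiment does not determine their group in another.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits (32 bits) into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "variant" if bucket < variant_share else "control"

# Hypothetical usage: the IDs and experiment name are made up.
groups = [assign_variant(f"user-{i}", "green-cta-button") for i in range(10_000)]
print("share in variant:", groups.count("variant") / len(groups))
```

Because the assignment depends only on the ID and the experiment name, a returning visitor sees the same version every time, provided the ID itself is stable (a login ID or a persistent cookie, for example).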
The statistical concepts that matter (in plain English)
You don’t need to be a statistician to run useful A/B tests, but a few concepts come up consistently and deserve to be understood.
Statistical significance describes how surprising your observed difference would be if the control and variant actually performed the same. A test result is typically called "statistically significant at 95% confidence" if a difference at least that large would show up less than 5% of the time from random noise alone. That is not quite the same as a 95% chance that the variant is really better, although it is often read that way. The 95% threshold is convention, not law; for higher-stakes decisions, a 99% threshold is more conservative.
Sample size is the number of observations (visitors, recipients, clicks) the test collects. Larger samples can detect smaller effects with confidence; smaller samples can only detect large effects reliably. The required sample size depends on the baseline conversion rate, the minimum detectable effect, and the confidence level you require.
Statistical power is the probability that a real effect of a given size will actually be detected by the test. A test with low power may miss real effects (false negatives). Power calculations help avoid running tests that are too small to produce useful answers.
Effect size is how big the difference between control and variant is. Tests can show "statistically significant" results for differences too small to matter operationally. A 0.1% increase in conversion rate at very high traffic can be statistically significant and operationally trivial.
Confidence intervals express the range within which the true effect probably lies. "The variant outperformed the control by 7% with a 95% confidence interval of 3% to 11%" tells you not just the point estimate but the uncertainty around it.
The thing to remember: statistical tools tell you whether a difference is likely real, not whether it’s important. The judgment of importance is separate and is yours to make.
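To make significance, effect size, and confidence intervals concrete, here is a minimal sketch of the kind of calculation a testing tool might run when comparing two conversion rates. It assumes a two-sided z-test for two proportions, which is one standard choice among several; the function name and the example counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def compare_conversion_rates(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Compare control (a) and variant (b) with a two-proportion z-test."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    diff = rate_b - rate_a  # effect size, as an absolute difference in rates

    # p-value: how often noise alone would produce a gap at least this large
    # if both versions truly converted at the same (pooled) rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se_null == 0:
        raise ValueError("need at least one conversion and one non-conversion")
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval for the difference, using each group's own rate.
    se_diff = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    interval = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return {"diff": diff, "p_value": p_value, "interval": interval}

# Hypothetical counts: 500 of 10,000 control visitors converted, 580 of 10,000 variant visitors.
result = compare_conversion_rates(500, 10_000, 580, 10_000)
print(result)
```

The p-value answers "is this likely real?", the interval shows how uncertain the estimate is, and neither one says whether the difference is big enough to matter to the business.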
Common A/B testing pitfalls
The most common ways A/B tests produce misleading results:
Ending the test early. Watching the test mid-run and stopping when the variant happens to be ahead is a near-guaranteed way to find effects that aren’t real. Set the sample size in advance, run to that size, then stop.
Running too many tests at once. Multiple simultaneous tests on overlapping audiences interact in ways that confuse results. Sequence tests, or use proper multivariate methods if you must run many at once.
Testing trivial changes that won’t change anything. Button color changes occasionally matter; usually they don’t. Small effects take huge samples to detect, so consider whether the test is worth the traffic.
Ignoring novelty effects. A visually striking variant might outperform the control because it’s new, with the effect fading as the variant becomes the new normal. Long-running tests (weeks rather than days) help separate novelty from durable performance.
Sample contamination. Visitors who see the variant in one session and the control in another aren’t getting a clean test. Most testing platforms handle this with consistent cookie-based assignment; verify your tool actually does this.
Cherry-picking metrics. Running a test, declaring the variant won on one metric, and ignoring that it lost on another. Define the metric in advance and live with the answer it gives.
Misinterpreting "no significant difference" as "no difference." A test that didn’t find a significant difference might mean the two versions are genuinely similar OR that the test didn’t have enough power to detect a real but smaller difference. A power analysis tells you how small an effect the test could reliably have detected, which helps you judge which explanation is more plausible.
Testing on too-small audiences. Small lists or low-traffic pages can rarely produce statistically meaningful results in reasonable time. Some marketing programs aren’t big enough to A/B test rigorously; for those programs, careful qualitative judgment beats fake-rigor testing.
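The first pitfall on this list, ending the test early, is easy to underestimate, so here is a small simulation sketch. It runs many "A/A" tests in which both groups have the same true conversion rate, so any "significant" result is a false alarm, and compares checking once at the end with checking ten times along the way. The 10% conversion rate, the number of looks, and the function names are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, conv_b, n):
    """Two-sided p-value for two groups of equal size n (pooled z-test)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to compare
    z = ((conv_b - conv_a) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_peeking(experiments=500, looks=10, visitors_per_look=100,
                     rate=0.10, alpha=0.05, seed=1):
    """Return (false-alarm rate with peeking, false-alarm rate at the final look only)."""
    rng = random.Random(seed)
    peeking_alarms = 0
    final_alarms = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _look in range(looks):
            # Both groups share the same true rate, so there is no real effect.
            conv_a += sum(rng.random() < rate for _v in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _v in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, conv_b, n)
            if p < alpha:
                ever_significant = True  # a peeker would have stopped here
        peeking_alarms += ever_significant
        final_alarms += p < alpha  # p now holds the value from the last look
    return peeking_alarms / experiments, final_alarms / experiments

peeking_rate, fixed_rate = simulate_peeking()
print(f"false alarms when peeking: {peeking_rate:.1%}, when waiting: {fixed_rate:.1%}")
```

Waiting until the planned sample size keeps false alarms near the 5% the threshold promises; stopping the first time the numbers look good pushes the rate well above that.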
What to test (and what not to)
Test things with big potential impact: headlines, value propositions, calls to action, offer structure, page layout, email subject lines, ad creative direction, pricing presentation. These have demonstrated large effects in many tested contexts.
Test things that contradict your intuition: when the team disagrees about which approach will work, testing produces a real answer instead of letting the loudest opinion win.
Don’t test things that are too small to move the needle. Tiny color variations, single-word copy changes in low-visibility places, minor layout shifts. These rarely produce detectable effects, and you’ll exhaust your testing capacity on tests that don’t matter.
Don’t test things that are obvious. Some changes are clearly better and don’t need a test (fixing broken links, removing typos, complying with legal requirements). Save testing capacity for genuine unknowns.
Don’t test critical fix-ASAP changes. If a security or compliance issue requires an immediate change, ship it. Testing is for optimization, not for emergency response.
Don’t test on audiences too small for the methodology. A small business with 200 monthly visitors and 2 monthly conversions cannot produce statistically meaningful test results in reasonable time, regardless of testing tool sophistication.
A simple starting framework
For a marketing team new to A/B testing, the realistic ramp:
- Identify a high-impact page or asset with enough traffic or conversion volume to produce meaningful results. The homepage, the main signup page, the top-of-funnel ad, or the highest-volume email all tend to qualify.
- Pick one element with a clear hypothesis for improvement. Headline rewrite, call-to-action change, value proposition reframe.
- Use a real testing tool. Google Optimize was a common starter tool before its sunset; current options include VWO, Optimizely, AB Tasty, Convert, and many platform-native options (built into Mailchimp for emails, into ad platforms for creatives).
- Calculate sample size in advance using an online calculator. Most tools have one built in.
- Run the test to completion. Don’t peek and stop early.
- Document the result regardless of whether it was significant. Tests that don’t produce winners are still useful learning if you record what was tested and why it didn’t move the metric.
- Build a backlog of test ideas and work through it systematically. The learning from a year of disciplined testing compounds into a meaningful advantage over a year of guessing.
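For the sample-size step in the list above, it can help to see roughly what an online calculator is doing. The sketch below uses a standard approximation for comparing two conversion rates and assumes a two-sided test with 80% power, a common default that your tool may or may not share. The function name and the example inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, relative_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed in EACH group to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # rate you hope the variant reaches
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: a 5% baseline conversion rate, hoping for a 20% relative lift (5% to 6%).
print(sample_size_per_variant(0.05, 0.20))
# The same page with a 10% relative lift needs far more traffic.
print(sample_size_per_variant(0.05, 0.10))
```

Halving the effect you want to detect roughly quadruples the traffic you need, which is why tests of small changes so often run out of audience before they reach an answer.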
Frequently Asked Questions
How much traffic do I need to run an A/B test?
It depends on the baseline conversion rate and the size of the effect you’re trying to detect. As a rough rule, detecting a 10% relative improvement in a 2% conversion rate (2.0% to 2.2%) at 95% confidence and the conventional 80% power requires roughly 80,000 visitors per variant. Smaller effects require much larger samples. Sample size calculators (free tools from Optimizely, VWO, Evan Miller’s site, and many others) give specific numbers for specific situations. Pages with under a few thousand monthly visitors generally can’t produce meaningful A/B test results in reasonable time.
Should I run A/B tests on emails?
Yes, if your list is large enough. Most email service providers (Mailchimp, Klaviyo, Constant Contact, ActiveCampaign) include built-in A/B testing for subject lines, send times, and content variants. Subject line testing is particularly impactful and works on relatively modest list sizes. The same sample-size discipline applies: small lists won’t produce statistically meaningful results.
How long should an A/B test run?
Long enough to collect the pre-calculated sample size, and long enough to capture variability in your audience’s behavior (typically at least one full business cycle, often a full week or two even if the sample size is hit earlier). Tests that end after a couple of days may not capture day-of-week or time-of-day patterns and can mislead.
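As a rough planning aid, the two conditions above can be combined into one estimate. This sketch assumes every daily visitor enters the test, that traffic is split evenly across the versions, and that a "business cycle" is one week; the function name and numbers are hypothetical.

```python
from math import ceil

def planned_duration_days(required_per_variant, daily_visitors, variants=2, min_days=7):
    """Days to run: enough to reach the sample size, rounded up to whole weeks."""
    days_for_sample = ceil(required_per_variant * variants / daily_visitors)
    days = max(days_for_sample, min_days)
    # Round up to full weeks so every day of the week is represented equally.
    return ceil(days / 7) * 7

# Hypothetical inputs: 8,000 visitors needed per version, 1,500 visitors a day.
print(planned_duration_days(8_000, 1_500))
```

If the estimate comes out at several months, that is usually a sign the page doesn’t have the traffic for this particular test, not a reason to shorten it.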
What if my A/B test shows no significant difference?
This is a legitimate and common outcome. It typically means either the two versions are genuinely similar in effectiveness, or the test didn’t have enough power to detect a real difference. Either way, the answer is the same: don’t ship the variant on the assumption it’s better, and consider whether the next test should be a bigger change that has more potential to produce a detectable effect.
Is A/B testing only for digital marketing?
The methodology applies anywhere you can split an audience and measure outcomes. Direct mail campaigns have used split testing for decades. Retail stores have tested layouts and merchandising. The digital context just makes the mechanics much easier (instant assignment, automatic measurement) than offline alternatives. The underlying statistical discipline is the same regardless of the channel.







