7 Things You (Probably) Don’t Know About A/B Testing
A or B? Which does the user prefer?
Two options, one winner, ship the better one.
That’s usually how A/B testing is explained, and in theory, it checks out.
But once you’ve run a few real experiments, it becomes pretty clear that A/B testing isn’t just about picking a winner.
It involves going beyond the result to understand what influenced it, who it worked for, and whether it actually matters beyond the test environment.
Context matters. Timing matters. Segments matter. Even how you measure “better” matters.
So in this edition of The Product Notebook, we’re breaking down 7 things you probably don’t know (or maybe haven’t fully considered) about A/B testing and how they affect the way you run and interpret your tests.
But first, a quick refresher on the basics
How A/B Testing works
Think of it like a controlled taste test. Two recipes, same kitchen, same ingredients; except one thing is different.
You don’t tell the tasters which is which. You just watch what they reach for.
At the end, you have a result that (hopefully) wasn’t influenced by preference, bias, or opinion. Just behaviour.
A/B testing follows the same idea; you isolate a single variable and expose it to two randomly assigned groups of users at the same time.
One group gets the existing experience. The other gets your proposed change.
You measure both against a pre-set metric and let user behaviour speak for itself. The test runs until you’ve collected enough data to trust the result.
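To make that concrete, here’s a minimal sketch of how the random assignment is often implemented in practice: a deterministic, hash-based 50/50 split. Everything here (function names, the experiment name) is illustrative; real experimentation platforms layer a lot more on top.

```python
import hashlib

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into variant A or B."""
    # Hash the user together with the experiment name so the split is
    # stable across sessions and independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100      # a number from 0 to 99
    return "B" if bucket < 50 else "A"  # 50/50 split

print(assign_variant("user_123", "checkout_copy_test"))  # same answer on every call
```

The hash matters more than it looks: a returning user always lands in the same group, which is what keeps the two experiences cleanly separated for the life of the test.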
Simple in concept. But in practice, the complexity isn’t in running the test.
It’s in interpreting what the result actually means and, more importantly, what it doesn’t.
That’s exactly what these seven points will unpack.
1. Letting a test run for too long is never a great idea.
Running a test without a set end doesn’t make the result better; it makes it less dependable. Here’s why:
Your users don’t stay the same and surrounding variables change.
The people using your product in week one aren’t behaving the same way by week 10.
Maybe a marketing campaign goes live and brings in a different type of user.
Maybe it’s month-end and spending behaviour changes.
Maybe a competitor launches a new feature that changes expectations.
For Example: if you’re testing a new checkout flow and a discount campaign starts halfway through, your conversion rate might go up but not because of your test.
Now your result is mixing two things: your experiment and everything happening around it.
Then there’s the novelty effect. When users first encounter something new, engagement tends to spike, not because the variant is better, but because it’s different.
That lift shows up in your data and it’s easy to read as a win. But it isn’t durable. Over time, curiosity fades, behaviour normalises, and the numbers drift back toward where they were.
If you happened to be watching the dashboard when the spike was at its peak, there’s a good chance you already made the call based on a temporary surge, not a consistent pattern.
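One cheap guard against the novelty effect is to read the lift per week instead of a single cumulative number. A quick sketch with made-up data:

```python
import pandas as pd

# Hypothetical weekly conversion rates for each variant
weekly = pd.DataFrame({
    "week": [1, 1, 2, 2, 3, 3],
    "variant": ["A", "B"] * 3,
    "conv_rate": [0.050, 0.062, 0.051, 0.055, 0.050, 0.051],
})

rates = weekly.pivot(index="week", columns="variant", values="conv_rate")
rates["lift"] = (rates["B"] - rates["A"]) / rates["A"]
print(rates)  # lift shrinks from +24% to +2%: novelty fading, not a durable win
```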
So what’s the right window? Two weeks is what most teams default to, and for a lot of products it’s a reasonable starting point. But this is not a one-size-fits-all situation.
For Example: A ride-hailing app testing a change to the booking screen might gather enough data quickly since ride requests happen often.
But if they’re testing something like “driver retention over 30 days” or “repeat usage after first ride,” those outcomes take longer to materialise and a short test window won’t capture the full picture.
Every test needs a finish line set before the data starts coming in. The right question going in is “how long does it take for enough of the right event(s) to happen?” Answer that clearly and your test window will be far more defensible than whatever is considered “best practice”.
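In code, that question is mostly arithmetic: take the sample each variant needs (more on sample size in the next point) and divide by how quickly eligible users arrive. A rough sketch, with illustrative numbers:

```python
import math

def test_duration_days(n_per_variant: int, daily_eligible_users: int,
                       variants: int = 2) -> int:
    """Days until every variant has seen its required sample,
    assuming traffic is split evenly between variants."""
    per_variant_per_day = daily_eligible_users / variants
    return math.ceil(n_per_variant / per_variant_per_day)

# e.g. 6,200 users needed per variant, 1,500 eligible users per day
print(test_duration_days(6200, 1500))  # 9 days
```

Whatever this returns, it’s usually worth rounding up to whole weeks so the window covers full weekday/weekend cycles.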
2. Your sample size matters way more than you think
“How many users need to see this?”
This is a key question to bring up before running your tests.
That number, your required sample size, is the foundation everything else sits on. Get it wrong and it doesn’t matter how well the test was designed, how clean the implementation was, or how long it ran.
Because without enough data, you’re not uncovering a real pattern. You’re just seeing random behaviour that looks structured enough to believe.
Think about it this way.
If you wanted to know which of two restaurant dishes people preferred, but only five people had tried each one, waiting another week doesn’t make that feedback more trustworthy. What you need is more people at the table. Fifty. Maybe five hundred.
The same logic applies here.
Every A/B test needs a minimum amount of data before you can trust the result; that minimum is what a power calculation gives you.
It’s what keeps you from being fooled by noise: results that look like wins but are actually just random variation.
Plus, that number isn’t arbitrary. It’s defined before the test starts, based on three things: the size of the impact you’re trying to detect, how much your metric naturally fluctuates, and how confident you want to be in the outcome.
So before the next test launches, calculate the sample size you need.
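For a conversion-style metric, the standard two-proportion formula is enough to get that number. A minimal sketch using scipy; the baseline and target rates are made up:

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(p_base: float, p_target: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH variant to detect a shift from
    p_base to p_target, via the two-proportion z-test formula."""
    z_alpha = norm.ppf(1 - alpha / 2)  # how confident you want to be
    z_beta = norm.ppf(power)           # statistical power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)  # metric noise
    effect = p_target - p_base         # smallest impact you care about
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# e.g. 5% baseline conversion, hoping to detect a lift to 6%
print(sample_size_per_variant(0.05, 0.06))  # 8,155 users per variant
```

Note how the effect size sits squared in the denominator: detecting a lift half as big needs roughly four times the users, which is why small tweaks are so expensive to measure.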
And if your traffic means hitting that number would take months, that’s worth taking seriously, because it might mean A/B testing isn’t the right method for where you are right now.
Products with low traffic often get more useful insights from usability testing, user interviews, or session recordings than from an experiment that will never accumulate enough data to be conclusive.
3. A winning variant tells you what happened. It has no idea why.
Say you’re running an e-commerce store and you test two versions of your product page. Version A has an “Add to Cart” button while Version B says “Get It Today.”
Version B wins and conversion rate goes up 18%. Good result. But what actually drove that?
Was it the urgency in the word “Get”? Did “Today” nudge users who were already on the fence? Did the new copy just feel more human than the clinical version it replaced?
Or was it something else entirely: the user segment that happened to be in the test, the time of month it ran, the fact that a competitor was out of stock that week?
The test cannot tell you. All it confirms is that Version B outperformed Version A under those specific conditions.
Everything else is you filling in the gap, and the tricky part is that the gap always gets filled, usually with whatever explanation feels most logical in the room.
This is how teams end up trusting the wrong insight. They take a surface-level win, assume they know what drove it, and apply that logic everywhere.
Over time, the results fall apart because the pattern they acted on was never really there.
So, the result of a test is a starting point, not a conclusion. What turns it into something useful is the work you do around it: talking to users who converted and users who didn’t, watching session recordings from the test period, and running a follow-up experiment that isolates the variable you think was responsible.
Because if you don’t know what drove the result, you haven’t really learnt anything.
4. A test that finds nothing didn’t fail. It did exactly what it was supposed to.
You might think that if the test didn’t produce a winner, it was a waste of time.
It wasn’t.
Understand that A/B testing isn’t designed to confirm your ideas. It’s designed to check whether your ideas actually change anything. And sometimes, they don’t.
For Example: Let’s say you change the colour of a CTA button from blue to green. You run it, hit your numbers, and the result comes back flat. No meaningful difference either way.
You may be tempted to think nothing happened. But something did.
You just learned that your users aren’t making decisions based on button colour, at least not in a way that registers. That’s not nothing. That’s an insight. It tells you the friction you’re trying to solve probably isn’t sitting where you thought it was, which means you can stop spending energy there and start looking somewhere that might actually move the needle.
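If you want more than an eyeball check that a result really is flat, a two-proportion z-test is the usual tool. A sketch with invented numbers, using statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical outcome of the blue-vs-green button test
conversions = [412, 425]    # conversions in each variant
exposures = [10000, 10000]  # users who saw each variant

stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
print(f"p-value: {p_value:.2f}")  # ~0.65: no detectable difference
```

A large p-value doesn’t prove the variants are identical; it says any difference is too small to see at this sample size, which is exactly the “users don’t care about button colour” insight.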
This is what a healthy experimentation culture looks like from the inside.
Not a streak of wins on a dashboard, but a steady, consistent process of eliminating what doesn’t matter so the team can focus on what does. Null results are part of that process.
Over time, those “nothing happened” results are what improve your focus, so when something does move, you understand why.
5. Running multiple experiments at the same time can distort your results.
Running more than one test at the same time, in the same product, is one of the easiest ways to skew your results.
This is because your users don’t experience them separately; they experience your product as one continuous thing.
And when two experiments are running concurrently on overlapping users, they stop being independent. They start affecting each other in ways that are almost impossible to untangle after the fact.
For Example: Let’s say one team is testing a new landing page headline to improve sign-ups, while another is testing a shorter onboarding flow to improve activation. Both tests are live. Both are pulling from the same user base.
Your sign-up numbers go up. Good news, but whose win is it?
Was it the headline that drove more people in?
The smoother onboarding that made completing the flow feel easier?
Or did the two changes interact in a way that neither team accounted for, creating a lift that neither change would have produced on its own?
The test can’t tell you, and that’s where it gets genuinely problematic.
One test might be amplifying the other’s impact, quietly cancelling it out, or behaving completely differently depending on which combination of variants a particular user happened to see. What may look like a clear result is really a tangle of variables that nobody separated before the data started coming in.
This is called an interaction effect, and it’s fairly common during experimentation.
The discipline required here isn’t complicated, but it does require coordination.
Know what tests are live before you launch a new one. Map the overlap. If two experiments are touching the same users at the same time, either stagger them or explicitly account for the interaction in how you read the results.
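One common way to enforce that coordination is to group experiments into mutually exclusive “layers”, so any one user sees at most one experiment per layer. A minimal sketch; the layer and experiment names are illustrative:

```python
import hashlib

# A user is placed in exactly one experiment per layer, so tests
# within a layer can never contaminate each other.
LAYERS = {
    "acquisition": ["headline_test", "hero_image_test"],
    "onboarding": ["short_flow_test"],
}

def experiment_for(user_id: str, layer: str) -> str:
    """Deterministically pick one experiment in a layer for this user."""
    experiments = LAYERS[layer]
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return experiments[int(digest, 16) % len(experiments)]

print(experiment_for("user_123", "acquisition"))  # one test, never both
```

Experiments that genuinely can’t interact can live in different layers and run concurrently; the layer boundary is where you’ve decided overlap is safe.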
6. Segments Matter More Than Averages
A/B test results almost always show up as a single number. A lift. A drop. A percentage change that the team rallies around in a review meeting.
It feels simple. Actionable. Unfortunately, it can also be misleading.
When you rely on averages, you trade nuance for simplicity and that trade-off isn’t always in your favour.
For Example: You run a test on your onboarding flow and see a +4% increase in completion rate.
Looks like a win. But when you break it down:
New users improved by +12%
Returning users dropped by 6%
Now the story changes.
The “average” result suggests success but in reality, you’ve improved one group while hurting another.
And depending on your product, that trade-off might not be acceptable.
This is why segmentation is part of reading a result, not something you do afterwards if you have time.
New versus returning. Mobile versus desktop. High intent versus low. Users from different acquisition channels who often behave in completely different ways on the exact same surface.
Sometimes the real win is buried inside a segment the top-line number never showed you, and sometimes the win is masking a loss in a group you can’t afford to lose.
Good product decisions aren’t made on averages. They’re made on understanding which users are driving the change and what it means for the ones who aren’t.
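Mechanically, the breakdown is one groupby away. A sketch with toy data, just to show the shape of the analysis:

```python
import pandas as pd

# Toy per-user results: which variant they saw, their segment, whether they converted
df = pd.DataFrame({
    "variant": ["A", "A", "B", "B"] * 3,
    "segment": ["new", "returning"] * 6,
    "converted": [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
})

overall = df.groupby("variant")["converted"].mean()
by_segment = df.groupby(["segment", "variant"])["converted"].mean().unstack()

print(overall)     # the single average the team rallies around
print(by_segment)  # in this toy data, B's entire lift comes from new users
```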
7. A/B Testing Can’t Fix a Bad Hypothesis
A/B testing is only as good as the thinking that goes into it before the test starts.
The tool doesn’t validate ideas. It measures them. Which means if the idea going in is built on a shaky assumption, the test will dutifully measure that too and the result, whatever it is, won’t tell you what you actually need to know.
For Example: Let’s say you notice users are dropping off during sign-up and your immediate conclusion is that “The form is too long.”
So you test it. You remove a few fields. Shorten the flow. Make it feel lighter.
You run the A/B test and nothing changes.
At this point, it’s tempting to think: maybe the change wasn’t dramatic enough. Maybe a few more fields need to go. So another round gets planned, chasing the same hypothesis with a slightly different variation.
But it’s possible the form was never the problem.
When someone actually talks to the users who dropped off, the picture changes completely.
You then realise that they didn’t leave because the form felt long. They left because they weren’t convinced it was worth completing. Nobody had told them clearly enough what was waiting on the other side. The value proposition was doing too little work too late, and by the time users hit the form they were already halfway out.
It was not a form problem. It was a trust and clarity problem. And no amount of field removal was going to fix it.
This is what happens when the hypothesis is built on surface-level assumptions. The test focuses on what’s easy to tweak, not what’s actually influencing user behaviour.
Everything about the experiment can be done right; the setup, the execution, the analysis and still lead nowhere.
Hence, the quality of an A/B test is determined before it ever runs. What do you believe is driving the behaviour, and what evidence supports that? If it’s based on instinct alone, your hypothesis may need some refinement before it needs testing.
In theory, A/B testing is simple: pick the best version and ship it.
In practice, it’s a complex exercise in human behaviour and environmental variables.
This is because what you see in a test is never just the change you made. It’s timing, context, user mix, and everything happening around that moment all interacting at once.
And that’s where most of the mistakes come from: not running the test, but interpreting it too quickly, too literally, or without enough context.
The teams that get real value from A/B testing aren’t the ones chasing wins.
They’re the ones trying to understand what’s actually driving the outcome.
Once you understand that, you’re not just shipping better variants; you’re making better decisions.
Until Next Time