Most testing programs do not fail because the tests are wrong. They fail because nobody writes down the decision in advance, the test gets stopped the day it looks good, and the "winner" quietly evaporates in the P&L a month later. A test that does not change a decision is a vanity metric with extra steps. This is the operating manual: what to test, how to size it, when to call it, and how to prove the win in revenue before you scale it.
Test the money path first, not the button color
Rank your roadmap by impact on contribution margin, not by how fast a change ships. A color swap on a low-traffic page cannot move a number you would report to a board. Concentrate where revenue concentrates: the highest-traffic landing pages, the checkout, the pricing page, the primary CTA, and the offer itself.
Score every idea with PIE or ICE so prioritization is not a popularity contest. With ICE, rate Impact, Confidence, and Ease from 1 to 10 (PIE rates Potential, Importance, and Ease the same way), average the three, and test top-down. Anchor each hypothesis in research, not opinion. If you have not done conversion research (heatmaps, session replays, surveys, funnel drop-off), you are not testing, you are guessing with a calculator.
- High leverage (start here): offer and pricing structure, hero value proposition, checkout steps, lead-form length, primary CTA.
- Medium leverage: page layout, social proof placement, trust signals, navigation.
- Low leverage (deprioritize): button colors, font tweaks, decorative imagery, anything on a page under ~1,000 visitors per month.
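A minimal sketch of ICE scoring, assuming a simple unweighted average of the three ratings. The `Idea` class, the `prioritize` helper, and the backlog entries are hypothetical names and made-up ratings for illustration, not output from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    """One test idea with ICE ratings, each from 1 to 10 (hypothetical structure)."""
    name: str
    impact: int
    confidence: int
    ease: int

    def __post_init__(self):
        # Reject ratings outside the 1-10 scale described in the text.
        for rating in (self.impact, self.confidence, self.ease):
            if not 1 <= rating <= 10:
                raise ValueError("ICE ratings must be between 1 and 10")

    @property
    def score(self) -> float:
        # Unweighted average of the three ratings.
        return (self.impact + self.confidence + self.ease) / 3


def prioritize(ideas):
    """Return ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.score, reverse=True)


# Illustrative backlog; the ratings are invented for the example.
backlog = [
    Idea("Shorter lead form", impact=8, confidence=7, ease=6),
    Idea("New button color", impact=2, confidence=3, ease=10),
    Idea("Restructured pricing offer", impact=9, confidence=6, ease=3),
]
ranked = prioritize(backlog)
```

Note how the button-color idea sinks to the bottom despite a perfect Ease rating: ease alone cannot carry an idea with little impact.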
Write a hypothesis that forces a decision
A real hypothesis names the change, the predicted effect, and the metric you will judge it on. Use this template: "Because [research insight], we believe [change] will cause [effect] for [audience], measured by [primary metric]." Worked example: "Because session replays show 40% of users abandon at the shipping step, we believe moving shipping cost above the fold will lift completed checkouts for paid traffic, measured by revenue per visitor." If you cannot fill every bracket, you are not ready to build the variant.
Pick one primary metric per test, and tie it to money: revenue per visitor, qualified leads, or completed checkouts. Clicks and other micro-conversions are diagnostics, not verdicts. A button that earns more clicks but fewer purchases is a loss. Keep the change isolated enough that a win tells you why it won.
Size the test before you launch it, every time
The single biggest cause of fake wins is stopping early. You stop the math from lying by fixing sample size and duration before launch, then not touching the test until you hit both. Run this sequence:
- Pull the current baseline conversion rate for the page (say, 3%).
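- Set your minimum detectable effect (MDE): the smallest lift worth shipping. With no historical data, start with a 2 to 5% relative lift. The required sample grows roughly with the inverse square of the MDE, so halving the MDE about quadruples the traffic you need.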
- Feed baseline, MDE, 95% confidence, and 80% power into a sample-size calculator (Evan Miller, AB Tasty, or your platform's built-in).
- Divide the required sample per variant by the daily traffic each variant will receive (total daily traffic divided by the number of variants) to get duration. Round up to whole weeks.
- Run full weeks only (minimum two) to absorb weekday, weekend, and payday cycles. Cap most tests at 6 to 8 weeks before you rethink the idea.
- Do not read significance daily, and do not stop the moment it crosses 95%. Call it only at the pre-set end date.
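The sizing steps above can be sketched in a few lines. This is a rough approximation under stated assumptions: it uses the standard normal-approximation formula for a two-sided, two-proportion test, which may differ by a few percent from what your calculator or platform reports. The function names `sample_size_per_variant` and `weeks_to_run` are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion test.

    baseline:     current conversion rate, e.g. 0.03 for 3%
    relative_mde: smallest relative lift worth shipping, e.g. 0.05 for 5%
    alpha:        0.05 corresponds to 95% confidence
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def weeks_to_run(n_per_variant, daily_visitors_per_variant, min_weeks=2):
    """Convert a per-variant sample into whole weeks, never fewer than min_weeks."""
    days = math.ceil(n_per_variant / daily_visitors_per_variant)
    return max(min_weeks, math.ceil(days / 7))


# Example: 3% baseline, 5% relative MDE, 95% confidence, 80% power.
needed = sample_size_per_variant(0.03, 0.05)
# Hypothetical traffic figure: 5,000 visitors per variant per day.
weeks = weeks_to_run(needed, daily_visitors_per_variant=5_000)
```

At a 3% baseline, a 5% relative MDE works out to a little over 200,000 visitors per variant under this approximation. If the resulting duration exceeds the 6 to 8 week cap, that is a signal to test a bolder change with a larger expected effect.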
Volume reality check: reliable tests generally want on the order of tens of thousands of visitors and a few thousand conversions per variant. If your traffic cannot deliver that inside 8 weeks, the test is too small to trust. Bigger swings need less traffic, so on low-traffic sites test bold changes (a new offer, a new layout) rather than tiny ones.
Stop peeking: use sequential testing if you cannot wait
Checking a frequentist test repeatedly and stopping when it looks good inflates false positives fast: an honest 5% error rate can balloon past 25% when you peek after every batch of data. Early stopping biases Bayesian tests too, despite the myth that they are immune. The fix is simple. If you must monitor continuously, use a platform that runs sequential or always-valid inference, which corrects the math for repeated looks. If your tool is plain frequentist, pick an end date and keep your hands off the dashboard until then.
- Do: pre-register the metric, sample size, duration, and decision rule in writing before launch.
- Do: run an A/A test once a quarter to confirm your tooling reports no difference when there is none.
- Don't: stop a test the first day it hits 95%.
- Don't: add or remove variants mid-test, or change the traffic split once it is running.
- Don't: call a winner off a single good weekend.
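To see why peeking matters, here is a small simulation sketch. It runs many A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and compares two stopping rules: stop at the first look that crosses 95%, versus judge only at the final look. The conversion rate, batch size, number of looks, and seed are arbitrary assumptions chosen for illustration.

```python
import math
import random

Z_CRIT = 1.96  # two-sided threshold for 95% confidence


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0
    return (conv_b / n - conv_a / n) / se


def simulate_aa(rate=0.05, batch=200, looks=10, runs=1000, seed=7):
    """Return (false-positive rate with peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        crossed = False
        significant = False
        for _ in range(looks):
            # Each look adds one batch of visitors to both arms.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            significant = abs(z_stat(conv_a, conv_b, n)) > Z_CRIT
            crossed = crossed or significant
        peek_hits += crossed        # would have stopped at some look
        fixed_hits += significant   # only the final look counts
    return peek_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate_aa()
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate typically comes out several times higher, and it keeps growing as you add looks. Sequential or always-valid methods exist to correct exactly this gap.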
Validate the win against the P&L before you scale it
Statistical significance is not the finish line. Profit is. A variant can lift conversion rate and still cost you if it pulls in lower-quality leads, attracts discount-driven buyers, or lifts first orders while tanking repeat rate. Before you roll a winner to 100%, run the downstream checks below:
- Revenue per visitor: did revenue per visitor actually rise, or did you just shift volume to cheaper orders?
- Average order value: confirm the variant did not win by training buyers to buy less.
- Refund and return rate: a checkout that lifts orders but raises refunds, returns, or chargebacks is not a win.
- Retention cohort: where you can track it, check whether the cohort the variant produced repeats at the same rate or better.
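The downstream checks above can be expressed as a simple guardrail pass. This is a sketch under assumptions: the `ArmMetrics` fields and the `failed_guardrails` helper are hypothetical, the figures are invented, and the comparison uses raw point estimates with no allowance for noise, so treat a narrow miss as a prompt to look closer rather than a verdict. It also assumes each arm has at least one visitor and one order.

```python
from dataclasses import dataclass


@dataclass
class ArmMetrics:
    """Downstream totals for one arm of a test (hypothetical structure)."""
    visitors: int
    orders: int
    revenue: float
    refunds: int
    repeat_buyers: int

    @property
    def revenue_per_visitor(self) -> float:
        return self.revenue / self.visitors

    @property
    def average_order_value(self) -> float:
        return self.revenue / self.orders

    @property
    def refund_rate(self) -> float:
        return self.refunds / self.orders

    @property
    def repeat_rate(self) -> float:
        return self.repeat_buyers / self.orders


def failed_guardrails(control, variant):
    """List the checks where the variant is worse than control."""
    failures = []
    if variant.revenue_per_visitor < control.revenue_per_visitor:
        failures.append("revenue per visitor")
    if variant.average_order_value < control.average_order_value:
        failures.append("average order value")
    if variant.refund_rate > control.refund_rate:
        failures.append("refund rate")
    if variant.repeat_rate < control.repeat_rate:
        failures.append("repeat rate")
    return failures


# Invented numbers: the variant lifts orders but with smaller, lower-quality orders.
control = ArmMetrics(visitors=20_000, orders=600, revenue=48_000.0, refunds=30, repeat_buyers=120)
variant = ArmMetrics(visitors=20_000, orders=690, revenue=48_300.0, refunds=48, repeat_buyers=117)
failures = failed_guardrails(control, variant)
```

In this made-up case the variant converts 15% more orders and revenue per visitor edges up, yet it fails on average order value, refund rate, and repeat rate, which is the pattern the checklist is designed to catch.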
Tie results back to unit economics and growth strategy so testing feeds the business, not just a dashboard. For the highest-leverage conversion moves, our landing page optimization and click-through rate guides go deeper.
Build a program, not a pile of one-off tests
One-off tests spike and fade. A system compounds. Teams that win run a continuous loop and keep an archive, so they stop re-running settled questions and start stacking learnings. Log every test: hypothesis, screenshots, sample size, result, and the decision you made. Run the loop in this order:
- Research: mine analytics, replays, and surveys for friction and drop-off.
- Hypothesize: write the bracketed hypothesis, pick one money metric.
- Prioritize: score with PIE or ICE, test top-down.
- Size and run: pre-register sample, duration, and decision rule.
- Decide: ship, kill, or iterate based on the pre-set rule.
- Validate: confirm the lift holds in revenue and retention.
- Document: log it so the next test starts smarter.
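One way to keep the archive consistent is to give every test the same record shape. The sketch below is an assumption about what such a log entry might contain, based on the fields named above; `ExperimentRecord`, `to_log_line`, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field

DECISIONS = {"ship", "kill", "iterate"}


@dataclass
class ExperimentRecord:
    """One archive entry: pre-registered fields first, outcome fields filled in later."""
    hypothesis: str
    primary_metric: str
    sample_size_per_variant: int
    weeks: int
    decision_rule: str
    screenshots: list = field(default_factory=list)
    result: str = ""
    decision: str = ""

    def close(self, result, decision):
        """Record the outcome; only the three pre-agreed decisions are allowed."""
        if decision not in DECISIONS:
            raise ValueError(f"decision must be one of {sorted(DECISIONS)}")
        self.result = result
        self.decision = decision


def to_log_line(record):
    """Serialize a record as one JSON line for an append-only archive."""
    return json.dumps(asdict(record), sort_keys=True)


# Invented example entry.
record = ExperimentRecord(
    hypothesis="Showing shipping cost earlier will lift completed checkouts for paid traffic",
    primary_metric="revenue per visitor",
    sample_size_per_variant=50_000,
    weeks=4,
    decision_rule="Ship if significant at the end date and the P&L checks hold",
)
record.close(result="No significant lift at end date", decision="kill")
log_line = to_log_line(record)
```

Writing the pre-registered fields at launch and the outcome fields only at the end date makes it harder to rewrite the decision rule after seeing the data.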
If you want this built and run on your numbers rather than guesses, that is our end-to-end offer, and for the roadmap and prioritization layer, our strategic advisory. When you are ready to put a profit-first testing system in place, talk to us.
Frequently asked questions
How long should I run an A/B test?
Run full weeks only, minimum two, so you cover weekday and weekend behavior and at least one payday cycle. Cap most tests at 6 to 8 weeks. The real number comes from a sample-size calculation done before launch: hit your pre-set sample and duration, then call it. Do not stop the day it looks significant.
What sample size do I actually need?
It depends on your baseline conversion rate and the smallest lift worth shipping (your MDE). Feed baseline, MDE, 95% confidence, and 80% power into a calculator. As a rule of thumb, reliable tests generally want on the order of tens of thousands of visitors and a few thousand conversions per variant. If your traffic cannot deliver that inside 8 weeks, test bolder changes that produce bigger, easier-to-detect swings.
Is Bayesian testing safe to peek at?
No. The common claim that Bayesian testing is immune to peeking is wrong: stopping early still biases the result. If you need to monitor continuously, use a platform with sequential or always-valid inference, which corrects the math for repeated looks. Otherwise, set an end date and keep your hands off the dashboard until you reach it.
My traffic is too low to test. What do I do?
Stop testing tiny changes. Low-traffic sites should test bold, high-leverage changes (a new offer, a restructured page, a shorter form), because larger effects need less traffic to detect. Combine related pages, extend duration within the 8-week cap, and lean harder on qualitative research to pick fewer, higher-confidence bets.
When can I call a winner and roll it out?
Only after two things are true: the test reached its pre-set sample size and end date with a significant result, and the lift survives a P&L check. Confirm revenue per visitor, average order value, refund and return rate, and retention all hold or improve. If conversion rose but revenue per visitor or repeat rate fell, it is not a winner.
