
Randomization units as the foundation of reliable product experiments




The experimentation feature from LaunchDarkly enables teams to validate the impact of features on end-users, ideally using the results to create better applications. Many small factors of an experiment can have a significant effect on the validity of the results; randomization units are one such factor. They can determine how participants interact with an experiment and how reliably teams can measure the impact of the changes being implemented. 

The choice of randomization unit in an experiment is closely tied to the set of metrics you want to measure: once you have set your randomization unit, it determines the universe of metrics you can use, and vice versa. So it’s important to at least understand your randomization unit when setting up an experiment, and ideally to set it before considering metrics.

What are randomization units?

The randomization unit of an experiment is the level at which we assign experimental groups (i.e., variants), such as “distinct users” or “distinct sessions.” Often, we refer to the randomization unit by its unique identifier, such as user_id or session_id.
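To make this concrete, here’s a minimal Python sketch of deterministic hash-based bucketing. The variant names and experiment key are made up for illustration, and this is not LaunchDarkly’s actual bucketing algorithm; the point is just that whatever identifier you hash is your randomization unit, and the same identifier always lands in the same bucket:

```python
import hashlib

def assign_variant(unit_id: str, experiment_key: str,
                   variants=("control", "treatment")):
    """Deterministically bucket a randomization unit into a variant.

    Hashing the experiment key together with the unit's identifier
    (user_id, session_id, etc.) gives each unit a stable assignment
    for the lifetime of the experiment.
    """
    digest = hashlib.sha256(f"{experiment_key}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same unit identifier always lands in the same bucket.
print(assign_variant("user_abc", "trial-badge-test"))
```

Swapping user_id for session_id in the call changes nothing about the mechanics, but everything about the level at which the experience stays consistent.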

Randomization units, aside from affecting what experiences users in the experiments will see, have a profound impact on the metrics you can and should attach to an experiment:

  1. For statistical validity out of the box, the randomization unit must match the metric analysis unit. If a metric is analyzed at a different granularity than the randomization, the mismatch can make statistical summaries invalid without an additional adjustment.
  2. The randomization unit determines the possible scope of measurement for your metrics. The randomization unit establishes the boundary within which consistency is guaranteed in an experiment, and therefore the window inside which we can attribute movements of our metric to differences in variations experienced by the unit.

Statistical validity and units

Statistical analyses generally assume that observations are independent and identically distributed (IID). Among other things, this assumption allows us to accurately quantify the amount of noise or variability in our data. Clustering (due to a mismatch between randomization and analysis units) can break this assumption, causing us to believe that the level of noise in our data is:

  • lower than it actually is (which can lead to more false positives)
  • higher than it actually is (which can lead to experiments that take too long or never achieve significance)

As an example, suppose you’re analyzing a conversion rate experiment randomized on user_id. The independence assumption means that knowing whether user_abc converted doesn’t convey any information about whether user_def converted.

Now, suppose you randomized your experiment by user_id, but you want to analyze a session-level metric, such as whether each user converted in each particular browser session. Since multiple sessions may belong to the same user, it’s plausible that groups of sessions belonging to the same user will have similar probabilities of conversion.

For a real-world example, imagine I’m running an experiment showing different versions of an ad promotion and randomizing it by user, but I want to track the conversion rate per session.

Suppose further that we know Jimmy never clicks on ad promotions, and Diane always does. Then, if I know that session_123 and session_456 both belong to Diane or both belong to Jimmy, knowing the outcome of session_123 (i.e., whether or not it converted) tells me something about the outcome of session_456. That is, the outcomes of session_123 and session_456 are correlated. Any session-level analysis of this user-randomized experiment will therefore produce statistical summaries that may be inaccurate (exactly how inaccurate depends on the scenario).
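A quick simulation makes the problem visible. This is an illustrative sketch with made-up numbers (half the users behave like Jimmy at a 5% propensity, half like Diane at 95%): it compares the standard error a naive IID session-level analysis would report against the actual run-to-run variability of the session conversion rate.

```python
import random
import statistics

random.seed(7)

def one_experiment_arm(n_users=500, sessions_per_user=5):
    """Simulate session outcomes where every session inherits its user's
    sticky conversion propensity (think Jimmy at 5%, Diane at 95%)."""
    outcomes = []
    for _ in range(n_users):
        p_user = random.choice([0.05, 0.95])
        outcomes += [1 if random.random() < p_user else 0
                     for _ in range(sessions_per_user)]
    return outcomes

rates, naive_ses = [], []
for _ in range(200):
    outcomes = one_experiment_arm()
    p, n = sum(outcomes) / len(outcomes), len(outcomes)
    rates.append(p)
    naive_ses.append((p * (1 - p) / n) ** 0.5)  # SE if sessions were IID

naive_se = statistics.mean(naive_ses)  # what a naive session-level analysis reports
true_se = statistics.stdev(rates)      # actual run-to-run variability
print(f"naive IID SE: {naive_se:.4f}, true SE: {true_se:.4f}")
```

With these assumed propensities, the true standard error comes out roughly double the naive one, which is exactly the “noise looks lower than it actually is” failure mode described above.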

It’s important to note that it’s not impossible to analyze experiments this way. There are methods to correct for clustering when you know the units will not match. Still, they complicate the analysis and generally shouldn’t be used if you can simply match your metrics to your randomization unit.
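One such correction is the cluster bootstrap: resample whole randomization units (users) rather than individual sessions, so the within-user correlation is preserved in every replicate. Here’s a sketch on made-up data, with the same sticky-habit assumption as before:

```python
import random
import statistics

random.seed(42)

# Hypothetical session outcomes keyed by user (the randomization unit):
# 100 users with sticky conversion habits, 5 sessions each.
sessions_by_user = {
    f"user_{i}": [1 if random.random() < p else 0 for _ in range(5)]
    for i, p in enumerate([0.05, 0.95] * 50)
}

def cluster_bootstrap_se(sessions_by_user, n_boot=500):
    """Resample whole users (with replacement), so within-user correlation
    survives into every bootstrap replicate."""
    users = list(sessions_by_user)
    rates = []
    for _ in range(n_boot):
        sample = [random.choice(users) for _ in users]
        flat = [o for u in sample for o in sessions_by_user[u]]
        rates.append(sum(flat) / len(flat))
    return statistics.stdev(rates)

flat = [o for outs in sessions_by_user.values() for o in outs]
p = sum(flat) / len(flat)
naive_se = (p * (1 - p) / len(flat)) ** 0.5       # pretends sessions are IID
boot_se = cluster_bootstrap_se(sessions_by_user)  # respects user clustering
print(f"naive SE: {naive_se:.4f}, cluster-bootstrap SE: {boot_se:.4f}")
```

The cluster-aware estimate comes out noticeably larger than the naive one, which is the honest price of analyzing session-level metrics on a user-randomized experiment.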

Measurement scope

Randomization units also determine the potential scope of measurements for your metrics. Because the experience is held constant for a given randomization unit, anything that happens within that unit can be consistently ascribed to a specific experience. This principle is probably best explained through an example.

Free trial conversion: randomizing by session and by user

Suppose you’re testing a new trial page layout with a different design for the badge that reminds users they’re in a trial. When clicked, the badge funnels users into a flow where they can convert to a paid subscription. Consider two choices for how you might randomize this experiment: by session and by user.

Randomizing by session means that each time a new browser session is opened, a different variant of the trial page might appear for a single user. Randomizing by user means that this user will always see one particular variant, regardless of how many sessions they trigger during the experiment.
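Under hash bucketing like the illustrative sketch below (not LaunchDarkly’s actual algorithm; the layout names and IDs are made up), the difference comes down to which identifier you feed into the hash:

```python
import hashlib

def variant_for(bucket_key: str) -> str:
    """Illustrative two-variant hash bucketing."""
    h = int(hashlib.sha256(bucket_key.encode()).hexdigest(), 16)
    return ("old_layout", "new_layout")[h % 2]

user = "user_diane"
sessions = ["session_001", "session_002", "session_003"]

# Session-randomized: the bucket key is the session ID, so the same
# user may see different layouts across sessions.
per_session = [variant_for(s) for s in sessions]

# User-randomized: the bucket key is the user ID, so every session
# from this user sees the same layout.
per_user = [variant_for(user) for _ in sessions]
print(per_session, per_user)
```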

As explained previously, the analysis unit of an experiment should also match the randomization unit for the experiment. So in the session-randomized experiment, you could track metrics at the session level, such as:

  • (1a) Percent of sessions that resulted in a click on the banner (i.e., click-through rate, CTR)
  • (1b) Percent of sessions that resulted in a conversion to paid plan

In a user-randomized experiment, on the other hand, you could track metrics like:

  • (2a) Percent of users who clicked on the banner
  • (2b) Percent of users who ultimately converted to a paid plan
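Given a small hypothetical event log (the user and session names are invented for illustration), all four metrics fall out of the same data; only the denominator changes:

```python
# Hypothetical event log: (user_id, session_id, clicked, converted).
events = [
    ("diane", "s1", True,  False),
    ("diane", "s2", True,  True),
    ("diane", "s3", False, False),
    ("jimmy", "s4", False, False),
    ("jimmy", "s5", False, False),
    ("alex",  "s6", True,  False),
]

# Session-denominated metrics (session-randomized experiment):
n_sessions = len(events)
ctr_1a = sum(c for _, _, c, _ in events) / n_sessions   # (1a): 3 of 6 sessions clicked
conv_1b = sum(v for _, _, _, v in events) / n_sessions  # (1b): 1 of 6 sessions converted

# User-denominated metrics (user-randomized experiment):
users = {u for u, _, _, _ in events}
ctr_2a = len({u for u, _, c, _ in events if c}) / len(users)   # (2a): 2 of 3 users clicked
conv_2b = len({u for u, _, _, v in events if v}) / len(users)  # (2b): 1 of 3 users converted
print(ctr_1a, conv_1b, ctr_2a, conv_2b)
```

Note how Diane’s three sessions count three times in the session-denominated metrics but only once in the user-denominated ones; that’s the interpretation gap the next two subsections walk through.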

In both cases, you can construct metrics that track clicks and paid conversions. But the interpretation is different.

Click metrics

Look at the click metrics (1a) and (2a). In the session-denominated metric (1a), the scope is the experience of the user as they view the variations within that single browser session. The focus is on whether the visual content motivates them to click immediately. In the user-denominated metric (2a), the focus is instead on the potentially repeated exposures of a single user, allowing you to consider things like learning effects as the user witnesses the banner over two, three, or more visits to the page. Either one is valid depending on where the focus is.

Conversion metrics

A similar story applies to the paid conversion metrics (1b) and (2b), although in this case there’s a clearer argument that one makes more sense. When thinking about paid conversions, we ultimately care about the number of users who convert, not whether any single session motivated them to do so or how many sessions it took. So even though you could measure whether a variation causes individual sessions to convert immediately, it makes more sense to measure at the user level, where the metric’s denominator leaves room for multiple sessions to “build up” and eventually convince the user to convert.

Think ahead for the best experiment impact

Randomization units are extremely important to consider carefully. It’s often not obvious how they can impact anything aside from what users can expect to see. But they should be one of the first things you design when setting up an experiment, because of their potential impact on what you can and should measure downstream.
