reading results update ready for review #37790
| ### Confidence intervals | ||
| The confidence interval is a range of lift values that the observed data supports. The true lift could fall outside this range, but values inside the interval are more consistent with what the experiment measured. |
I'd prefer to say "that are consistent with the observed data." The way frequentist hypothesis testing works is that it doesn't directly "support" specific values/null hypotheses, but rather it "rules out" hypotheses/values that (if true) would make the observed data very unlikely. I think it's technically more precise to say "not inconsistent with the observed data" but it reads like a double negative so I'm willing to compromise there haha.
| - If the **entire interval is above zero**, the result is statistically significant in the positive direction. The improvement to the metric is unlikely to be attributable to random variation. |
Stats snobs would push back on these because they sound like Bayesian interpretations: the wording sounds like it's referring to P(lift is real | data) rather than P(observed lift at least this large | true effect = 0). I'd say something like "An improvement at least this large is unlikely to occur if there is no true effect" or even just "The observed lift is inconsistent with no true effect."
| - If the **entire interval is below zero**, the result is statistically significant in the negative direction. The treatment likely reduced the metric. | ||
| - If the **interval crosses zero**, the result is not statistically significant. The observed lift may have occurred by chance. |
I'd say "the result is consistent with a true effect of zero" instead of "The observed lift may have occurred by chance."
"Lift may have occurred by chance" sounds like it's making a statement about P(H0), not P(Data | H0).
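For reference, the three interval cases in the bullets above can be sketched as a small helper. This is only an illustration of the reading rule, not Datadog's implementation: the function name `describe_interval` is hypothetical, and the returned wording borrows the frequentist phrasing suggested in this review.

```python
def describe_interval(lower: float, upper: float) -> str:
    """Classify a lift confidence interval by where it sits relative to zero.

    Hypothetical helper for illustration; bounds are lift values
    (for example, 0.02 for +2%).
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower > 0:
        # Entire interval above zero.
        return "significant positive: the observed lift is inconsistent with no true effect"
    if upper < 0:
        # Entire interval below zero.
        return "significant negative: the observed lift is inconsistent with no true effect"
    # Interval contains zero (including the case where it only touches zero).
    return "not significant: the result is consistent with a true effect of zero"
```

Treating an interval that merely touches zero as not significant is an assumption made here for simplicity.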
| For each metric, the {{< ui >}}Global lift{{< /ui >}} tab displays: | ||
| - **Control and treatment values**: The average per-subject metric value in each variant—the same values shown on the main scorecard tab. | ||
| - **Coverage**: The estimated proportion of your global metric total that would come from the eligible population under a control-only rollout. |
I think the "control-only rollout" description sounds a little jargony and obscures the nice intuition of coverage. Maybe something like:
"The estimated proportion of your global metric total associated with the experiment's eligible population (excluding the effect of the experiment)"
I really want people to grok coverage as e.g., the % of revenue exposed to the change being tested rather than some technical causal inference concept. I think the "control only" correction is an afterthought/implementation detail
brett0000FF left a comment
Thanks! The only blocking feedback from Docs is that we need to add back the deleted image.
Can you please add this file back? We have a job that deletes outdated images. We have to leave this here temporarily so that the image doesn't break on non-English pages. Thanks!
| <div class="alert alert-info"><strong>How metrics are calculated</strong><br><br> | ||
| Datadog analyzes experiments at the <strong>subject</strong> level—the unit you configured when you set up the experiment, typically a user. Datadog computes a metric value for each enrolled subject (for example, revenue per user or whether the user completed a signup). These per-subject values form a distribution for each variant. Datadog's statistical engine then compares these distributions between control and treatment.<br><br> | ||
| <strong>Relative lift</strong> measures how much the treatment shifted the average per-subject metric value compared to the control:<br><br> | ||
| <pre><code>Relative lift = (Treatment − Control) / Control</code></pre> | ||
| A relative lift of 10% means the treatment group's average per-subject value is 10% higher than the control group's average. Negative lift means the treatment performed worse on average. | ||
| </div> |
This reads a bit too big to fit into a Note callout. I'd recommend pulling it back out into a section, or if you want to deemphasize it, you could put it inside a collapse-content shortcode.
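If the relative lift explanation moves into its own section, a short worked example might help readers check the formula. A minimal sketch, assuming the control and treatment inputs are the average per-subject metric values described in the callout; the function name `relative_lift` is hypothetical and is not part of any Datadog API.

```python
def relative_lift(control_mean: float, treatment_mean: float) -> float:
    """Relative lift = (Treatment - Control) / Control.

    Inputs are assumed to be average per-subject metric values.
    """
    if control_mean == 0:
        # The ratio is undefined when the control average is zero.
        raise ValueError("control mean must be non-zero")
    return (treatment_mean - control_mean) / control_mean


# Treatment averages 11.0 per subject against a control average of 10.0.
lift = relative_lift(control_mean=10.0, treatment_mean=11.0)
print(f"{lift:.0%}")
```

With these inputs the lift is 10%, matching the callout's example; swapping in a treatment average of 9.0 gives a negative lift.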
AI assistance
Gave Claude a draft, edited output manually