P-values & Significance
When you fit a linear regression with one feature and find that its p-value is not significant, it means that there isn't enough statistical evidence to reject the null hypothesis that the feature's coefficient is zero. In other words, based on your data, you cannot confidently say that changes in this feature are associated with changes in the response variable.
Lack of Evidence for an Effect: The non-significant p-value suggests that the feature does not provide a reliable signal for predicting the outcome. The variation explained by this feature might just be due to random chance rather than a true underlying relationship.
Remember that a non-significant p-value doesn't prove there is no effect—it only suggests that you don't have sufficient evidence to claim an effect exists.
A small sample size can significantly impact your ability to detect an effect, even if one truly exists. When your sample is small:

- Reduced power: small samples have less ability to detect true effects, leading to a higher chance of Type II errors (false negatives). As sample size increases, statistical power increases; a power of 0.8 (80%) is typically considered adequate.
- Imprecise estimates: with fewer observations, coefficient estimates have larger standard errors. Because each data point carries more weight when data are scarce, the estimates can swing widely from sample to sample.
- Poor representation: small samples may not capture the full variability of the population, which can lead to unrepresentative and misleading results.
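The impact of sample size on power can be seen in a quick Monte Carlo simulation. This is a sketch, not a power calculator: the helper name `estimated_power`, the true slope of 0.5, the noise level, and the use of the normal cutoff 1.96 in place of the exact t critical value are all illustrative assumptions.

```python
import numpy as np

def estimated_power(n, true_slope=0.5, noise_sd=2.0, n_sims=2000, seed=0):
    """Monte Carlo estimate of the power to detect a true slope
    in simple linear regression at sample size n."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = true_slope * x + rng.normal(scale=noise_sd, size=n)
        # OLS slope and its standard error
        x_c = x - x.mean()
        beta = (x_c @ y) / (x_c @ x_c)
        resid = y - y.mean() - beta * x_c
        sigma2 = (resid @ resid) / (n - 2)       # residual variance estimate
        se = np.sqrt(sigma2 / (x_c @ x_c))
        # Normal approximation: reject H0 if |beta / se| > 1.96
        if abs(beta / se) > 1.96:
            rejections += 1
    return rejections / n_sims

small = estimated_power(n=20)
large = estimated_power(n=200)
print(small, large)  # power rises sharply with n
```

With these settings, the small sample rejects the null only a minority of the time even though the true slope is nonzero, while the larger sample detects it reliably.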
The standard error of regression coefficients decreases as sample size increases:
For a regression coefficient, the standard error is calculated as:
SE(β̂) = √(σ²/Sₓₓ)
where σ² is the residual (error) variance and Sₓₓ = Σ(xᵢ − x̄)² is the sum of squared deviations of x around its mean. With smaller samples, Sₓₓ is smaller, resulting in larger SE values.
As sample size increases, confidence intervals narrow, providing more precise estimates.
When possible, increase sample size to ensure better representation of the population and increase statistical power. This will reduce standard errors and provide more reliable coefficient estimates.
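A minimal numpy sketch of this relationship, using the SE formula above with an assumed residual standard deviation of σ = 2 (an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0  # assumed residual standard deviation (illustrative)

ses = []
for n in [25, 100, 400]:
    x = rng.normal(size=n)
    s_xx = ((x - x.mean()) ** 2).sum()   # grows roughly in proportion to n
    se = np.sqrt(sigma**2 / s_xx)        # SE(beta_hat) = sqrt(sigma^2 / S_xx)
    ses.append(se)
    print(n, round(se, 3))

# Quadrupling n roughly halves the standard error, since SE scales as 1/sqrt(n).
```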
Noise in data refers to random variability or measurement errors that can obscure the true relationship between variables. Even with a large sample size, high noise levels can make it difficult to detect significant effects.
Inconsistent data collection, limited measurement precision, and random fluctuations all introduce variability.
High noise inflates the residual variance (σ²), which directly increases the standard errors of coefficients.
When noise levels are high relative to the signal (true effect), the relationship becomes harder to detect.
Noise can be introduced through measurement errors, data collection inconsistencies, or random fluctuations. This noise adds extra variability to the response variable that is not explained by the feature.
In regression, the residual variance (or error term) captures the variability in the response that is not explained by the predictors. High noise levels inflate this residual variance.
The effectiveness of any predictor depends on the signal-to-noise ratio. When the noise level is high, the signal (i.e., the true impact of the feature) becomes harder to detect.
A non-significant result with a low signal-to-noise ratio doesn't necessarily mean there's no effect — it might simply indicate that the current data and methods aren't sufficient to reliably detect it.
In the context of linear regression, the error term ε captures the deviation of each observation from the true regression line; in practice it is estimated by the residual, the difference between the observed value yi and the predicted value ŷi (i.e., εi = yi − ŷi). The variance of the error term, usually denoted as σ², measures how much these errors (or residuals) vary around their mean (which is assumed to be zero).
Definition: The variance of the error term is defined as:
σ² = Var(ε) = E[ε²]
Since we assume that E[ε] = 0, the variance is simply the expected value of the squared deviations of ε from zero.
The error terms are assumed to have constant variance σ² across all levels of the independent variable(s), known as the homoscedasticity assumption. This means that the spread of the residuals does not change with different values of x.
For many inferential statistics, it's assumed that ε ~ N(0, σ²), which means the errors are normally distributed with mean zero and variance σ².
When the variance of the error term (σ²) is high relative to the effect size of your predictor, it becomes more difficult to detect a significant relationship. This high variance increases the standard errors of your coefficient estimates, leading to larger p-values and potentially non-significant results.
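The effect of noise on significance can be sketched by fitting the same true relationship under two noise levels. The helper name `slope_p_value`, the parameter values, and the normal approximation to the t-test are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt

def slope_p_value(noise_sd, n=50, true_slope=1.0, seed=1):
    """Fit y = slope * x + noise by OLS and return a normal-approximation
    two-sided p-value for H0: slope = 0 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = true_slope * x + rng.normal(scale=noise_sd, size=n)
    x_c = x - x.mean()
    beta = (x_c @ y) / (x_c @ x_c)
    resid = y - y.mean() - beta * x_c
    sigma2 = (resid @ resid) / (n - 2)            # residual variance estimate
    z = beta / sqrt(sigma2 / (x_c @ x_c))         # t = beta_hat / SE(beta_hat)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)

p_low = slope_p_value(noise_sd=0.5)    # low noise
p_high = slope_p_value(noise_sd=10.0)  # high noise, same true slope
print(p_low, p_high)
```

The true slope is identical in both fits; only the residual variance changes, yet the high-noise fit typically fails to reach significance.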
Though related, standard deviation and standard error measure different aspects of variability in your data and serve different purposes in statistical analysis.
What It Measures: The standard deviation quantifies the amount of variation or dispersion in a set of individual data points. It tells you, on average, how far each data point is from the mean of the data.
Calculation (for a sample):
s = √((1/(n − 1)) Σ (xi − x̄)²)
where x̄ is the sample mean and n is the number of observations.
Usage: reported as a descriptive statistic to convey how spread out the individual observations are (for example, "mean 100, SD 15").
What It Measures: The standard error estimates the variability of a sample statistic (like the mean or a regression coefficient) from sample to sample. It reflects how much the estimate is expected to vary if you repeated the study multiple times.
Calculation (for the mean):
SE(x̄) = s/√n
where s is the sample standard deviation and n is the sample size.
Usage: used for inference, such as constructing confidence intervals (for example, x̄ ± 1.96 × SE for an approximate 95% CI) and computing test statistics.
| Aspect | Standard Deviation | Standard Error |
|---|---|---|
| What It Describes | Spread of individual data points | Precision of an estimated statistic |
| Sample Size Dependence | Generally does not depend on sample size | Directly depends on sample size (SE ∝ 1/√n) |
| Primary Use | Descriptive statistics | Inferential statistics |
| Purpose | Describes variability in the dataset | Quantifies uncertainty in sample statistics |
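The distinction can be shown in a few lines of numpy; the simulated data, with a mean of 100 and an SD of 15, are an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=100, scale=15, size=400)  # hypothetical measurements

sd = data.std(ddof=1)          # spread of individual points (near 15)
se = sd / np.sqrt(len(data))   # precision of the sample mean: SE = s / sqrt(n)
print(round(sd, 2), round(se, 2))

# Collecting four times as much data leaves the SD roughly unchanged
# but halves the SE: only the precision of the estimate improves.
```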
In regression analysis, we are primarily concerned with the standard errors of our coefficient estimates, not the standard deviation of the data itself. The standard error tells us how precise our estimates are and directly influences the width of confidence intervals, the magnitude of the t-statistic, and ultimately the p-value.
When your sample size is small, the standard error of your regression coefficient will be larger. This larger standard error leads to a smaller t-statistic (t = β̂ / SE(β̂)), which can result in a non-significant p-value even when there is a true relationship between your predictor and the outcome.
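A short sketch of this effect: fit the same simulated relationship using a small subset and the full sample, then compare t-statistics. The helper name `slope_t_stat`, the true slope of 0.5, and the noise level are illustrative assumptions.

```python
import numpy as np

def slope_t_stat(x, y):
    """OLS slope and its t-statistic t = beta_hat / SE(beta_hat)."""
    n = len(x)
    x_c = x - x.mean()
    beta = (x_c @ y) / (x_c @ x_c)
    resid = y - y.mean() - beta * x_c
    sigma2 = (resid @ resid) / (n - 2)
    return beta, beta / np.sqrt(sigma2 / (x_c @ x_c))

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=2.0, size=500)  # true slope is 0.5

_, t_small = slope_t_stat(x[:15], y[:15])   # small sample
_, t_full = slope_t_stat(x, y)              # full sample
print(round(t_small, 2), round(t_full, 2))
```

The relationship is real in both cases, but the small-sample t-statistic is typically much closer to zero, so the same true effect can easily come out non-significant.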
A/B testing is a common application of statistical inference to compare two variants (e.g., a control group and a treatment group) to determine if a change has a statistically significant effect. The standard error plays a crucial role in these tests, especially when measuring metrics like conversion rates.
p = Conversions / Total Visitors
Example: 100/1000 = 0.10 (10%)
This step establishes your baseline metrics. Accurate conversion rate estimation is critical as it forms the foundation for all subsequent statistical calculations and determines what changes are worth implementing.
SE(p) = √(p(1−p)/n)
Example: √(0.10×0.90/1000) ≈ 0.0095
This quantifies the precision of your conversion rate estimate. The standard error represents how much your estimate would vary if you repeated the experiment multiple times. Smaller standard errors indicate more reliable estimates, which is essential for making confident decisions.
CI = p ± 1.96 × SE(p)
Example: 0.10 ± 1.96×0.0095 ≈ (0.081, 0.119)
Confidence intervals provide a range where the true conversion rate likely falls. A 95% CI means we can be 95% confident that the actual conversion rate is within this range. This helps assess uncertainty in your estimates before making comparisons between variants.
SE(Δp) = √(SE(pA)² + SE(pB)²)
Example: √(0.0095² + 0.0103²) ≈ 0.014
This step calculates the standard error of the difference between conversion rates. Since we're comparing two independent groups, we need to account for the combined uncertainty from both. This is crucial because it tells us how precise our estimate of the difference is.
z = (pB − pA) / SE(Δp)
Example: (0.12 - 0.10) / 0.014 ≈ 1.43
The z-statistic measures how many standard errors the observed difference is from zero (our null hypothesis). This standardizes the difference in a way that accounts for the inherent variability, allowing us to determine if the observed difference is statistically meaningful or likely due to chance.
Compare z to critical value (e.g., 1.96)
If |z| > 1.96, result is significant at 5% level
This final step determines whether your observed difference is statistically significant. The critical value of 1.96 corresponds to a 5% significance level, meaning there's only a 5% chance of seeing a difference this large or larger if there was actually no true effect. This helps prevent making changes based on random variation.
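The steps above can be collected into one short function. `ab_test` is a hypothetical helper; it uses the unpooled standard error of the difference, as in the formulas above (a pooled-SE variant is also common for the hypothesis test), and the example numbers are the 100/1000 vs. 120/1000 conversions used throughout.

```python
from math import sqrt

def ab_test(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test following the steps above (unpooled SE)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se_a = sqrt(p_a * (1 - p_a) / n_a)          # SE of each conversion rate
    se_b = sqrt(p_b * (1 - p_b) / n_b)
    se_diff = sqrt(se_a**2 + se_b**2)           # SE of the difference
    z = (p_b - p_a) / se_diff                   # standardized difference
    ci = (p_b - p_a - z_crit * se_diff,         # 95% CI for the difference
          p_b - p_a + z_crit * se_diff)
    return z, ci, abs(z) > z_crit

z, ci, significant = ab_test(100, 1000, 120, 1000)
print(round(z, 2), significant)  # z ≈ 1.43: not significant at the 5% level
```

Note that the confidence interval for the difference includes 0 here, which is the interval-based view of the same non-significant result.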
If the confidence intervals for the two groups overlap substantially, or if the CI for the difference includes 0, it indicates uncertainty about whether the change truly affects conversion rates.
A non-significant result (like a z-value lower than the critical threshold) means that based on your current data, you cannot confidently claim that the new feature has an impact.
Notice that the standard error is inversely related to the square root of the sample size. With a larger sample, the standard error would be smaller, potentially leading to a more precise estimate and a more sensitive test for detecting differences.
When running an A/B test, a non-significant result could mean that there is truly no effect, that your sample size was too small to detect it, or that noise in the data obscured the signal.
By carefully calculating and interpreting the standard error, you can make more informed decisions about whether launching a new feature will significantly improve conversion rates.
Determining the required sample size in advance is crucial for designing studies that have adequate power to detect effects of interest. This is particularly important in A/B testing and regression analysis.
Significance level (α): the probability of a Type I error (false positive), often set at 0.05.
Critical z-value: For α = 0.05, zα/2 ≈ 1.96
Statistical power (1 − β): the probability of correctly detecting a true effect, commonly set at 0.8 (80%).
Critical z-value: For power = 0.8, zβ ≈ 0.84
Baseline rate: for A/B tests, this is the conversion rate in your control group. For regression, it relates to the expected variance.
Example: Historical conversion rate of 10%
Minimum detectable effect: the smallest improvement that you consider meaningful and wish to detect.
Example: 2 percentage point increase (10% → 12%)
n = (zα/2 √(2p̄(1−p̄)) + zβ √(pA(1−pA) + pB(1−pB)))² / (pB − pA)²

where p̄ = (pA + pB)/2 is the average of the two proportions, zα/2 and zβ are the critical values for the chosen significance level and power, and pA and pB are the baseline and target conversion rates.
Derived from the normal distribution, they represent how many standard deviations you need to capture the desired probability. They help set the threshold for significance and power.
The terms under the square roots capture the variance of the proportions. Larger variance requires larger sample sizes to detect effects reliably.
The denominator (pB − pA)² shows that smaller effects require a larger sample size to detect, since the "signal" is smaller compared to the inherent variability.
Note: Exact numbers depend on desired power, significance level, and baseline rates.
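The formula translates directly into code. The helper name `required_n_per_group` is an assumption; the z-values and the 10% → 12% example follow the ingredients listed above.

```python
from math import sqrt, ceil

def required_n_per_group(p_a, p_b, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size from the two-proportion formula above,
    with p_bar = (p_a + p_b) / 2 as the average proportion."""
    p_bar = (p_a + p_b) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(num / (p_b - p_a) ** 2)

# Detecting a lift from 10% to 12% at alpha = 0.05 with 80% power
n = required_n_per_group(0.10, 0.12)
print(n)  # about 3,800 visitors per group
```

Note how quickly the requirement grows for smaller effects: the squared difference in the denominator means halving the detectable lift roughly quadruples the required sample.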
If your study was not properly powered (i.e., sample size was too small for the effect size), you might get non-significant results even when there is a true effect. This is why determining the appropriate sample size before conducting a study is crucial.
When you encounter a non-significant p-value in your regression analysis, it's important to diagnose whether this is due to insufficient sample size or high noise in the data. Here are step-by-step approaches for investigating each possibility.