Block 3: Statistical Analysis
Bootstrapping is a resampling-based statistical method introduced by Bradley Efron in 1979. The core idea is elegantly simple: when you cannot derive the sampling distribution of a statistic analytically, you can approximate it by repeatedly resampling from the data you already have.
The statistical principle behind bootstrapping rests on the plug-in principle: we treat the observed sample as if it were the population and draw repeated samples from it. If our original sample is reasonably representative of the population, the distribution of a statistic across many bootstrap samples will approximate the true sampling distribution of that statistic.
Bootstrapping does not create new data or new information. It leverages the existing sample to estimate the variability of a statistic (e.g., mean, median, regression coefficient) without relying on parametric assumptions about the underlying population distribution.
The defining characteristic of bootstrapping is sampling with replacement. Given an original sample of size n, each bootstrap sample is drawn by randomly selecting n observations from the original sample, where each observation can be selected more than once (or not at all).
| Concept | Description |
|---|---|
| Original Sample | The dataset you collected (size n) |
| Bootstrap Sample | A sample of size n drawn with replacement from the original |
| Bootstrap Replicate | The statistic computed on one bootstrap sample |
| Bootstrap Distribution | The collection of all bootstrap replicates (typically B = 1,000 to 10,000) |
Because we sample with replacement, any individual bootstrap sample will typically contain some duplicated observations and omit others. Each observation is left out of a given bootstrap sample with probability (1 − 1/n)^n, which approaches e^(−1) ≈ 36.8% for large n, so on average about 63.2% of the original observations appear in any given bootstrap sample (the rest form the "out-of-bag" sample).
Without replacement, every sample of size n from a dataset of size n would be identical to the original — there would be no variability. Replacement is what introduces the randomness needed to estimate sampling variability.
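As a quick illustration (a minimal sketch; the array and the seed are arbitrary placeholders), drawing one bootstrap sample with NumPy and checking how many original observations it contains:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)  # an arbitrary "original sample" of size n = 100

# One bootstrap sample: n draws with replacement from the original sample
boot = rng.choice(data, size=data.size, replace=True)

# Fraction of original observations that appear at least once (about 63.2% on average)
in_bag = np.isin(data, boot).mean()
print(f"Fraction of original observations in this bootstrap sample: {in_bag:.3f}")
```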
Bootstrapping applies to both discrete and continuous data, but there are important differences:
| Aspect | Discrete Data | Continuous Data |
|---|---|---|
| Example | Count of defective items, survey responses (1-5) | Heights, temperatures, stock prices |
| Bootstrap behavior | Bootstrap samples have many tied values; the bootstrap distribution of the statistic tends to be "lumpy" | More variety in bootstrap samples; bootstrap distribution is smoother |
| Effective? | Yes, but may need more replicates for stable estimates | Typically works well with standard B values |
| Caution | Small discrete samples can produce unreliable bootstrap distributions | Extreme outliers can disproportionately affect results |
Bootstrapping is not always the best choice. Understanding when to use it versus parametric methods is critical:
Use bootstrapping when:
- The sampling distribution of the statistic is unknown or hard to derive analytically (e.g., the median, a ratio, or a percentile).
- The sample size is small or moderate and parametric assumptions about the population are doubtful.
- You want standard errors or confidence intervals without committing to a specific distributional form.
Prefer theoretical distributions when:
- The parametric assumptions (e.g., approximate normality) are well justified.
- The sample is large enough for results such as the Central Limit Theorem to apply.
- A closed-form result exists, which is faster and often more precise than resampling.
Bootstrapping does not fix a biased or unrepresentative sample. If your original sample poorly represents the population, bootstrap estimates will also be poor. The quality of bootstrap inference depends entirely on the quality of the original sample.
Here is how to generate bootstrap samples and analyze their distribution using NumPy:
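A minimal sketch (the variable names data and B, the seed, and the synthetic sample are placeholders; substitute your own observations):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # placeholder sample; use your own data here

B = 10_000                                      # number of bootstrap replicates
boot_means = np.empty(B)
for i in range(B):
    resample = rng.choice(data, size=data.size, replace=True)  # n draws with replacement
    boot_means[i] = resample.mean()                            # one bootstrap replicate

print("Sample mean:             ", data.mean())
print("Bootstrap SE of the mean:", boot_means.std(ddof=1))
```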
You can also bootstrap other statistics, such as the median:
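For example, reusing data, B, and rng from the sketch above:

```python
# Bootstrap distribution of the median, reusing data, B, and rng from above
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(B)
])
print("Bootstrap SE of the median:", boot_medians.std(ddof=1))
```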
To assess the reliability and validity of bootstrap results, you should examine:
- The number of replicates B — estimates should be stable when B is increased.
- The shape of the bootstrap distribution — a very lumpy or heavily skewed distribution warrants caution.
- The representativeness of the original sample, since bootstrap inference can only be as good as the data being resampled.
There are two common methods for constructing bootstrap confidence intervals:
Percentile method: simply take the desired percentiles of the bootstrap distribution. For a 95% CI, use the 2.5th and 97.5th percentiles.
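Continuing the sketch above (using the boot_means array):

```python
# 95% percentile CI: 2.5th and 97.5th percentiles of the bootstrap replicates
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")
```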
Standard error method: uses the bootstrap standard error with a normal approximation: CI = point estimate ± z* × SE_boot, where z* ≈ 1.96 for a 95% interval.
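A sketch of the same interval for the bootstrap distribution built earlier (1.96 is the standard normal quantile for 95% coverage):

```python
# 95% CI: point estimate +/- 1.96 * bootstrap standard error (normal approximation)
se_boot = boot_means.std(ddof=1)
ci_low = data.mean() - 1.96 * se_boot
ci_high = data.mean() + 1.96 * se_boot
print(f"95% SE-based CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")
```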
For the exam, focus on the percentile method. Remember: a 95% CI uses the 2.5th and 97.5th percentiles, and a 90% CI uses the 5th and 95th percentiles of the bootstrap distribution.
Linear regression models the relationship between a continuous dependent variable (y) and one or more independent variables (X). It is the foundation of predictive modeling and one of the most important topics for the exam.
| Assumption | Description | What Happens if Violated |
|---|---|---|
| Linearity | The relationship between X and y is linear | Model will systematically mis-predict; residuals show a pattern |
| Independence | Observations are independent of each other | Standard errors are incorrect; confidence intervals are unreliable |
| Homoscedasticity | The variance of residuals is constant across all levels of X | Coefficient estimates are still unbiased, but standard errors and p-values are wrong |
| Normality of Residuals | The residuals (errors) follow a normal distribution | Affects inference (p-values, CIs) but coefficients remain unbiased |
The four assumptions: Linearity, Independence, Homoscedasticity, Normality of residuals. Note that normality applies to the residuals, not the variables themselves.
The simple linear regression equation is: y = β0 + β1x + ε
| Symbol | Meaning |
|---|---|
| y | Dependent (response) variable — what we are predicting |
| x | Independent (predictor) variable — input feature |
| β0 (b) | Intercept — the predicted value of y when x = 0 |
| β1 (m) | Slope — the change in y for a one-unit increase in x |
| ε | Error term — captures variation not explained by the model |
The Ordinary Least Squares (OLS) method finds the values of β0 and β1 that minimize the sum of squared residuals (SSR): SSR = Σ (yi − ŷi)²
Where ŷi is the predicted value for observation i. The OLS solution gives us the "best fit" line that passes as close as possible to all data points.
R-squared measures the proportion of variance in y explained by the model: R² = 1 − SSR / SST, where SST = Σ (yi − ȳ)² is the total sum of squares. It ranges from 0 to 1, with higher values indicating a better fit.
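A minimal NumPy sketch of the closed-form OLS estimates and R-squared (the x and y data are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)                   # placeholder predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)     # placeholder response

# Closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

y_hat = b0 + b1 * x                       # predicted values
ss_res = np.sum((y - y_hat) ** 2)         # sum of squared residuals (SSR)
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares (SST)
r_squared = 1 - ss_res / ss_tot

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, R^2 = {r_squared:.3f}")
```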
While linear regression predicts a continuous outcome, logistic regression is used when the dependent variable is binary (0 or 1, Yes/No, True/False). It models the probability that an observation belongs to the positive class.
Common use cases:
- Spam detection (spam / not spam)
- Medical diagnosis (disease present / absent)
- Customer churn prediction (will churn / will stay)
- Credit default prediction (default / no default)
Logistic regression uses the sigmoid (logistic) function to map any real-valued input to a probability between 0 and 1: p = 1 / (1 + e^−(β0 + β1x))
The sigmoid curve has an S-shape: for very negative inputs, the probability approaches 0; for very positive inputs, it approaches 1; and at the midpoint (β0 + β1x = 0), the probability is exactly 0.5.
Logistic regression is a linear model in log-odds space. The log-odds (logit) is: ln(p / (1 − p)) = β0 + β1x
This transformation converts a probability (bounded 0-1) into a value that can range from -∞ to +∞, allowing a linear model to work.
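A small sketch showing that the sigmoid and the logit are inverses of each other (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) back to log-odds."""
    return np.log(p / (1.0 - p))

print(sigmoid(0.0))                  # 0.5 at the midpoint (beta0 + beta1*x = 0)
print(sigmoid(-6.0), sigmoid(6.0))   # close to 0 and close to 1 in the tails
print(logit(sigmoid(1.7)))           # ~1.7, since the logit inverts the sigmoid
```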
In logistic regression, coefficients are interpreted as changes in log-odds:
| Coefficient Value | Effect on Odds | Effect on Probability |
|---|---|---|
| β > 0 | Odds increase (OR > 1) | Probability increases |
| β = 0 | No effect (OR = 1) | No change |
| β < 0 | Odds decrease (OR < 1) | Probability decreases |
Remember that logistic regression coefficients are NOT directly interpretable as changes in probability. They represent changes in log-odds. To get odds ratios, exponentiate the coefficient: OR = e^β.
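A sketch of this interpretation with scikit-learn on simulated data (the feature, seed, and true coefficients are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 1))                        # placeholder feature
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * X[:, 0])))    # true probabilities via the sigmoid
y = rng.binomial(1, p)                               # simulated binary outcome

clf = LogisticRegression().fit(X, y)
beta = clf.coef_[0, 0]
print(f"coefficient (log-odds change per unit x): {beta:.3f}")
print(f"odds ratio exp(beta):                     {np.exp(beta):.3f}")
```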
| Criterion | Linear Regression | Logistic Regression |
|---|---|---|
| Dependent Variable | Continuous (e.g., price, temperature, score) | Binary / Categorical (e.g., yes/no, pass/fail) |
| Output | A predicted value (any real number) | A probability between 0 and 1 |
| Function | y = β0 + β1x | Sigmoid / logistic function |
| Loss Function | Sum of Squared Errors (SSE) | Log-Loss (Cross-Entropy) |
| Key Metric | R-squared, RMSE, MAE | Accuracy, Precision, Recall, AUC-ROC |
| Coefficient Meaning | Change in y per unit change in x | Change in log-odds per unit change in x |
The performance of a binary classifier such as logistic regression is summarized in a confusion matrix:

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
Accuracy = (TP + TN) / (TP + TN + FP + FN)
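For example, computing these counts and accuracy with scikit-learn (the labels below are placeholder values):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # placeholder predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # binary case: TN, FP, FN, TP
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("Accuracy:", accuracy_score(y_true, y_pred))          # (TP + TN) / total
```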
While scikit-learn is great for prediction, statsmodels provides detailed statistical summaries including p-values, confidence intervals, and diagnostic statistics:
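A minimal statsmodels sketch (the data are synthetic placeholders; the variable name ols_model matches the summary discussed below):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)                   # placeholder predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)     # placeholder response

X = sm.add_constant(x)                 # explicitly add the column of ones for the intercept
ols_model = sm.OLS(y, X).fit()         # ordinary least squares fit
print(ols_model.summary())             # coefficients, std errors, t, p-values, CIs, R-squared
```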
The ols_model.summary() output includes the fields described in the table below:
Unlike scikit-learn (which automatically includes an intercept), statsmodels requires you to explicitly add a constant column using sm.add_constant(X). Forgetting this will fit a model with no intercept, forcing the line through the origin.
| Output Field | What It Tells You |
|---|---|
| coef | The estimated coefficient value (β) |
| std err | Standard error of the coefficient — smaller is more precise |
| t | t-statistic = coef / std err — tests if the coefficient is significantly different from 0 |
| P>\|t\| | p-value — if < 0.05, the coefficient is statistically significant |
| [0.025, 0.975] | 95% confidence interval for the coefficient |
| R-squared | Proportion of variance explained (0 to 1) |
| Adj. R-squared | R-squared adjusted for the number of predictors |
| F-statistic | Tests the overall significance of the model |
Q: How do you compute a 95% bootstrap confidence interval in NumPy?
A: Use np.percentile(bootstrap_dist, 2.5) and np.percentile(bootstrap_dist, 97.5). A 90% CI would use the 5th and 95th percentiles.

Q: What happens if you forget sm.add_constant(X) before fitting?
A: statsmodels requires you to explicitly add a column of ones to the predictor matrix using sm.add_constant(X). Without this, the regression line is forced through the origin (intercept = 0).