Topic 3.2: Inferential Statistics

Block 3: Statistical Analysis

3.2.1 Bootstrapping for Sampling Distributions

Theoretical Basis and Statistical Principles

Bootstrapping is a resampling-based statistical method introduced by Bradley Efron in 1979. The core idea is elegantly simple: when you cannot derive the sampling distribution of a statistic analytically, you can approximate it by repeatedly resampling from the data you already have.

The statistical principle behind bootstrapping rests on the plug-in principle: we treat the observed sample as if it were the population and draw repeated samples from it. If our original sample is reasonably representative of the population, the distribution of a statistic across many bootstrap samples will approximate the true sampling distribution of that statistic.

Key Insight for the Exam

Bootstrapping does not create new data or new information. It leverages the existing sample to estimate the variability of a statistic (e.g., mean, median, regression coefficient) without relying on parametric assumptions about the underlying population distribution.

Key Principles

  • Empirical Distribution Function (EDF): The bootstrap treats the sample as a stand-in for the population. Each observed value has an equal probability of being selected: 1/n.
  • Law of Large Numbers: As the number of bootstrap replicates (B) increases, the bootstrap distribution converges to the true sampling distribution of the statistic.
  • Non-parametric inference: Bootstrapping makes no assumptions about the shape of the underlying distribution (normal, uniform, etc.).

Resampling with Replacement

The defining characteristic of bootstrapping is sampling with replacement. Given an original sample of size n, each bootstrap sample is drawn by randomly selecting n observations from the original sample, where each observation can be selected more than once (or not at all).

Concept | Description
Original Sample | The dataset you collected (size n)
Bootstrap Sample | A sample of size n drawn with replacement from the original
Bootstrap Replicate | The statistic computed on one bootstrap sample
Bootstrap Distribution | The collection of all bootstrap replicates (typically B = 1,000 to 10,000)

Because we sample with replacement, any individual bootstrap sample will typically contain some duplicated observations and omit some original observations. On average, about 63.2% of the original observations appear in any given bootstrap sample (the rest are left out, forming the "out-of-bag" sample).

Why With Replacement?

Without replacement, every sample of size n from a dataset of size n would be identical to the original — there would be no variability. Replacement is what introduces the randomness needed to estimate sampling variability.
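A quick simulation makes this concrete (the sample size and seed here are arbitrary): counting how many distinct original observations survive into each bootstrap sample recovers the roughly 63.2% figure mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # size of a hypothetical original sample

# Each bootstrap draw picks n indices with replacement; count how many
# distinct original observations appear in each bootstrap sample
fractions = [
    len(np.unique(rng.integers(0, n, size=n))) / n
    for _ in range(200)
]

# The average fraction should be close to 1 - 1/e ≈ 0.632
print(f"Mean fraction of distinct observations: {np.mean(fractions):.3f}")
```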

Discrete vs Continuous Data in Bootstrapping

Bootstrapping applies to both discrete and continuous data, but there are important differences:

Aspect | Discrete Data | Continuous Data
Example | Count of defective items, survey responses (1-5) | Heights, temperatures, stock prices
Bootstrap behavior | Bootstrap samples have many tied values; the bootstrap distribution of the statistic tends to be "lumpy" | More variety in bootstrap samples; bootstrap distribution is smoother
Effective? | Yes, but may need more replicates for stable estimates | Typically works well with standard B values
Caution | Small discrete samples can produce unreliable bootstrap distributions | Extreme outliers can disproportionately affect results

When Bootstrapping Is Effective vs Theoretical Distributions

Bootstrapping is not always the best choice. Understanding when to use it versus parametric methods is critical:

Use bootstrapping when:

  • The theoretical sampling distribution is unknown or difficult to derive (e.g., median, trimmed mean, ratio statistics)
  • The data is not normally distributed and sample sizes are small
  • You want to estimate confidence intervals for complex statistics
  • You are working with non-standard estimators (e.g., a custom metric)

Prefer theoretical distributions when:

  • Assumptions are met (e.g., large sample, normality) — the Central Limit Theorem justifies accurate parametric approximations
  • Computational cost matters and parametric formulas are available
  • The sample size is very small (n < 10) — bootstrap can be unreliable

Exam Warning

Bootstrapping does not fix a biased or unrepresentative sample. If your original sample poorly represents the population, bootstrap estimates will also be poor. The quality of bootstrap inference depends entirely on the quality of the original sample.

Generating Bootstrap Samples in Python

Here is how to generate bootstrap samples and analyze their distribution using NumPy:

import numpy as np

# Original sample data
data = np.array([12, 15, 14, 10, 18, 22, 19, 16, 13, 17])

# Set random seed for reproducibility
np.random.seed(42)

# Number of bootstrap replicates
n_bootstrap = 10000

# Generate bootstrap replicates of the mean
bootstrap_means = np.array([
    np.mean(np.random.choice(data, size=len(data), replace=True))
    for _ in range(n_bootstrap)
])

# Analyze the bootstrap distribution
print(f"Original sample mean: {np.mean(data):.2f}")
print(f"Bootstrap mean of means: {np.mean(bootstrap_means):.2f}")
print(f"Bootstrap std error: {np.std(bootstrap_means, ddof=1):.2f}")

You can also bootstrap other statistics, such as the median:

# Bootstrap replicates of the median
bootstrap_medians = np.array([
    np.median(np.random.choice(data, size=len(data), replace=True))
    for _ in range(n_bootstrap)
])

print(f"Original sample median: {np.median(data):.2f}")
print(f"Bootstrap mean of medians: {np.mean(bootstrap_medians):.2f}")
print(f"Bootstrap std error of median: {np.std(bootstrap_medians, ddof=1):.2f}")

Analyzing Bootstrap Distributions

To assess the reliability and validity of bootstrap results, you should examine:

  • Shape: A bell-shaped bootstrap distribution indicates stable estimates
  • Center: The mean of the bootstrap distribution should be close to the original statistic
  • Spread: The standard deviation of bootstrap replicates is the bootstrap standard error
  • Bias: The difference between the bootstrap mean and the original statistic
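As a sketch, these checks can be computed directly from the bootstrap replicates (reusing the same sample data as the NumPy example above, with a fresh seed):

```python
import numpy as np

# Same sample as in the earlier example
data = np.array([12, 15, 14, 10, 18, 22, 19, 16, 13, 17])
rng = np.random.default_rng(42)

# Bootstrap replicates of the mean
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(10_000)
])

stat = data.mean()               # original statistic
center = boot_means.mean()       # center of the bootstrap distribution
se = boot_means.std(ddof=1)      # bootstrap standard error (spread)
bias = center - stat             # bootstrap estimate of bias

print(f"Original mean: {stat:.2f}")
print(f"Bootstrap center: {center:.2f}, SE: {se:.2f}, bias: {bias:.3f}")
```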

Confidence Intervals from Bootstrapping

There are two common methods for constructing bootstrap confidence intervals:

1. Percentile Method (Most Common on Exams)

Simply take the desired percentiles of the bootstrap distribution:

# 95% confidence interval using the percentile method
ci_lower = np.percentile(bootstrap_means, 2.5)
ci_upper = np.percentile(bootstrap_means, 97.5)
print(f"95% Bootstrap CI (percentile): [{ci_lower:.2f}, {ci_upper:.2f}]")

2. Standard Error Method

Uses the bootstrap standard error with a normal approximation:

# 95% CI using the standard error method
sample_mean = np.mean(data)
boot_se = np.std(bootstrap_means, ddof=1)
ci_lower_se = sample_mean - 1.96 * boot_se
ci_upper_se = sample_mean + 1.96 * boot_se
print(f"95% Bootstrap CI (SE method): [{ci_lower_se:.2f}, {ci_upper_se:.2f}]")

Exam Tip

For the exam, focus on the percentile method. Remember: a 95% CI uses the 2.5th and 97.5th percentiles, and a 90% CI uses the 5th and 95th percentiles of the bootstrap distribution.

3.2.2 Linear and Logistic Regression

Linear Regression: Theory and Assumptions

Linear regression models the relationship between a continuous dependent variable (y) and one or more independent variables (X). It is the foundation of predictive modeling and one of the most important topics for the exam.

The Four Key Assumptions of Linear Regression

Assumption | Description | What Happens if Violated
Linearity | The relationship between X and y is linear | Model will systematically mis-predict; residuals show a pattern
Independence | Observations are independent of each other | Standard errors are incorrect; confidence intervals are unreliable
Homoscedasticity | The variance of residuals is constant across all levels of X | Coefficient estimates are still unbiased, but standard errors and p-values are wrong
Normality of Residuals | The residuals (errors) follow a normal distribution | Affects inference (p-values, CIs) but coefficients remain unbiased

Remember (L-I-H-N)

The four assumptions: Linearity, Independence, Homoscedasticity, Normality of residuals. Note that normality applies to the residuals, not the variables themselves.

Mathematical Foundation

The simple linear regression equation is:

y = mx + b (or equivalently, y = β0 + β1x + ε)

Symbol | Meaning
y | Dependent (response) variable — what we are predicting
x | Independent (predictor) variable — input feature
β0 (b) | Intercept — the predicted value of y when x = 0
β1 (m) | Slope — the change in y for a one-unit increase in x
ε | Error term — captures variation not explained by the model

Least Squares Estimation

The Ordinary Least Squares (OLS) method finds the values of β0 and β1 that minimize the sum of squared residuals (SSR):

SSR = Σ(y_i − ŷ_i)² → minimize

Where ŷ_i is the predicted value for observation i. The OLS solution gives us the "best fit" line that passes as close as possible to all data points.
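As an illustrative sketch, the OLS estimates for simple regression can be computed from their closed-form formulas and cross-checked against np.polyfit, which solves the same least-squares problem (the data values are the hours/scores example used later in this topic):

```python
import numpy as np

# Hypothetical data: hours studied vs exam score
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([45, 50, 55, 60, 62, 70, 72, 78, 82, 90], dtype=float)

# Closed-form OLS estimates for simple linear regression:
# slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  intercept = ȳ - slope · x̄
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(f"Slope: {beta1:.3f}, Intercept: {beta0:.3f}")

# Cross-check against NumPy's least-squares polynomial fit
slope, intercept = np.polyfit(x, y, deg=1)
assert np.isclose(beta1, slope) and np.isclose(beta0, intercept)
```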

Interpreting Coefficients and R-squared

Interpreting Coefficients

  • Intercept (β0): The expected value of y when all predictors are 0. Sometimes this has a meaningful interpretation; other times it is just a mathematical anchor.
  • Slope (β1): For every one-unit increase in x, y is expected to change by β1 units, holding all other variables constant.

R-squared (Coefficient of Determination)

R-squared measures the proportion of variance in y explained by the model:

R² = 1 − (SS_res / SS_tot)

  • R² = 0: The model explains none of the variance (no better than predicting the mean)
  • R² = 1: The model explains all of the variance (perfect fit)
  • R² = 0.75: The model explains 75% of the variance in y

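The formula can be verified by hand; this sketch fits a line with np.polyfit and computes R² from the residual and total sums of squares (the hours/scores values are the example data used later in this topic):

```python
import numpy as np

# Hypothetical hours/scores data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([45, 50, 55, 60, 62, 70, 72, 78, 82, 90], dtype=float)

# Fit a least-squares line and compute predictions
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)       # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(f"R-squared: {r_squared:.4f}")
```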
Limitations and Potential Biases
  • R² always increases when you add more predictors, even irrelevant ones. Use Adjusted R² for multiple regression.
  • Correlation does not imply causation — a strong R² does not mean x causes y.
  • Linear regression is sensitive to outliers, which can dramatically shift the regression line.
  • Multicollinearity (correlated predictors) inflates standard errors and makes coefficients unstable.
  • Extrapolation beyond the range of observed data is dangerous and unreliable.

Logistic Regression: Concepts and Use Cases

While linear regression predicts a continuous outcome, logistic regression is used when the dependent variable is binary (0 or 1, Yes/No, True/False). It models the probability that an observation belongs to the positive class.

Common use cases:

  • Will a customer churn? (Yes/No)
  • Is an email spam? (Spam/Not Spam)
  • Will a patient be readmitted? (Readmitted/Not Readmitted)
  • Will a loan default? (Default/No Default)

The Sigmoid Function

Logistic regression uses the sigmoid (logistic) function to map any real-valued input to a probability between 0 and 1:

p(y=1|x) = 1 / (1 + e^(−(β0 + β1x)))

The sigmoid curve has an S-shape: for very negative inputs, the probability approaches 0; for very positive inputs, it approaches 1; and at the midpoint (β0 + β1x = 0), the probability is exactly 0.5.
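A minimal sketch of this behavior:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Behavior at the extremes and at the midpoint
print(sigmoid(-10))   # close to 0
print(sigmoid(0))     # exactly 0.5
print(sigmoid(10))    # close to 1
```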

Log-Odds (Logit)

Logistic regression is a linear model in log-odds space. The log-odds (logit) is:

log(p / (1 - p)) = β0 + β1x

This transformation converts a probability (bounded 0-1) into a value that can range from -∞ to +∞, allowing a linear model to work.

Interpreting Logistic Regression Coefficients

In logistic regression, coefficients are interpreted as changes in log-odds:

  • β1 = 0.5: A one-unit increase in x increases the log-odds of the positive class by 0.5
  • To convert to an odds ratio: OR = e^β1 = e^0.5 ≈ 1.65
  • Interpretation: A one-unit increase in x multiplies the odds of the positive outcome by 1.65 (a 65% increase in odds)
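A short sketch of the conversion (the starting probability of 0.30 is an arbitrary illustration):

```python
import numpy as np

beta1 = 0.5                       # log-odds coefficient from the example above
odds_ratio = np.exp(beta1)
print(f"Odds ratio: {odds_ratio:.2f}")   # ≈ 1.65

# Applying the coefficient to a concrete probability:
# start at p = 0.30 and increase x by one unit
p = 0.30
odds = p / (1 - p)                # convert probability to odds
new_odds = odds * odds_ratio      # odds are multiplied by e^beta1
new_p = new_odds / (1 + new_odds) # convert back to a probability
print(f"Probability moves from {p:.2f} to {new_p:.2f}")
```

Note that the odds are multiplied by a constant factor, but the change in probability depends on where you start, which is why coefficients cannot be read directly as probability changes.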
Coefficient Value | Effect on Odds | Effect on Probability
β > 0 | Odds increase (OR > 1) | Probability increases
β = 0 | No effect (OR = 1) | No change
β < 0 | Odds decrease (OR < 1) | Probability decreases

Exam Tip

Remember that logistic regression coefficients are NOT directly interpretable as changes in probability. They represent changes in log-odds. To get odds ratios, exponentiate the coefficient: OR = e^β.

Choosing Between Linear and Logistic Regression

Criterion | Linear Regression | Logistic Regression
Dependent Variable | Continuous (e.g., price, temperature, score) | Binary / categorical (e.g., yes/no, pass/fail)
Output | A predicted value (any real number) | A probability between 0 and 1
Function | y = β0 + β1x | Sigmoid / logistic function
Loss Function | Sum of Squared Errors (SSE) | Log-Loss (Cross-Entropy)
Key Metric | R-squared, RMSE, MAE | Accuracy, Precision, Recall, AUC-ROC
Coefficient Meaning | Change in y per unit change in x | Change in log-odds per unit change in x

Discrete vs Continuous Data Considerations

  • Continuous predictors work naturally in both linear and logistic regression
  • Discrete/categorical predictors must be encoded (e.g., one-hot encoding, label encoding) before use in regression models
  • Continuous target → use linear regression
  • Binary/discrete target → use logistic regression
  • Applying linear regression to a binary target produces probabilities outside [0, 1], which is why logistic regression exists
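A small illustration of that last point, fitting an ordinary least-squares line to a hypothetical binary target:

```python
import numpy as np

# Hypothetical binary outcome fitted with an ordinary least-squares line
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)

# The fitted line is unbounded: "probabilities" leave the [0, 1] range
print(f"Prediction at x=0:  {intercept:.2f}")                # below 0
print(f"Prediction at x=12: {slope * 12 + intercept:.2f}")   # above 1
```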

Python Implementation with scikit-learn

Linear Regression with sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Sample data: hours studied vs exam score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
scores = np.array([45, 50, 55, 60, 62, 70, 72, 78, 82, 90])

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=42
)

# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Get coefficients
print(f"Intercept (beta_0): {model.intercept_:.2f}")
print(f"Slope (beta_1): {model.coef_[0]:.2f}")

# Predictions and evaluation
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

Plotting the Regression Line

# Plot data points and regression line
plt.figure(figsize=(8, 5))
plt.scatter(hours, scores, color='steelblue', s=80, edgecolors='navy',
            label='Data Points')

# Generate predictions for the regression line
X_line = np.linspace(0, 11, 100).reshape(-1, 1)
y_line = model.predict(X_line)
plt.plot(X_line, y_line, color='red', linewidth=2,
         label=f'y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}')

plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.title('Linear Regression: Hours vs Exam Score')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Logistic Regression with sklearn

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Sample data: hours studied vs pass/fail (1 = pass, 0 = fail)
hours = np.array([1, 2, 2, 3, 3, 4, 5, 5, 6, 6,
                  7, 7, 8, 8, 9, 9, 10, 10, 11, 12]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
                   0, 1, 1, 1, 1, 1, 1, 1, 1, 1])

# Split and train
X_train, X_test, y_train, y_test = train_test_split(
    hours, passed, test_size=0.3, random_state=42
)
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

# Coefficients
print(f"Intercept: {log_model.intercept_[0]:.4f}")
print(f"Coefficient: {log_model.coef_[0][0]:.4f}")
print(f"Odds Ratio: {np.exp(log_model.coef_[0][0]):.4f}")

# Predictions
y_pred = log_model.predict(X_test)
y_prob = log_model.predict_proba(X_test)[:, 1]

# Evaluation
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

Model Evaluation Metrics

import numpy as np
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error,       # for linear
    accuracy_score, confusion_matrix, classification_report  # for logistic
)

# --- Linear Regression Metrics ---
# Assume y_test_lin and y_pred_linear exist from a fitted linear regression
r2 = r2_score(y_test_lin, y_pred_linear)
rmse = np.sqrt(mean_squared_error(y_test_lin, y_pred_linear))
mae = mean_absolute_error(y_test_lin, y_pred_linear)

print("Linear Regression Metrics:")
print(f"  R-squared: {r2:.4f}")
print(f"  RMSE: {rmse:.4f}")
print(f"  MAE: {mae:.4f}")

# --- Logistic Regression Metrics ---
# Assume y_test_log and y_pred_logistic exist from a fitted logistic regression
acc = accuracy_score(y_test_log, y_pred_logistic)
cm = confusion_matrix(y_test_log, y_pred_logistic)

print("\nLogistic Regression Metrics:")
print(f"  Accuracy: {acc:.4f}")
print("  Confusion Matrix:")
print(cm)
print("\n  Classification Report:")
print(classification_report(y_test_log, y_pred_logistic))
Understanding the Confusion Matrix

                | Predicted Negative  | Predicted Positive
Actual Negative | True Negative (TN)  | False Positive (FP)
Actual Positive | False Negative (FN) | True Positive (TP)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
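A hand computation from hypothetical counts (the cell values below are made up for illustration) shows how accuracy, precision, recall, and F1 all derive from the same four cells:

```python
# Hypothetical confusion-matrix counts
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)   # all correct / all predictions
precision = tp / (tp + fp)                   # correct among positive predictions
recall = tp / (tp + fn)                      # actual positives found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # (35 + 50) / 100 = 0.850
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```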

statsmodels OLS for Detailed Regression Summary

While scikit-learn is great for prediction, statsmodels provides detailed statistical summaries including p-values, confidence intervals, and diagnostic statistics:

import statsmodels.api as sm
import numpy as np

# Sample data
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([45, 50, 55, 60, 62, 70, 72, 78, 82, 90])

# IMPORTANT: add a constant (intercept) to the predictor
X_with_const = sm.add_constant(X)

# Fit the OLS model
ols_model = sm.OLS(y, X_with_const).fit()

# Print the full summary
print(ols_model.summary())

The ols_model.summary() output includes:

  • R-squared and Adjusted R-squared — overall model fit
  • F-statistic and its p-value — tests if the model is statistically significant
  • Coefficient estimates with standard errors, t-values, p-values, and 95% CIs
  • AIC and BIC — information criteria for model comparison
  • Durbin-Watson statistic — tests for autocorrelation in residuals

Exam Tip: sm.add_constant()

Unlike scikit-learn (which automatically includes an intercept), statsmodels requires you to explicitly add a constant column using sm.add_constant(X). Forgetting this will fit a model with no intercept, forcing the line through the origin.

Key Values to Interpret from statsmodels Output

Output Field | What It Tells You
coef | The estimated coefficient value (β)
std err | Standard error of the coefficient — smaller is more precise
t | t-statistic = coef / std err — tests if coef is significantly different from 0
P>|t| | p-value — if < 0.05, the coefficient is statistically significant
[0.025, 0.975] | 95% confidence interval for the coefficient
R-squared | Proportion of variance explained (0 to 1)
Adj. R-squared | R-squared adjusted for number of predictors
F-statistic | Tests overall significance of the model

Practice Quiz: Inferential Statistics

Q1. What is the defining characteristic of bootstrap sampling?
A) Sampling without replacement from the population
B) Sampling with replacement from the observed sample
C) Sampling new data from a theoretical distribution
D) Dividing the sample into equal-sized subsets
Correct: B. Bootstrapping draws samples of the same size as the original dataset with replacement from the observed sample. This is what generates variability across bootstrap samples and allows estimation of the sampling distribution.
Q2. A 95% bootstrap confidence interval using the percentile method uses which percentiles of the bootstrap distribution?
A) 0th and 95th
B) 2.5th and 97.5th
C) 5th and 95th
D) 0.5th and 99.5th
Correct: B. A 95% CI leaves 2.5% in each tail, so we use np.percentile(bootstrap_dist, 2.5) and np.percentile(bootstrap_dist, 97.5). A 90% CI would use the 5th and 95th percentiles.
Q3. Which of the following is NOT an assumption of linear regression?
A) Linearity between predictors and response
B) Independence of observations
C) The dependent variable must be normally distributed
D) Homoscedasticity (constant variance of residuals)
Correct: C. The normality assumption applies to the residuals (errors), not to the dependent variable itself. The dependent variable does not need to follow a normal distribution. The four assumptions are Linearity, Independence, Homoscedasticity, and Normality of residuals (L-I-H-N).
Q4. In the equation y = β0 + β1x, what does β1 represent?
A) The predicted value of y when x = 0
B) The expected change in y for a one-unit increase in x
C) The proportion of variance explained by the model
D) The probability that the relationship is statistically significant
Correct: B. The slope coefficient β1 tells you the expected change in y for every one-unit increase in x. Option A describes the intercept β0, option C describes R-squared, and option D describes the p-value.
Q5. When should logistic regression be used instead of linear regression?
A) When the independent variable is categorical
B) When the dependent variable is binary (0/1)
C) When the dataset has more than 1000 rows
D) When the relationship between variables is nonlinear
Correct: B. Logistic regression is specifically designed for binary (two-class) dependent variables. It models the probability of belonging to the positive class. Linear regression is for continuous dependent variables. The type of independent variable does not determine the choice between linear and logistic regression.
Q6. What does an R-squared value of 0.82 mean?
A) The model correctly classifies 82% of observations
B) 82% of the variance in the dependent variable is explained by the model
C) The correlation between x and y is 0.82
D) There is an 82% chance the model is correct
Correct: B. R-squared (coefficient of determination) measures the proportion of variance in the dependent variable that is explained by the model. An R² of 0.82 means 82% of the variability in y is accounted for by the predictor(s). Note: the correlation would be √0.82 ≈ 0.906.
Q7. In logistic regression, if a coefficient β1 = 0.693, what is the odds ratio?
A) 2.0
B) 0.693
C) 1.693
D) 0.5
Correct: A. The odds ratio is calculated as e^β = e^0.693 ≈ 2.0. This means a one-unit increase in the predictor doubles the odds of the positive outcome. Note that ln(2) ≈ 0.693.
Q8. When using statsmodels OLS, why must you call sm.add_constant(X) before fitting?
A) To normalize the predictor variables
B) To remove missing values from the dataset
C) To include an intercept term in the regression model
D) To convert the data into a NumPy array
Correct: C. Unlike scikit-learn, which automatically includes an intercept, statsmodels requires you to explicitly add a column of ones to the predictor matrix using sm.add_constant(X). Without this, the regression line is forced through the origin (intercept = 0).
Q9. What is a key limitation of bootstrapping?
A) It can only be used with normally distributed data
B) It requires a minimum sample size of 1000
C) It cannot correct for a biased or unrepresentative original sample
D) It only works with continuous data, not discrete data
Correct: C. Bootstrapping relies entirely on the original sample as a stand-in for the population. If the original sample is biased or unrepresentative, all bootstrap estimates will inherit that bias. Bootstrapping does not create new information — it only estimates the variability of statistics from the existing data.
Q10. Which metric from a confusion matrix measures the proportion of all predictions that were correct?
A) Precision
B) Recall
C) Accuracy
D) F1-Score
Correct: C. Accuracy = (TP + TN) / (TP + TN + FP + FN), which is the total number of correct predictions divided by the total number of predictions. Precision measures correctness among positive predictions (TP / (TP + FP)). Recall measures how many actual positives were found (TP / (TP + FN)). F1-Score is the harmonic mean of precision and recall.
