Block 3: Statistical Analysis
Bootstrapping is a resampling-based statistical method introduced by Bradley Efron in 1979. The core idea is elegantly simple: when you cannot derive the sampling distribution of a statistic analytically, you can approximate it by repeatedly resampling from the data you already have.
The statistical principle behind bootstrapping rests on the plug-in principle: we treat the observed sample as if it were the population and draw repeated samples from it. If our original sample is reasonably representative of the population, the distribution of a statistic across many bootstrap samples will approximate the true sampling distribution of that statistic.
Bootstrapping does not create new data or new information. It leverages the existing sample to estimate the variability of a statistic (e.g., mean, median, regression coefficient) without relying on parametric assumptions about the underlying population distribution.
The defining characteristic of bootstrapping is sampling with replacement. Given an original sample of size n, each bootstrap sample is drawn by randomly selecting n observations from the original sample, where each observation can be selected more than once (or not at all).
| Concept | Description |
|---|---|
| Original Sample | The dataset you collected (size n) |
| Bootstrap Sample | A sample of size n drawn with replacement from the original |
| Bootstrap Replicate | The statistic computed on one bootstrap sample |
| Bootstrap Distribution | The collection of all bootstrap replicates (typically B = 1,000 to 10,000) |
Because we sample with replacement, any individual bootstrap sample will typically contain some duplicated observations and omit others. Each observation is left out of a given bootstrap sample with probability (1 − 1/n)^n, which approaches e^(−1) ≈ 36.8% for large n, so on average about 63.2% of the original observations appear in any given bootstrap sample (the rest form the "out-of-bag" sample).
Without replacement, every sample of size n from a dataset of size n would be identical to the original — there would be no variability. Replacement is what introduces the randomness needed to estimate sampling variability.
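As a quick illustration (a minimal sketch; the array and the seed are arbitrary placeholders), drawing one bootstrap sample with NumPy and checking how many original observations it contains:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)  # an arbitrary "original sample" of size n = 100

# One bootstrap sample: n draws with replacement from the original sample
boot = rng.choice(data, size=data.size, replace=True)

# Fraction of original observations that appear at least once (about 63.2% on average)
in_bag = np.isin(data, boot).mean()
print(f"Fraction of original observations in this bootstrap sample: {in_bag:.3f}")
```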
Bootstrapping applies to both discrete and continuous data, but there are important differences:
| Aspect | Discrete Data | Continuous Data |
|---|---|---|
| Example | Count of defective items, survey responses (1-5) | Heights, temperatures, stock prices |
| Bootstrap behavior | Bootstrap samples have many tied values; the bootstrap distribution of the statistic tends to be "lumpy" | More variety in bootstrap samples; bootstrap distribution is smoother |
| Effective? | Yes, but may need more replicates for stable estimates | Typically works well with standard B values |
| Caution | Small discrete samples can produce unreliable bootstrap distributions | Extreme outliers can disproportionately affect results |
Bootstrapping is not always the best choice. Understanding when to use it versus parametric methods is critical:
Use bootstrapping when:
- The sampling distribution of the statistic is unknown or hard to derive analytically (e.g., the median, a ratio, or a percentile).
- The sample size is small or moderate and parametric assumptions about the population are doubtful.
- You want standard errors or confidence intervals without committing to a specific distributional form.
Prefer theoretical distributions when:
- The parametric assumptions (e.g., approximate normality) are well justified.
- The sample is large enough for results such as the Central Limit Theorem to apply.
- A closed-form result exists, which is faster and often more precise than resampling.
Bootstrapping does not fix a biased or unrepresentative sample. If your original sample poorly represents the population, bootstrap estimates will also be poor. The quality of bootstrap inference depends entirely on the quality of the original sample.
Here is how to generate bootstrap samples and analyze their distribution using NumPy:
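A minimal sketch (the variable names data and B, the seed, and the synthetic sample are placeholders; substitute your own observations):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # placeholder sample; use your own data here

B = 10_000                                      # number of bootstrap replicates
boot_means = np.empty(B)
for i in range(B):
    resample = rng.choice(data, size=data.size, replace=True)  # n draws with replacement
    boot_means[i] = resample.mean()                            # one bootstrap replicate

print("Sample mean:             ", data.mean())
print("Bootstrap SE of the mean:", boot_means.std(ddof=1))
```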
You can also bootstrap other statistics, such as the median:
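For example, reusing data, B, and rng from the sketch above:

```python
# Bootstrap distribution of the median, reusing data, B, and rng from above
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(B)
])
print("Bootstrap SE of the median:", boot_medians.std(ddof=1))
```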
To assess the reliability and validity of bootstrap results, you should examine:
- The number of replicates B — estimates should be stable when B is increased.
- The shape of the bootstrap distribution — a very lumpy or heavily skewed distribution warrants caution.
- The representativeness of the original sample, since bootstrap inference can only be as good as the data being resampled.
There are two common methods for constructing bootstrap confidence intervals:
Percentile method: simply take the desired percentiles of the bootstrap distribution. For a 95% CI, use the 2.5th and 97.5th percentiles.
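Continuing the sketch above (using the boot_means array):

```python
# 95% percentile CI: 2.5th and 97.5th percentiles of the bootstrap replicates
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% percentile CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")
```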
Standard error method: uses the bootstrap standard error with a normal approximation: CI = point estimate ± z* × SE_boot, where z* ≈ 1.96 for a 95% interval.
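A sketch of the same interval for the bootstrap distribution built earlier (1.96 is the standard normal quantile for 95% coverage):

```python
# 95% CI: point estimate +/- 1.96 * bootstrap standard error (normal approximation)
se_boot = boot_means.std(ddof=1)
ci_low = data.mean() - 1.96 * se_boot
ci_high = data.mean() + 1.96 * se_boot
print(f"95% SE-based CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")
```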
For the exam, focus on the percentile method. Remember: a 95% CI uses the 2.5th and 97.5th percentiles, and a 90% CI uses the 5th and 95th percentiles of the bootstrap distribution.
Linear regression models the relationship between a continuous dependent variable (y) and one or more independent variables (X). It is the foundation of predictive modeling and one of the most important topics for the exam.
| Assumption | Description | What Happens if Violated |
|---|---|---|
| Linearity | The relationship between X and y is linear | Model will systematically mis-predict; residuals show a pattern |
| Independence | Observations are independent of each other | Standard errors are incorrect; confidence intervals are unreliable |
| Homoscedasticity | The variance of residuals is constant across all levels of X | Coefficient estimates are still unbiased, but standard errors and p-values are wrong |
| Normality of Residuals | The residuals (errors) follow a normal distribution | Affects inference (p-values, CIs) but coefficients remain unbiased |
The four assumptions: Linearity, Independence, Homoscedasticity, Normality of residuals. Note that normality applies to the residuals, not the variables themselves.
The simple linear regression equation is: y = β0 + β1x + ε
| Symbol | Meaning |
|---|---|
| y | Dependent (response) variable — what we are predicting |
| x | Independent (predictor) variable — input feature |
| β0 (b) | Intercept — the predicted value of y when x = 0 |
| β1 (m) | Slope — the change in y for a one-unit increase in x |
| ε | Error term — captures variation not explained by the model |
The Ordinary Least Squares (OLS) method finds the values of β0 and β1 that minimize the sum of squared residuals (SSR): SSR = Σ (yi − ŷi)²
Where ŷi is the predicted value for observation i. The OLS solution gives us the "best fit" line that passes as close as possible to all data points.
R-squared measures the proportion of variance in y explained by the model: R² = 1 − SSR / SST, where SST = Σ (yi − ȳ)² is the total sum of squares. It ranges from 0 to 1, with higher values indicating a better fit.
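A minimal NumPy sketch of the closed-form OLS estimates and R-squared (the x and y data are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)                   # placeholder predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)     # placeholder response

# Closed-form OLS estimates for simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
b0 = y.mean() - b1 * x.mean()                                               # intercept

y_hat = b0 + b1 * x                       # predicted values
ss_res = np.sum((y - y_hat) ** 2)         # sum of squared residuals (SSR)
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares (SST)
r_squared = 1 - ss_res / ss_tot

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, R^2 = {r_squared:.3f}")
```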
While linear regression predicts a continuous outcome, logistic regression is used when the dependent variable is binary (0 or 1, Yes/No, True/False). It models the probability that an observation belongs to the positive class.
Common use cases:
- Spam detection (spam / not spam)
- Medical diagnosis (disease present / absent)
- Customer churn prediction (will churn / will stay)
- Credit default prediction (default / no default)
Logistic regression uses the sigmoid (logistic) function to map any real-valued input to a probability between 0 and 1: p = 1 / (1 + e^−(β0 + β1x))
The sigmoid curve has an S-shape: for very negative inputs, the probability approaches 0; for very positive inputs, it approaches 1; and at the midpoint (β0 + β1x = 0), the probability is exactly 0.5.
Logistic regression is a linear model in log-odds space. The log-odds (logit) is: ln(p / (1 − p)) = β0 + β1x
This transformation converts a probability (bounded 0-1) into a value that can range from -∞ to +∞, allowing a linear model to work.
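A small sketch showing that the sigmoid and the logit are inverses of each other (function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability p in (0, 1) back to log-odds."""
    return np.log(p / (1.0 - p))

print(sigmoid(0.0))                  # 0.5 at the midpoint (beta0 + beta1*x = 0)
print(sigmoid(-6.0), sigmoid(6.0))   # close to 0 and close to 1 in the tails
print(logit(sigmoid(1.7)))           # ~1.7, since the logit inverts the sigmoid
```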
In logistic regression, coefficients are interpreted as changes in log-odds:
| Coefficient Value | Effect on Odds | Effect on Probability |
|---|---|---|
| β > 0 | Odds increase (OR > 1) | Probability increases |
| β = 0 | No effect (OR = 1) | No change |
| β < 0 | Odds decrease (OR < 1) | Probability decreases |
Remember that logistic regression coefficients are NOT directly interpretable as changes in probability. They represent changes in log-odds. To get odds ratios, exponentiate the coefficient: OR = e^β.
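A sketch of this interpretation with scikit-learn on simulated data (the feature, seed, and true coefficients are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 1))                        # placeholder feature
p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.2 * X[:, 0])))    # true probabilities via the sigmoid
y = rng.binomial(1, p)                               # simulated binary outcome

clf = LogisticRegression().fit(X, y)
beta = clf.coef_[0, 0]
print(f"coefficient (log-odds change per unit x): {beta:.3f}")
print(f"odds ratio exp(beta):                     {np.exp(beta):.3f}")
```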
| Criterion | Linear Regression | Logistic Regression |
|---|---|---|
| Dependent Variable | Continuous (e.g., price, temperature, score) | Binary / Categorical (e.g., yes/no, pass/fail) |
| Output | A predicted value (any real number) | A probability between 0 and 1 |
| Function | y = β0 + β1x | Sigmoid / logistic function |
| Loss Function | Sum of Squared Errors (SSE) | Log-Loss (Cross-Entropy) |
| Key Metric | R-squared, RMSE, MAE | Accuracy, Precision, Recall, AUC-ROC |
| Coefficient Meaning | Change in y per unit change in x | Change in log-odds per unit change in x |
The performance of a binary classifier such as logistic regression is summarized in a confusion matrix:

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
Accuracy = (TP + TN) / (TP + TN + FP + FN)
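For example, computing these counts and accuracy with scikit-learn (the labels below are placeholder values):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # placeholder actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # placeholder predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # binary case: TN, FP, FN, TP
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("Accuracy:", accuracy_score(y_true, y_pred))          # (TP + TN) / total
```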
While scikit-learn is great for prediction, statsmodels provides detailed statistical summaries including p-values, confidence intervals, and diagnostic statistics:
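A minimal statsmodels sketch (the data are synthetic placeholders; the variable name ols_model matches the summary discussed below):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)                   # placeholder predictor
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)     # placeholder response

X = sm.add_constant(x)                 # explicitly add the column of ones for the intercept
ols_model = sm.OLS(y, X).fit()         # ordinary least squares fit
print(ols_model.summary())             # coefficients, std errors, t, p-values, CIs, R-squared
```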
The ols_model.summary() output includes the fields described in the table below:
Unlike scikit-learn (which automatically includes an intercept), statsmodels requires you to explicitly add a constant column using sm.add_constant(X). Forgetting this will fit a model with no intercept, forcing the line through the origin.
| Output Field | What It Tells You |
|---|---|
| coef | The estimated coefficient value (β) |
| std err | Standard error of the coefficient — smaller is more precise |
| t | t-statistic = coef / std err — tests if the coefficient is significantly different from 0 |
| P>\|t\| | p-value — if < 0.05, the coefficient is statistically significant |
| [0.025, 0.975] | 95% confidence interval for the coefficient |
| R-squared | Proportion of variance explained (0 to 1) |
| Adj. R-squared | R-squared adjusted for the number of predictors |
| F-statistic | Tests the overall significance of the model |
Q: How do you compute a 95% bootstrap confidence interval in NumPy?
A: Use np.percentile(bootstrap_dist, 2.5) and np.percentile(bootstrap_dist, 97.5). A 90% CI would use the 5th and 95th percentiles.

Q: What happens if you forget sm.add_constant(X) before fitting?
A: statsmodels requires you to explicitly add a column of ones to the predictor matrix using sm.add_constant(X). Without this, the regression line is forced through the origin (intercept = 0).