Topic 4.2: Statistical Methods and Machine Learning
Block 4: Data Analysis and Modeling
Learning Objectives (3)
4.2.1 Descriptive Statistics for Dataset Analysis
4.2.2 Importance of Test Datasets
4.2.3 Supervised Learning Algorithms and Model Accuracy
4.2.1 Descriptive Statistics for Dataset Analysis
Descriptive statistics summarize and describe the main characteristics of a dataset. They provide a quick overview of data distribution, central tendency, and variability, forming the foundation for all data analysis.
Key Measures
| Measure | What It Tells You | Formula / Concept |
|---|---|---|
| Mean | Average value; the center of the data | Sum of all values / Number of values |
| Median | Middle value when data is sorted; robust to outliers | 50th percentile |
| Mode | Most frequently occurring value | Value with highest frequency |
| Variance | Average squared deviation from the mean | Σ(xᵢ − x̄)² / (n − 1) for a sample; divide by n for a population |
| Standard Deviation | Spread of values around the mean (in original units) | √Variance |
Exam Tip:
The mean is sensitive to outliers while the median is robust. If a dataset is heavily skewed (e.g., income data), the median is a better measure of central tendency.
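To make the tip concrete, here is a small sketch with made-up income figures (the numbers are illustrative, not from the lesson):

```python
# Toy income data illustrating mean vs. median sensitivity to outliers
import numpy as np

incomes = np.array([40_000, 45_000, 50_000, 52_000, 55_000])
print(np.mean(incomes))    # 48400.0
print(np.median(incomes))  # 50000.0

# Add one extreme outlier (a very high earner)
incomes_outlier = np.append(incomes, 1_000_000)
print(np.mean(incomes_outlier))    # 207000.0 -- dragged far upward
print(np.median(incomes_outlier))  # 51000.0  -- barely moves
```

The single outlier quadruples the mean while shifting the median by only 1,000, which is why the median is preferred for skewed data such as incomes.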
Using Pandas for Descriptive Statistics
Pandas provides built-in methods that make computing descriptive statistics on DataFrames and Series straightforward.
Key Detail: ddof parameter
By default, Pandas uses ddof=1 (sample standard deviation/variance), while NumPy uses ddof=0 (population). This is a common exam question. You can specify df['col'].std(ddof=0) to match NumPy's default behavior.
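A short sketch of the ddof difference, using an arbitrary eight-value Series (the data is illustrative):

```python
# Pandas default (ddof=1, sample) vs. NumPy default (ddof=0, population)
import pandas as pd
import numpy as np

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
print(s.mean())           # 5.0
print(s.std())            # sample std (ddof=1) ~ 2.138
print(s.std(ddof=0))      # population std -> 2.0
print(np.std(s.values))   # NumPy default ddof=0 -> 2.0
```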
Using NumPy for Descriptive Statistics
NumPy provides efficient functions for computing statistics on arrays. It is especially useful for large numerical datasets.
# Descriptive statistics with NumPy
import numpy as np
salaries = np.array([55000, 72000, 95000, 60000, 58000,
                     110000, 78000, 85000, 48000, 67000])
print("Mean:", np.mean(salaries))               # 72800.0
print("Median:", np.median(salaries))           # 69500.0
print("Std (pop):", np.std(salaries))           # 18443.43 (ddof=0)
print("Std (samp):", np.std(salaries, ddof=1))  # 19441.08
print("Var (pop):", np.var(salaries))           # 340160000.0 (ddof=0)
print("Var (samp):", np.var(salaries, ddof=1))  # 377955555.56
# Percentiles and quantiles
print("25th percentile:", np.percentile(salaries, 25))  # 58500.0
print("75th percentile:", np.percentile(salaries, 75))  # 83250.0
print("IQR:", np.percentile(salaries, 75) - np.percentile(salaries, 25))  # 24750.0
# NumPy does NOT have a built-in mode function
# Use scipy.stats.mode() or pandas .mode() instead
from scipy import stats
mode_result = stats.mode(salaries, keepdims=True)
print("Mode:", mode_result.mode[0])  # 48000 (every value occurs once, so SciPy returns the smallest)
Practical Application: Trends, Anomalies, and Decisions
Descriptive statistics serve three critical purposes in data analysis:
1. Identifying Trends
Comparing means, medians, and distributions across groups or time periods reveals patterns. For example, comparing average sales by quarter identifies seasonal trends.
2. Detecting Anomalies
Values that fall outside expected ranges (e.g., more than 2-3 standard deviations from the mean) signal potential data quality issues or genuinely unusual observations.
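As a sketch of this rule of thumb, the following flags any value more than 2 standard deviations from the mean (the data and the 2-sigma cutoff are illustrative choices):

```python
# Flag values more than 2 standard deviations from the mean
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 45, 12, 10, 11])
mean, std = data.mean(), data.std()
z_scores = (data - mean) / std          # standardized distance from the mean
anomalies = data[np.abs(z_scores) > 2]  # values beyond the 2-sigma cutoff
print(anomalies)  # [45] -- the only value that stands out
```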
3. Informing Decisions
Summary statistics help stakeholders understand data at a glance without needing to examine every record. A manager can review average performance scores, salary distributions, and score variability to allocate bonuses.
Common Exam Pitfall:
Remember that df.describe() only includes numeric columns by default. To include non-numeric columns, use df.describe(include='all'). It shows count, unique, top, and freq for object columns.
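A minimal sketch of the difference, using a made-up two-column DataFrame:

```python
# df.describe() default vs. include='all' on a mixed-type DataFrame
import pandas as pd

df = pd.DataFrame({
    'score': [85, 90, 78, 92],
    'team':  ['A', 'B', 'A', 'A'],
})
print(df.describe())               # numeric column 'score' only
print(df.describe(include='all'))  # adds count/unique/top/freq for 'team'
```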
4.2.2 Importance of Test Datasets
In machine learning, evaluating a model on the same data it was trained on gives a misleadingly optimistic view of its performance. Test datasets are held-out portions of data that the model has never seen, used exclusively to assess generalization ability.
Why We Split Data
| Concept | Purpose |
|---|---|
| Training Set | Used to train (fit) the model; the model learns patterns from this data |
| Test Set | Used to evaluate performance; measures how well the model generalizes to unseen data |
| Validation Set (optional) | Used during training to tune hyperparameters; prevents overfitting to the test set |
A typical split ratio is 80% training / 20% testing, though 70/30 and 75/25 are also common. The key rule is: never use test data during training.
Train/Test Split with sklearn
# Train/Test Split using scikit-learn
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Create a sample dataset
np.random.seed(42)
n = 200
X = pd.DataFrame({
    'feature_1': np.random.randn(n),
    'feature_2': np.random.randn(n),
    'feature_3': np.random.uniform(0, 10, n)
})
y = (X['feature_1'] * 2 + X['feature_2'] + np.random.randn(n) * 0.5) > 0
y = y.astype(int)
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # reproducibility
    stratify=y        # maintain class proportions
)
print(f"Training set size: {len(X_train)}") # 160
print(f"Test set size: {len(X_test)}") # 40
print(f"Training class distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Test class distribution:\n{y_test.value_counts(normalize=True)}")
Key Parameters of train_test_split():
test_size — Fraction (0.0 to 1.0) or absolute number of test samples
random_state — Seed for reproducibility; ensures the same split every time
stratify — Ensures the train and test sets have the same proportion of each class (critical for imbalanced datasets)
shuffle — Whether to shuffle data before splitting (default: True)
Data Leakage and Best Practices
Data leakage occurs when information from outside the training dataset is used to create the model. This leads to unrealistically high performance during evaluation but poor results in production.
Common Sources of Data Leakage
Target leakage: Using features that are derived from or directly correlated with the target variable (e.g., using "total_charges" to predict "is_subscribed" when charges only exist for subscribers)
Train-test contamination: Normalizing or imputing missing values on the full dataset before splitting — test statistics leak into training
Temporal leakage: Using future data to predict the past (e.g., using Friday sales data to predict Thursday demand)
Best Practices for Unbiased Evaluation
Split first, then preprocess. Fit scalers and imputers on training data only, then transform test data.
Use stratification for classification tasks to maintain class balance.
Set a random_state for reproducible results.
Never peek at test data during feature engineering or model selection.
Use cross-validation for more reliable estimates of performance.
# CORRECT: Fit scaler on training data, transform both
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform only!
# WRONG: This causes data leakage!
# scaler.fit_transform(X) # fitting on ALL data (train + test)
Data Leakage Warning:
Always apply fit_transform() on the training set and transform() on the test set. Fitting on the entire dataset leaks test set statistics into the training process.
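One way to make this rule automatic is to wrap the scaler and the model in a scikit-learn Pipeline: during cross-validation the scaler is then re-fit on the training folds only, so test-fold statistics never leak in. A sketch with synthetic data (the dataset and model choice are illustrative):

```python
# Pipeline re-fits preprocessing inside every CV fold -- no leakage
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scaler + model travel together; cross_val_score fits the whole
# pipeline on each training fold and scores on the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores.mean())
```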
4.2.3 Supervised Learning Algorithms and Model Accuracy
Supervised learning uses labeled data (input-output pairs) to train models that can predict outcomes for new, unseen data. There are two main categories:
| Type | Target Variable | Goal | Example Algorithms |
|---|---|---|---|
| Classification | Categorical (discrete) | Predict a class label | Logistic Regression, Decision Trees, SVM, KNN, Random Forest |
| Regression | Continuous (numeric) | Predict a numerical value | Linear Regression, Polynomial Regression, Ridge, Lasso |
Characteristics of Common Algorithms
| Algorithm | Type | Key Characteristics |
|---|---|---|
| Linear Regression | Regression | Assumes a linear relationship; simple, interpretable; prone to underfitting with complex data |
| Logistic Regression | Classification | Outputs probabilities via the sigmoid; works for binary/multi-class; assumes a linear decision boundary |
| Decision Trees | Both | Splits data recursively; highly interpretable; prone to overfitting without pruning |
| Random Forest | Both | Ensemble of decision trees; reduces overfitting; less interpretable than a single tree |
| K-Nearest Neighbors (KNN) | Both | Instance-based; no explicit training phase; sensitive to feature scaling and the choice of k |
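KNN's sensitivity to feature scaling can be sketched with synthetic data in which an uninformative feature has a much larger scale than the informative one (all names and values here are illustrative):

```python
# KNN with vs. without feature scaling
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
informative = rng.normal(size=n)                # carries the class signal
noise_large = rng.normal(scale=1000.0, size=n)  # pure noise on a huge scale
X = np.column_stack([informative, noise_large])
y = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)
# Unscaled: distances are dominated by the large-scale noise feature
print("unscaled accuracy:", raw.score(X_te, y_te))
print("scaled accuracy:  ", scaled.score(X_te, y_te))
```

With raw features the distance metric is dominated by the noisy feature, so accuracy sits near chance; after standardization the informative feature contributes equally and accuracy recovers.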
Overfitting, Underfitting, and the Bias-Variance Tradeoff
Overfitting
Overfitting occurs when a model is too complex and learns the noise in the training data rather than the underlying pattern. It performs exceptionally well on training data but poorly on unseen test data.
High training accuracy, low test accuracy
The model has memorized the training examples
Too many parameters relative to the number of data points
Example: A high-degree polynomial that passes through every training point
Underfitting
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both training and test data.
Low training accuracy, low test accuracy
The model fails to learn the relationship between features and target
Too few parameters or overly restrictive assumptions
Example: Fitting a straight line to clearly curved data
The Bias-Variance Tradeoff
Model error can be decomposed into three components: bias, variance, and irreducible error (noise). The goal is to find the sweet spot that minimizes total error.
| Component | Definition | Associated With |
|---|---|---|
| Bias | Error from overly simplistic assumptions; the model consistently misses the true pattern | Underfitting (too simple) |
| Variance | Error from sensitivity to small fluctuations in the training data; predictions vary widely | Overfitting (too complex) |
| Irreducible Error | Noise inherent in the data; cannot be reduced by any model | The data itself |
How It Connects:
Linear regression tends toward high bias / low variance (underfitting) — it assumes a linear relationship regardless of true data patterns.
High-degree polynomial regression tends toward low bias / high variance (overfitting) — it can fit training data perfectly but wildly fluctuates on new data.
The optimal model complexity lies where total error (bias + variance) is minimized.
Strategies to Prevent Overfitting
Cross-validation: Evaluate model on multiple folds of data to get a robust performance estimate
Regularization: Add penalty terms (L1/Lasso, L2/Ridge) that constrain model complexity
Simplify the model: Use fewer features or lower polynomial degree
More training data: Larger datasets reduce the chance of memorizing noise
Early stopping: Stop training when validation performance starts decreasing
Pruning: Remove branches from decision trees that add little predictive power
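As a sketch of the regularization strategy, the following compares an unregularized degree-12 polynomial fit with a Ridge (L2) fit on the same noisy sine data; the degree, alpha=1.0, and dataset are arbitrary illustrative choices. Ridge shrinks the coefficient norm, which is the mechanism that constrains model complexity:

```python
# L2 regularization (Ridge) shrinking high-degree polynomial coefficients
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

np.random.seed(0)
X = np.sort(np.random.uniform(0, 6, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(30) * 0.2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "plain": make_pipeline(PolynomialFeatures(12), StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(PolynomialFeatures(12), StandardScaler(), Ridge(alpha=1.0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    coefs = model[-1].coef_  # coefficients of the final estimator
    print(f"{name}: test R2 = {r2_score(y_te, model.predict(X_te)):.3f}, "
          f"coef norm = {np.linalg.norm(coefs):.1f}")
```

The penalized fit has a much smaller coefficient norm; the unregularized fit typically generalizes worse here because its coefficients blow up on the collinear polynomial features.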
Linear Regression: Fitting and Evaluation
# Linear Regression with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Generate sample regression data
np.random.seed(42)
X = np.random.rand(100, 1) * 10 # Feature: values 0-10
y = 2.5 * X.flatten() + 5 + np.random.randn(100) * 2 # y = 2.5x + 5 + noise
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.4f}") # ~2.5
print(f"Intercept: {model.intercept_:.4f}") # ~5.0
# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Evaluation metrics
print(f"\nTraining R-squared: {r2_score(y_train, y_train_pred):.4f}")
print(f"Test R-squared: {r2_score(y_test, y_test_pred):.4f}")
print(f"Training MSE: {mean_squared_error(y_train, y_train_pred):.4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.4f}")
# R-squared interpretation:
# - 1.0 = perfect predictions
# - 0.0 = model no better than predicting the mean
# - Negative = model worse than predicting the mean
Logistic Regression: Classification
# Logistic Regression for binary classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
# Load the Iris dataset (simplified to binary: setosa vs not-setosa)
iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int) # 1 = setosa, 0 = not setosa
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Create and fit logistic regression
log_model = LogisticRegression(random_state=42, max_iter=200)
log_model.fit(X_train, y_train)
# Predictions
y_pred = log_model.predict(X_test)
y_prob = log_model.predict_proba(X_test) # probability estimates
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Predicted 0 Predicted 1
# Actual 0 [ TN FP ]
# Actual 1 [ FN TP ]
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 1.00 1.00 1.00 20
# 1 1.00 1.00 1.00 10
# accuracy 1.00 30
Confusion Matrix Explained:
True Positive (TP): Correctly predicted positive
True Negative (TN): Correctly predicted negative
False Positive (FP): Incorrectly predicted positive (Type I error)
False Negative (FN): Incorrectly predicted negative (Type II error)
Demonstrating Overfitting with Polynomial Regression
# Demonstrating overfitting vs underfitting with polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from sklearn.model_selection import train_test_split
# Generate non-linear data: y = sin(x) + noise
np.random.seed(42)
X = np.sort(np.random.uniform(0, 6, 30)).reshape(-1, 1)
y = np.sin(X).flatten() + np.random.randn(30) * 0.2
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Compare different complexities
degrees = [1, 4, 15] # underfitting, good fit, overfitting
for degree in degrees:
    # Create polynomial regression pipeline
    model = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    model.fit(X_train, y_train)
    # Evaluate on both sets
    train_score = r2_score(y_train, model.predict(X_train))
    test_score = r2_score(y_test, model.predict(X_test))
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"\nDegree {degree}:")
    print(f"  Train R2: {train_score:.4f} | Test R2: {test_score:.4f}")
    print(f"  Train MSE: {train_mse:.4f} | Test MSE: {test_mse:.4f}")
# Expected output pattern:
# Degree 1: Low train R2, Low test R2 -> UNDERFITTING
# Degree 4: Good train R2, Good test R2 -> GOOD FIT
# Degree 15: ~1.0 train R2, Low test R2 -> OVERFITTING
Key Takeaway:
When training R² is much higher than test R², the model is overfitting. When both are low, the model is underfitting. The best model has similar (and high) scores on both sets.
Cross-Validation
Cross-validation provides a more reliable estimate of model performance by training and evaluating the model on multiple different subsets of the data.
# K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
# Create model
model = LinearRegression()
# 5-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
print("Fold R2 scores:", scores)
print(f"Mean R2: {scores.mean():.4f}")
print(f"Std R2: {scores.std():.4f}")
# Cross-validation for classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
iris = load_iris()
log_model = LogisticRegression(max_iter=200)
# 5-fold cross-validation with accuracy scoring
cv_scores = cross_val_score(log_model, iris.data, iris.target,
                            cv=5, scoring='accuracy')
print(f"\nLogistic Regression CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# How K-Fold works:
# Fold 1: [Test] [Train] [Train] [Train] [Train]
# Fold 2: [Train] [Test] [Train] [Train] [Train]
# Fold 3: [Train] [Train] [Test] [Train] [Train]
# Fold 4: [Train] [Train] [Train] [Test] [Train]
# Fold 5: [Train] [Train] [Train] [Train] [Test]
# Final score = average of all 5 fold scores
Model Accuracy Metrics
Classification Metrics
| Metric | Formula | Best For |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced datasets |
| Precision | TP / (TP + FP) | When false positives are costly (e.g., spam detection) |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (e.g., disease screening) |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Imbalanced datasets; harmonic mean of precision and recall |
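These formulas can be verified on a tiny hand-made example (the labels below are invented for illustration):

```python
# Computing classification metrics on hand-countable labels
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
# Counting by hand: TP=2, FN=2, FP=1, TN=3
print(accuracy_score(y_true, y_pred))   # (2+3)/8 = 0.625
print(precision_score(y_true, y_pred))  # 2/(2+1) ~ 0.667
print(recall_score(y_true, y_pred))     # 2/(2+2) = 0.5
print(f1_score(y_true, y_pred))         # 2*(2/3)*(1/2)/((2/3)+(1/2)) = 4/7 ~ 0.571
```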
Regression Metrics
| Metric | What It Measures | Ideal Value |
|---|---|---|
| R-squared (R²) | Proportion of variance in y explained by the model | 1.0 (closer to 1 is better) |
| MSE (Mean Squared Error) | Average of squared prediction errors; penalizes large errors | 0.0 (lower is better) |
| RMSE | Square root of MSE; in the same units as the target variable | 0.0 (lower is better) |
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values | 0.0 (lower is better) |
Choosing the Right Metric
Use precision when the cost of false positives is high (e.g., spam filter marking important emails as spam).
Use recall when the cost of false negatives is high (e.g., failing to detect a disease).
Use F1-score when you need a balance between precision and recall, especially with imbalanced data.
Use R-squared and MSE for regression problems.
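The regression metrics can likewise be checked on a tiny hand-made example (values invented for illustration):

```python
# Regression metrics on hand-countable values
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mse = mean_squared_error(y_true, y_pred)
print(mse)                                  # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
print(np.sqrt(mse))                         # RMSE ~ 0.612, in the units of y
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
print(r2_score(y_true, y_pred))             # 1 - 1.5/20 = 0.925
```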
Practice Quiz: Statistical Methods and Machine Learning
Test your understanding with these 10 questions; the correct answer and a brief explanation follow each one.
Q1: What does the df.describe() method return by default?
A) Summary statistics for all columns including object types
B) Summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns only
C) Only the mean and standard deviation of all columns
D) A correlation matrix of all numeric columns
Correct: B. By default, df.describe() computes count, mean, std, min, 25th percentile, 50th percentile (median), 75th percentile, and max for numeric columns only. Use df.describe(include='all') to include non-numeric columns.
Q2: By default, np.std() computes the population standard deviation (ddof=0), while df.std() in Pandas computes the sample standard deviation (ddof=1). What is the practical difference?
A) There is no practical difference; both produce the same value
B) The sample standard deviation (ddof=1) divides by (n-1) and is slightly larger, providing an unbiased estimate for sample data
C) The population standard deviation (ddof=0) is always larger because it divides by n
D) Pandas always gives incorrect results for standard deviation
Correct: B. Dividing by (n-1) instead of n corrects for the bias in estimating the population variance from a sample. This is called Bessel's correction. The sample std (ddof=1) is slightly larger than the population std (ddof=0). Both NumPy and Pandas can use either, but their defaults differ.
Q3: Why is it critical to split data into training and testing sets before model evaluation?
A) To reduce the size of the dataset and speed up training
B) To ensure the model trains on only easy examples
C) To evaluate how well the model generalizes to unseen data, preventing overly optimistic performance estimates
D) Because scikit-learn requires the data to be split before training
Correct: C. Evaluating on the same data used for training gives misleadingly high performance metrics because the model has already memorized the training examples. A separate test set provides an honest estimate of how the model performs on new, unseen data.
Q4: What is data leakage?
A) When the dataset is too small to train a model
B) When information from outside the training set is used during model training, leading to unrealistically high performance
C) When a model underfits the data due to insufficient features
D) When training data is stored in an insecure database
Correct: B. Data leakage occurs when test set information bleeds into the training process. A common example is fitting a scaler on the entire dataset (train + test) before splitting, which allows the model to learn from test data statistics. Always split first, then preprocess.
Q5: A model achieves 99% accuracy on the training set but only 55% on the test set. This is most likely a case of:
A) Underfitting
B) Overfitting
C) A good generalization
D) Data leakage
Correct: B. A large gap between training accuracy (99%) and test accuracy (55%) is the hallmark of overfitting. The model has memorized the training data (including its noise) but cannot generalize to new data. The model is too complex for the amount of training data.
Q6: In the bias-variance tradeoff, a simple linear regression model applied to highly non-linear data tends to have:
A) High bias and low variance
B) Low bias and high variance
C) Low bias and low variance
D) High bias and high variance
Correct: A. Linear regression makes strong assumptions about linearity. When applied to non-linear data, it consistently underfits (high bias) because it cannot capture the true pattern. However, its predictions are stable across different training sets (low variance).
Q7: What is the purpose of the stratify parameter in train_test_split()?
A) To sort the data before splitting
B) To increase the size of the test set
C) To ensure both training and test sets have the same proportion of each class
D) To apply feature scaling automatically
Correct: C. The stratify parameter ensures that the class distribution in the original dataset is preserved in both the training and test splits. This is especially important for imbalanced datasets where one class is underrepresented.
Q8: Which metric is most appropriate when the cost of false negatives is very high (e.g., failing to diagnose a serious disease)?
A) Accuracy
B) Precision
C) Recall
D) R-squared
Correct: C. Recall = TP / (TP + FN) measures the proportion of actual positives that are correctly identified. When false negatives are costly (missing a disease diagnosis), maximizing recall ensures that as many true positive cases as possible are detected, even at the expense of more false positives.
Q9: In 5-fold cross-validation, how many times is the model trained and evaluated?
A) 1 time
B) 5 times, each using a different 20% of the data as the test set
C) 5 times, each using 50% of the data for training
D) 25 times, testing every combination of folds
Correct: B. In 5-fold cross-validation, the data is divided into 5 equal parts. The model is trained 5 times, each time using 4 folds (80%) for training and 1 fold (20%) for testing. Each fold serves as the test set exactly once. The final performance metric is the average across all 5 folds.
Q10: Which of the following strategies helps prevent overfitting?
A) Increasing model complexity by adding more polynomial features
B) Training the model for more iterations without a stopping criterion
C) Applying regularization (e.g., L1 or L2 penalty) and using cross-validation
D) Removing the test set and training on all available data
Correct: C. Regularization (L1/Lasso or L2/Ridge) adds a penalty term that discourages overly complex models by constraining large coefficients. Combined with cross-validation for robust performance estimation, these are the primary strategies to combat overfitting. Options A, B, and D would all increase or mask overfitting.