Topic 4.2: Statistical Methods and Machine Learning
Block 4: Data Analysis and Modeling
Learning Objectives (3)
4.2.1 Descriptive Statistics for Dataset Analysis
4.2.2 Importance of Test Datasets
4.2.3 Supervised Learning Algorithms and Model Accuracy
4.2.1 Descriptive Statistics for Dataset Analysis
Descriptive statistics summarize and describe the main characteristics of a dataset. They provide a quick overview of data distribution, central tendency, and variability, forming the foundation for all data analysis.
Key Measures
| Measure | What It Tells You | Formula / Concept |
|---|---|---|
| Mean | Average value; the center of the data | Sum of all values / Number of values |
| Median | Middle value when data is sorted; robust to outliers | 50th percentile |
| Mode | Most frequently occurring value | Value with highest frequency |
| Variance | Average squared deviation from the mean | Σ(xᵢ − x̄)² / (n − 1) for a sample; divide by n for a population |
| Standard Deviation | Spread of values around the mean (in original units) | √Variance |
Exam Tip:
The mean is sensitive to outliers while the median is robust. If a dataset is heavily skewed (e.g., income data), the median is a better measure of central tendency.
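To make the tip concrete, here is a small sketch with made-up income figures (the numbers are illustrative, not from the lesson):

```python
# Toy income data illustrating mean vs. median sensitivity to outliers
import numpy as np

incomes = np.array([40_000, 45_000, 50_000, 52_000, 55_000])
print(np.mean(incomes))    # 48400.0
print(np.median(incomes))  # 50000.0

# Add one extreme outlier (a very high earner)
incomes_outlier = np.append(incomes, 1_000_000)
print(np.mean(incomes_outlier))    # 207000.0 -- dragged far upward
print(np.median(incomes_outlier))  # 51000.0  -- barely moves
```

The single outlier quadruples the mean while shifting the median by only 1,000, which is why the median is preferred for skewed data such as incomes.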
Using Pandas for Descriptive Statistics
Pandas provides built-in methods that make computing descriptive statistics on DataFrames and Series straightforward.
Key Detail: ddof parameter
By default, Pandas uses ddof=1 (sample standard deviation/variance), while NumPy uses ddof=0 (population). This is a common exam question. You can specify df['col'].std(ddof=0) to match NumPy's default behavior.
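A short sketch of the ddof difference, using an arbitrary eight-value Series (the data is illustrative):

```python
# Pandas default (ddof=1, sample) vs. NumPy default (ddof=0, population)
import pandas as pd
import numpy as np

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])
print(s.mean())           # 5.0
print(s.std())            # sample std (ddof=1) ~ 2.138
print(s.std(ddof=0))      # population std -> 2.0
print(np.std(s.values))   # NumPy default ddof=0 -> 2.0
```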
Using NumPy for Descriptive Statistics
NumPy provides efficient functions for computing statistics on arrays. It is especially useful for large numerical datasets.
# Descriptive statistics with NumPy
import numpy as np
salaries = np.array([55000, 72000, 95000, 60000, 58000,
                     110000, 78000, 85000, 48000, 67000])
print("Mean:", np.mean(salaries))               # 72800.0
print("Median:", np.median(salaries))           # 69500.0
print("Std (pop):", np.std(salaries))           # 18443.43 (ddof=0)
print("Std (samp):", np.std(salaries, ddof=1))  # 19441.08
print("Var (pop):", np.var(salaries))           # 340160000.0 (ddof=0)
print("Var (samp):", np.var(salaries, ddof=1))  # 377955555.56
# Percentiles and quantiles
print("25th percentile:", np.percentile(salaries, 25))  # 58500.0
print("75th percentile:", np.percentile(salaries, 75))  # 83250.0
print("IQR:", np.percentile(salaries, 75) - np.percentile(salaries, 25))  # 24750.0
# NumPy does NOT have a built-in mode function
# Use scipy.stats.mode() or pandas .mode() instead
from scipy import stats
mode_result = stats.mode(salaries, keepdims=True)
print("Mode:", mode_result.mode[0])  # 48000 (every value occurs once, so SciPy returns the smallest)
Practical Application: Trends, Anomalies, and Decisions
Descriptive statistics serve three critical purposes in data analysis:
1. Identifying Trends
Comparing means, medians, and distributions across groups or time periods reveals patterns. For example, comparing average sales by quarter identifies seasonal trends.
2. Detecting Anomalies
Values that fall outside expected ranges (e.g., more than 2-3 standard deviations from the mean) signal potential data quality issues or genuinely unusual observations.
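As a sketch of this rule of thumb, the following flags any value more than 2 standard deviations from the mean (the data and the 2-sigma cutoff are illustrative choices):

```python
# Flag values more than 2 standard deviations from the mean
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 45, 12, 10, 11])
mean, std = data.mean(), data.std()
z_scores = (data - mean) / std          # standardized distance from the mean
anomalies = data[np.abs(z_scores) > 2]  # values beyond the 2-sigma cutoff
print(anomalies)  # [45] -- the only value that stands out
```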
3. Informing Decisions
Summary statistics help stakeholders understand data at a glance without needing to examine every record. A manager can review average performance scores, salary distributions, and score variability to allocate bonuses.
Common Exam Pitfall:
Remember that df.describe() only includes numeric columns by default. To include non-numeric columns, use df.describe(include='all'). It shows count, unique, top, and freq for object columns.
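A minimal sketch of the difference, using a made-up two-column DataFrame:

```python
# df.describe() default vs. include='all' on a mixed-type DataFrame
import pandas as pd

df = pd.DataFrame({
    'score': [85, 90, 78, 92],
    'team':  ['A', 'B', 'A', 'A'],
})
print(df.describe())               # numeric column 'score' only
print(df.describe(include='all'))  # adds count/unique/top/freq for 'team'
```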
4.2.2 Importance of Test Datasets
In machine learning, evaluating a model on the same data it was trained on gives a misleadingly optimistic view of its performance. Test datasets are held-out portions of data that the model has never seen, used exclusively to assess generalization ability.
Why We Split Data
| Concept | Purpose |
|---|---|
| Training Set | Used to train (fit) the model; the model learns patterns from this data |
| Test Set | Used to evaluate performance; measures how well the model generalizes to unseen data |
| Validation Set (optional) | Used during training to tune hyperparameters; prevents overfitting to the test set |
A typical split ratio is 80% training / 20% testing, though 70/30 and 75/25 are also common. The key rule is: never use test data during training.
Train/Test Split with sklearn
# Train/Test Split using scikit-learn
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
# Create a sample dataset
np.random.seed(42)
n = 200
X = pd.DataFrame({
    'feature_1': np.random.randn(n),
    'feature_2': np.random.randn(n),
    'feature_3': np.random.uniform(0, 10, n)
})
y = (X['feature_1'] * 2 + X['feature_2'] + np.random.randn(n) * 0.5) > 0
y = y.astype(int)
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # reproducibility
    stratify=y        # maintain class proportions
)
print(f"Training set size: {len(X_train)}") # 160
print(f"Test set size: {len(X_test)}") # 40
print(f"Training class distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Test class distribution:\n{y_test.value_counts(normalize=True)}")
Key Parameters of train_test_split():
test_size — Fraction (0.0 to 1.0) or absolute number of test samples
random_state — Seed for reproducibility; ensures the same split every time
stratify — Ensures the train and test sets have the same proportion of each class (critical for imbalanced datasets)
shuffle — Whether to shuffle data before splitting (default: True)
Data Leakage and Best Practices
Data leakage occurs when information from outside the training dataset is used to create the model. This leads to unrealistically high performance during evaluation but poor results in production.
Common Sources of Data Leakage
Target leakage: Using features that are derived from or directly correlated with the target variable (e.g., using "total_charges" to predict "is_subscribed" when charges only exist for subscribers)
Train-test contamination: Normalizing or imputing missing values on the full dataset before splitting — test statistics leak into training
Temporal leakage: Using future data to predict the past (e.g., using Friday sales data to predict Thursday demand)
Best Practices for Unbiased Evaluation
Split first, then preprocess. Fit scalers and imputers on training data only, then transform test data.
Use stratification for classification tasks to maintain class balance.
Set a random_state for reproducible results.
Never peek at test data during feature engineering or model selection.
Use cross-validation for more reliable estimates of performance.
# CORRECT: Fit scaler on training data, transform both
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform only!
# WRONG: This causes data leakage!
# scaler.fit_transform(X) # fitting on ALL data (train + test)
Data Leakage Warning:
Always apply fit_transform() on the training set and transform() on the test set. Fitting on the entire dataset leaks test set statistics into the training process.
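One way to make this rule automatic is to wrap the scaler and the model in a scikit-learn Pipeline: during cross-validation the scaler is then re-fit on the training folds only, so test-fold statistics never leak in. A sketch with synthetic data (the dataset and model choice are illustrative):

```python
# Pipeline re-fits preprocessing inside every CV fold -- no leakage
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scaler + model travel together; cross_val_score fits the whole
# pipeline on each training fold and scores on the held-out fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores.mean())
```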
4.2.3 Supervised Learning Algorithms and Model Accuracy
Supervised learning uses labeled data (input-output pairs) to train models that can predict outcomes for new, unseen data. There are two main categories:
| Type | Target Variable | Goal | Example Algorithms |
|---|---|---|---|
| Classification | Categorical (discrete) | Predict a class label | Logistic Regression, Decision Trees, SVM, KNN, Random Forest |
| Regression | Continuous (numeric) | Predict a numerical value | Linear Regression, Polynomial Regression, Ridge, Lasso |
Characteristics of Common Algorithms
| Algorithm | Type | Key Characteristics |
|---|---|---|
| Linear Regression | Regression | Assumes a linear relationship; simple, interpretable; prone to underfitting with complex data |
| Logistic Regression | Classification | Outputs probabilities via the sigmoid; works for binary/multi-class; assumes a linear decision boundary |
| Decision Trees | Both | Splits data recursively; highly interpretable; prone to overfitting without pruning |
| Random Forest | Both | Ensemble of decision trees; reduces overfitting; less interpretable than a single tree |
| K-Nearest Neighbors (KNN) | Both | Instance-based; no explicit training phase; sensitive to feature scaling and the choice of k |
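KNN's sensitivity to feature scaling can be sketched with synthetic data in which an uninformative feature has a much larger scale than the informative one (all names and values here are illustrative):

```python
# KNN with vs. without feature scaling
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
informative = rng.normal(size=n)                # carries the class signal
noise_large = rng.normal(scale=1000.0, size=n)  # pure noise on a huge scale
X = np.column_stack([informative, noise_large])
y = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)
# Unscaled: distances are dominated by the large-scale noise feature
print("unscaled accuracy:", raw.score(X_te, y_te))
print("scaled accuracy:  ", scaled.score(X_te, y_te))
```

With raw features the distance metric is dominated by the noisy feature, so accuracy sits near chance; after standardization the informative feature contributes equally and accuracy recovers.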
Overfitting, Underfitting, and the Bias-Variance Tradeoff
Overfitting
Overfitting occurs when a model is too complex and learns the noise in the training data rather than the underlying pattern. It performs exceptionally well on training data but poorly on unseen test data.
High training accuracy, low test accuracy
The model has memorized the training examples
Too many parameters relative to the number of data points
Example: A high-degree polynomial that passes through every training point
Underfitting
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It performs poorly on both training and test data.
Low training accuracy, low test accuracy
The model fails to learn the relationship between features and target
Too few parameters or overly restrictive assumptions
Example: Fitting a straight line to clearly curved data
The Bias-Variance Tradeoff
Model error can be decomposed into three components: bias, variance, and irreducible error (noise). The goal is to find the sweet spot that minimizes total error.
| Component | Definition | Associated With |
|---|---|---|
| Bias | Error from overly simplistic assumptions; the model consistently misses the true pattern | Underfitting (too simple) |
| Variance | Error from sensitivity to small fluctuations in the training data; predictions vary widely | Overfitting (too complex) |
| Irreducible Error | Noise inherent in the data; cannot be reduced by any model | The data itself |
How It Connects:
Linear regression tends toward high bias / low variance (underfitting) — it assumes a linear relationship regardless of true data patterns.
High-degree polynomial regression tends toward low bias / high variance (overfitting) — it can fit training data perfectly but wildly fluctuates on new data.
The optimal model complexity lies where total error (bias + variance) is minimized.
Strategies to Prevent Overfitting
Cross-validation: Evaluate model on multiple folds of data to get a robust performance estimate
Regularization: Add penalty terms (L1/Lasso, L2/Ridge) that constrain model complexity
Simplify the model: Use fewer features or lower polynomial degree
More training data: Larger datasets reduce the chance of memorizing noise
Early stopping: Stop training when validation performance starts decreasing
Pruning: Remove branches from decision trees that add little predictive power
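As a sketch of the regularization strategy, the following compares an unregularized degree-12 polynomial fit with a Ridge (L2) fit on the same noisy sine data; the degree, alpha=1.0, and dataset are arbitrary illustrative choices. Ridge shrinks the coefficient norm, which is the mechanism that constrains model complexity:

```python
# L2 regularization (Ridge) shrinking high-degree polynomial coefficients
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

np.random.seed(0)
X = np.sort(np.random.uniform(0, 6, 30)).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(30) * 0.2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "plain": make_pipeline(PolynomialFeatures(12), StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(PolynomialFeatures(12), StandardScaler(), Ridge(alpha=1.0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    coefs = model[-1].coef_  # coefficients of the final estimator
    print(f"{name}: test R2 = {r2_score(y_te, model.predict(X_te)):.3f}, "
          f"coef norm = {np.linalg.norm(coefs):.1f}")
```

The penalized fit has a much smaller coefficient norm; the unregularized fit typically generalizes worse here because its coefficients blow up on the collinear polynomial features.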
Linear Regression: Fitting and Evaluation
# Linear Regression with scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Generate sample regression data
np.random.seed(42)
X = np.random.rand(100, 1) * 10 # Feature: values 0-10
y = 2.5 * X.flatten() + 5 + np.random.randn(100) * 2 # y = 2.5x + 5 + noise
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# Model parameters
print(f"Coefficient (slope): {model.coef_[0]:.4f}") # ~2.5
print(f"Intercept: {model.intercept_:.4f}") # ~5.0
# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Evaluation metrics
print(f"\nTraining R-squared: {r2_score(y_train, y_train_pred):.4f}")
print(f"Test R-squared: {r2_score(y_test, y_test_pred):.4f}")
print(f"Training MSE: {mean_squared_error(y_train, y_train_pred):.4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.4f}")
# R-squared interpretation:
# - 1.0 = perfect predictions
# - 0.0 = model no better than predicting the mean
# - Negative = model worse than predicting the mean
Logistic Regression: Classification
# Logistic Regression for binary classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np
# Load the Iris dataset (simplified to binary: setosa vs not-setosa)
iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int) # 1 = setosa, 0 = not setosa
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Create and fit logistic regression
log_model = LogisticRegression(random_state=42, max_iter=200)
log_model.fit(X_train, y_train)
# Predictions
y_pred = log_model.predict(X_test)
y_prob = log_model.predict_proba(X_test) # probability estimates
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Predicted 0 Predicted 1
# Actual 0 [ TN FP ]
# Actual 1 [ FN TP ]
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 1.00 1.00 1.00 20
# 1 1.00 1.00 1.00 10
# accuracy 1.00 30
Confusion Matrix Explained:
True Positive (TP): Correctly predicted positive
True Negative (TN): Correctly predicted negative
False Positive (FP): Incorrectly predicted positive (Type I error)
False Negative (FN): Incorrectly predicted negative (Type II error)
Demonstrating Overfitting with Polynomial Regression
# Demonstrating overfitting vs underfitting with polynomial regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
from sklearn.model_selection import train_test_split
# Generate non-linear data: y = sin(x) + noise
np.random.seed(42)
X = np.sort(np.random.uniform(0, 6, 30)).reshape(-1, 1)
y = np.sin(X).flatten() + np.random.randn(30) * 0.2
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Compare different complexities
degrees = [1, 4, 15] # underfitting, good fit, overfitting
for degree in degrees:
    # Create polynomial regression pipeline
    model = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    model.fit(X_train, y_train)
    # Evaluate on both sets
    train_score = r2_score(y_train, model.predict(X_train))
    test_score = r2_score(y_test, model.predict(X_test))
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"\nDegree {degree}:")
    print(f"  Train R2: {train_score:.4f} | Test R2: {test_score:.4f}")
    print(f"  Train MSE: {train_mse:.4f} | Test MSE: {test_mse:.4f}")
# Expected output pattern:
# Degree 1: Low train R2, Low test R2 -> UNDERFITTING
# Degree 4: Good train R2, Good test R2 -> GOOD FIT
# Degree 15: ~1.0 train R2, Low test R2 -> OVERFITTING
Key Takeaway:
When training R² is much higher than test R², the model is overfitting. When both are low, the model is underfitting. The best model has similar (and high) scores on both sets.
Cross-Validation
Cross-validation provides a more reliable estimate of model performance by training and evaluating the model on multiple different subsets of the data.
# K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 3)
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5
# Create model
model = LinearRegression()
# 5-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='r2')
print("Fold R2 scores:", scores)
print(f"Mean R2: {scores.mean():.4f}")
print(f"Std R2: {scores.std():.4f}")
# Cross-validation for classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
iris = load_iris()
log_model = LogisticRegression(max_iter=200)
# 5-fold cross-validation with accuracy scoring
cv_scores = cross_val_score(log_model, iris.data, iris.target,
                            cv=5, scoring='accuracy')
print(f"\nLogistic Regression CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
# How K-Fold works:
# Fold 1: [Test] [Train] [Train] [Train] [Train]
# Fold 2: [Train] [Test] [Train] [Train] [Train]
# Fold 3: [Train] [Train] [Test] [Train] [Train]
# Fold 4: [Train] [Train] [Train] [Test] [Train]
# Fold 5: [Train] [Train] [Train] [Train] [Test]
# Final score = average of all 5 fold scores
Model Accuracy Metrics
Classification Metrics
| Metric | Formula | Best For |
|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced datasets |
| Precision | TP / (TP + FP) | When false positives are costly (e.g., spam detection) |
| Recall (Sensitivity) | TP / (TP + FN) | When false negatives are costly (e.g., disease screening) |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Imbalanced datasets; harmonic mean of precision and recall |
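These formulas can be verified on a tiny hand-made example (the labels below are invented for illustration):

```python
# Computing classification metrics on hand-countable labels
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
# Counting by hand: TP=2, FN=2, FP=1, TN=3
print(accuracy_score(y_true, y_pred))   # (2+3)/8 = 0.625
print(precision_score(y_true, y_pred))  # 2/(2+1) ~ 0.667
print(recall_score(y_true, y_pred))     # 2/(2+2) = 0.5
print(f1_score(y_true, y_pred))         # 2*(2/3)*(1/2)/((2/3)+(1/2)) = 4/7 ~ 0.571
```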
Regression Metrics
| Metric | What It Measures | Ideal Value |
|---|---|---|
| R-squared (R²) | Proportion of variance in y explained by the model | 1.0 (closer to 1 is better) |
| MSE (Mean Squared Error) | Average of squared prediction errors; penalizes large errors | 0.0 (lower is better) |
| RMSE | Square root of MSE; in the same units as the target variable | 0.0 (lower is better) |
| MAE (Mean Absolute Error) | Average absolute difference between predicted and actual values | 0.0 (lower is better) |
Choosing the Right Metric
Use precision when the cost of false positives is high (e.g., spam filter marking important emails as spam).
Use recall when the cost of false negatives is high (e.g., failing to detect a disease).
Use F1-score when you need a balance between precision and recall, especially with imbalanced data.
Use R-squared and MSE for regression problems.
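The regression metrics can likewise be checked on a tiny hand-made example (values invented for illustration):

```python
# Regression metrics on hand-countable values
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

mse = mean_squared_error(y_true, y_pred)
print(mse)                                  # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
print(np.sqrt(mse))                         # RMSE ~ 0.612, in the units of y
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
print(r2_score(y_true, y_pred))             # 1 - 1.5/20 = 0.925
```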
Practice Quiz: Statistical Methods and Machine Learning
Test your understanding with these 10 questions; the correct answer and a brief explanation follow each one.
Q1: What does the df.describe() method return by default?
A) Summary statistics for all columns including object types
B) Summary statistics (count, mean, std, min, 25%, 50%, 75%, max) for numeric columns only
C) Only the mean and standard deviation of all columns
D) A correlation matrix of all numeric columns
Correct: B. By default, df.describe() computes count, mean, std, min, 25th percentile, 50th percentile (median), 75th percentile, and max for numeric columns only. Use df.describe(include='all') to include non-numeric columns.
Q2: By default, np.std() computes the population standard deviation (ddof=0), while df.std() in Pandas computes the sample standard deviation (ddof=1). What is the practical difference?
A) There is no practical difference; both produce the same value
B) The sample standard deviation (ddof=1) divides by (n-1) and is slightly larger, providing an unbiased estimate for sample data
C) The population standard deviation (ddof=0) is always larger because it divides by n
D) Pandas always gives incorrect results for standard deviation
Correct: B. Dividing by (n-1) instead of n corrects for the bias in estimating the population variance from a sample. This is called Bessel's correction. The sample std (ddof=1) is slightly larger than the population std (ddof=0). Both NumPy and Pandas can use either, but their defaults differ.
Q3: Why is it critical to split data into training and testing sets before model evaluation?
A) To reduce the size of the dataset and speed up training
B) To ensure the model trains on only easy examples
C) To evaluate how well the model generalizes to unseen data, preventing overly optimistic performance estimates
D) Because scikit-learn requires the data to be split before training
Correct: C. Evaluating on the same data used for training gives misleadingly high performance metrics because the model has already memorized the training examples. A separate test set provides an honest estimate of how the model performs on new, unseen data.
Q4: What is data leakage?
A) When the dataset is too small to train a model
B) When information from outside the training set is used during model training, leading to unrealistically high performance
C) When a model underfits the data due to insufficient features
D) When training data is stored in an insecure database
Correct: B. Data leakage occurs when test set information bleeds into the training process. A common example is fitting a scaler on the entire dataset (train + test) before splitting, which allows the model to learn from test data statistics. Always split first, then preprocess.
Q5: A model achieves 99% accuracy on the training set but only 55% on the test set. This is most likely a case of:
A) Underfitting
B) Overfitting
C) A good generalization
D) Data leakage
Correct: B. A large gap between training accuracy (99%) and test accuracy (55%) is the hallmark of overfitting. The model has memorized the training data (including its noise) but cannot generalize to new data. The model is too complex for the amount of training data.
Q6: In the bias-variance tradeoff, a simple linear regression model applied to highly non-linear data tends to have:
A) High bias and low variance
B) Low bias and high variance
C) Low bias and low variance
D) High bias and high variance
Correct: A. Linear regression makes strong assumptions about linearity. When applied to non-linear data, it consistently underfits (high bias) because it cannot capture the true pattern. However, its predictions are stable across different training sets (low variance).
Q7: What is the purpose of the stratify parameter in train_test_split()?
A) To sort the data before splitting
B) To increase the size of the test set
C) To ensure both training and test sets have the same proportion of each class
D) To apply feature scaling automatically
Correct: C. The stratify parameter ensures that the class distribution in the original dataset is preserved in both the training and test splits. This is especially important for imbalanced datasets where one class is underrepresented.
Q8: Which metric is most appropriate when the cost of false negatives is very high (e.g., failing to diagnose a serious disease)?
A) Accuracy
B) Precision
C) Recall
D) R-squared
Correct: C. Recall = TP / (TP + FN) measures the proportion of actual positives that are correctly identified. When false negatives are costly (missing a disease diagnosis), maximizing recall ensures that as many true positive cases as possible are detected, even at the expense of more false positives.
Q9: In 5-fold cross-validation, how many times is the model trained and evaluated?
A) 1 time
B) 5 times, each using a different 20% of the data as the test set
C) 5 times, each using 50% of the data for training
D) 25 times, testing every combination of folds
Correct: B. In 5-fold cross-validation, the data is divided into 5 equal parts. The model is trained 5 times, each time using 4 folds (80%) for training and 1 fold (20%) for testing. Each fold serves as the test set exactly once. The final performance metric is the average across all 5 folds.
Q10: Which of the following strategies helps prevent overfitting?
A) Increasing model complexity by adding more polynomial features
B) Training the model for more iterations without a stopping criterion
C) Applying regularization (e.g., L1 or L2 penalty) and using cross-validation
D) Removing the test set and training on all available data
Correct: C. Regularization (L1/Lasso or L2/Ridge) adds a penalty term that discourages overly complex models by constraining large coefficients. Combined with cross-validation for robust performance estimation, these are the primary strategies to combat overfitting. Options A, B, and D would all increase or mask overfitting.