Measures of central tendency describe the center or typical value of a dataset. The three primary measures are mean, median, and mode. Choosing the right measure depends on the data's distribution and the presence of outliers.
Mean (Arithmetic Average)
The mean is the sum of all values divided by the number of values. It is the most commonly used measure of central tendency.
# Formula: Mean = (sum of all values) / (number of values)
# x̄ = (x₁ + x₂ + ... + xₙ) / n = Σxᵢ / n
When to use: Best for symmetric distributions without extreme outliers.
Sensitivity: Highly sensitive to outliers — a single extreme value can pull the mean significantly.
Example: The mean of [10, 20, 30, 40, 100] is 200/5 = 40. The value 100 pulls the mean above most values.
Median
The median is the middle value when data is sorted in ascending order. If the dataset has an even number of values, the median is the average of the two middle values.
# For odd n: Median = value at position (n+1)/2
# For even n: Median = average of values at positions n/2 and n/2 + 1
When to use: Preferred for skewed distributions or data with outliers.
Sensitivity: Robust to outliers — not affected by extreme values.
Example: The median of [10, 20, 30, 40, 100] is 30. The outlier 100 does not influence it.
Mode
The mode is the value that appears most frequently in a dataset. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes).
When to use: Best for categorical data or identifying the most common value.
Sensitivity: Not affected by outliers. Can be used with non-numeric data.
Example: The mode of [1, 2, 2, 3, 4, 4, 4, 5] is 4 (appears 3 times).
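The three measures can be computed directly with Python's built-in statistics module, using the example datasets above:

```python
import statistics

values = [10, 20, 30, 40, 100]
print(statistics.mean(values))    # 40 — pulled upward by the outlier 100
print(statistics.median(values))  # 30 — unaffected by the outlier

freq_data = [1, 2, 2, 3, 4, 4, 4, 5]
print(statistics.mode(freq_data))  # 4 — appears most often (3 times)
```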
Exam Tip
For skewed data, the median is generally preferred over the mean. For categorical data, the mode is the only applicable measure. The mean is best when data is roughly symmetric.
| Measure | Best For | Sensitive to Outliers? | Data Types |
|---------|----------|------------------------|------------|
| Mean | Symmetric data | Yes | Numeric only |
| Median | Skewed data | No | Numeric / Ordinal |
| Mode | Categorical data | No | Any type |
3.1.2 Measures of Spread (Dispersion)
Spread measures describe how much variability exists in a dataset. They tell you how far data values are from the center. Key measures include range, variance, standard deviation, and interquartile range (IQR).
Range
The simplest measure of spread. It is the difference between the maximum and minimum values.
# Range = Max - Min
# Example: Range of [3, 7, 8, 15, 22] = 22 - 3 = 19
Easy to calculate but extremely sensitive to outliers.
Uses only two data points, so it ignores the rest of the distribution.
Variance
Variance measures the average squared deviation of each value from the mean. There are two forms:
Population variance: σ² = Σ(xᵢ - μ)² / N — used when the data covers the entire population.
Sample variance: s² = Σ(xᵢ - x̄)² / (n - 1) — used when the data is a sample; dividing by n - 1 applies Bessel's correction.
Standard Deviation
The standard deviation is the square root of the variance. Because it is expressed in the same units as the data, it is easier to interpret than the variance.
Interquartile Range (IQR)
The IQR is the difference between the third and first quartiles: IQR = Q3 - Q1. It spans the middle 50% of the data.
Robust to outliers — only considers the middle 50% of data.
Used in boxplots and for outlier detection (the 1.5*IQR rule).
Key Concept: Bessel's Correction
When computing sample variance, we divide by n - 1 instead of n. This corrects the bias that arises because we are using the sample mean (an estimate) rather than the true population mean. NumPy's np.var() defaults to population variance (ddof=0), while np.std() also defaults to ddof=0. Use ddof=1 for sample statistics.
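The ddof behavior can be verified directly in NumPy (the dataset here is a hypothetical example):

```python
import numpy as np

data = [10, 20, 30, 40, 100]  # mean = 40, sum of squared deviations = 5000
pop_var = np.var(data)            # population variance: 5000 / 5  = 1000.0
samp_var = np.var(data, ddof=1)   # sample variance:     5000 / 4  = 1250.0
print(pop_var, samp_var)  # 1000.0 1250.0
```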
| Measure | Formula | Robust to Outliers? | Units |
|---------|---------|---------------------|-------|
| Range | Max - Min | No | Same as data |
| Variance | Σ(xᵢ - x̄)² / (n-1) | No | Squared |
| Std Deviation | √(Variance) | No | Same as data |
| IQR | Q3 - Q1 | Yes | Same as data |
3.1.3 Statistical Distributions
A statistical distribution describes how data values are spread across the possible range. Understanding distributions is fundamental to choosing the right statistical methods.
Gaussian (Normal) Distribution
The most important distribution in statistics. It is symmetric, bell-shaped, and characterized by its mean (μ) and standard deviation (σ).
Shape: Symmetric, bell-shaped curve centered on the mean.
Properties: Mean = Median = Mode. Tails extend infinitely but approach zero.
68-95-99.7 Rule: 68% of data falls within ±1σ, 95% within ±2σ, 99.7% within ±3σ.
Examples: Heights, weights, test scores, measurement errors often approximate a normal distribution.
Standard Normal: A special case where μ = 0 and σ = 1 (Z-distribution).
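The 68-95-99.7 rule can be checked against the standard normal CDF in SciPy:

```python
from scipy.stats import norm

# Probability mass within ±kσ of the mean of a standard normal distribution
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k}σ: {p:.4f}")
# within ±1σ: 0.6827
# within ±2σ: 0.9545
# within ±3σ: 0.9973
```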
Uniform Distribution
All values in a given range have equal probability. The distribution is flat (rectangular).
Shape: Flat, rectangular — all outcomes are equally likely.
Properties: Mean = (a + b) / 2, where a and b are the bounds. Variance = (b - a)² / 12.
Examples: Rolling a fair die, random number generators, selecting a random time within an interval.
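The mean and variance formulas can be checked with scipy.stats.uniform; note that SciPy parameterizes the distribution by loc = a and scale = b - a (the bounds below are example values):

```python
from scipy.stats import uniform

a, b = 2, 10
dist = uniform(loc=a, scale=b - a)
print(dist.mean())  # (a + b) / 2 = 6.0
print(dist.var())   # (b - a)² / 12 = 64 / 12 ≈ 5.333
```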
Exam Tip
The normal distribution is the foundation for many statistical tests and confidence intervals. If the exam asks about "bell-shaped" or "Gaussian" data, it is referring to the normal distribution. Many real-world datasets approximate it due to the Central Limit Theorem.
3.1.4 Interpreting Distributions in Context
Distributions can be analyzed from different perspectives depending on the number of variables and the nature of the analysis.
Univariate Analysis
Examines a single variable at a time. Focuses on describing the distribution of one variable using histograms, box plots, and summary statistics (mean, median, standard deviation).
Ask: What is the shape? Is it symmetric or skewed? Are there outliers?
Tools: Histograms, density plots, box plots.
Bivariate Analysis
Examines the relationship between two variables. Helps identify associations, correlations, or dependencies.
Ask: Is there a linear relationship? How strong is the correlation?
Tools: Scatterplots, correlation coefficients (e.g., Pearson's r).
Time Series Analysis
When data is collected over time (time series), the distribution may shift, trend, or exhibit seasonality. Line plots and rolling statistics help track how central tendency and spread change.
Trends: Upward or downward shifts in the distribution center over time.
Seasonality: Repeating patterns at regular intervals.
Tools: Line plots, rolling mean/standard deviation.
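A rolling mean can be computed with pandas; this sketch uses a small hypothetical series with an upward trend:

```python
import pandas as pd

# Hypothetical monthly values with an upward trend
s = pd.Series([10, 12, 11, 14, 15, 17, 16, 19, 21, 22])
rolling_mean = s.rolling(window=3).mean()  # 3-period rolling mean smooths noise
print(rolling_mean.tolist())  # first 2 entries are NaN (incomplete window)
```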
3.1.5 Confidence Measures and Data Reliability
Confidence measures quantify how certain we can be about our statistical estimates. They provide a range within which the true population parameter is likely to fall.
Standard Error of the Mean (SEM)
The standard error measures how much the sample mean is expected to vary from the true population mean.
# SEM = s / √n
# where s = sample standard deviation, n = sample size
As sample size increases, SEM decreases — larger samples produce more reliable estimates.
SEM is always smaller than the standard deviation (it divides by √n).
Confidence Intervals
A confidence interval provides a range of values that is likely to contain the population parameter with a specified level of confidence (commonly 95%).
# 95% Confidence Interval = x̄ ± 1.96 * SEM
# (For large samples using the Z-distribution)
A 95% confidence interval means: if we repeated the sampling process many times, about 95% of the intervals would contain the true population parameter.
Wider intervals indicate more uncertainty; narrower intervals indicate more precision.
Increasing sample size narrows the confidence interval.
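Putting SEM and the 95% interval together (the sample below is hypothetical):

```python
import numpy as np

data = np.array([12, 15, 14, 10, 13, 18, 16, 14, 12, 15])
n = len(data)
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)        # sample std with Bessel's correction
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```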
Common Misconception
A 95% confidence interval does NOT mean there is a 95% probability that the true parameter lies within this specific interval. It means the procedure used to build the interval captures the true parameter 95% of the time over many samples.
3.1.6 Skewness and Kurtosis
Skewness and kurtosis describe the shape of a distribution beyond just center and spread.
Skewness
Skewness measures the asymmetry of a distribution.
Positive skew (right-skewed): The right tail is longer. Mean > Median > Mode. Example: income distribution — most people earn moderate amounts, but a few earn very high amounts.
Negative skew (left-skewed): The left tail is longer. Mean < Median < Mode. Example: exam scores on an easy test — most students score high, but a few score very low.
Zero skew: The distribution is symmetric. Mean = Median = Mode.
Kurtosis
Kurtosis measures the "tailedness" of a distribution — how heavy or light the tails are relative to a normal distribution.
Mesokurtic (kurtosis ≈ 3): Similar tails to a normal distribution. Excess kurtosis ≈ 0.
Leptokurtic (kurtosis > 3): Heavier tails and sharper peak. More extreme outliers. Excess kurtosis > 0.
Platykurtic (kurtosis < 3): Lighter tails and flatter peak. Fewer extreme outliers. Excess kurtosis < 0.
Many software packages (including scipy.stats.kurtosis()) report excess kurtosis, which subtracts 3 from the raw kurtosis so that a normal distribution has excess kurtosis of 0.
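Both measures are available in SciPy; this sketch compares simulated symmetric and right-skewed samples:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
normal_data = rng.normal(size=100_000)       # symmetric, bell-shaped
right_skewed = rng.exponential(size=100_000)  # long right tail

print(skew(normal_data))    # ≈ 0 for symmetric data
print(skew(right_skewed))   # > 0: positive (right) skew
print(kurtosis(normal_data))  # ≈ 0 excess kurtosis for a normal sample
```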
| Measure | What It Describes | Normal Value | Python Function |
|---------|-------------------|--------------|-----------------|
| Skewness | Asymmetry (left/right tail) | 0 (symmetric) | scipy.stats.skew() |
| Kurtosis | Tail heaviness | 0 (excess kurtosis) | scipy.stats.kurtosis() |
3.1.7 Identifying Outliers
Outliers are data points that differ significantly from other observations. They can result from measurement error, data entry mistakes, or genuine extreme values. Identifying and understanding outliers is critical for reliable analysis.
Visual Methods
Boxplots: Points beyond the whiskers (1.5 * IQR from Q1 or Q3) are flagged as outliers.
Histograms: Isolated bars far from the main body of data suggest outliers.
Scatterplots: Points far from the general cluster or regression line may be outliers.
Statistical Methods
IQR Method (Tukey's Fences)
A value is considered an outlier if it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
# IQR Method for Outlier Detection
import numpy as np

data = [10, 20, 30, 40, 100]  # example dataset
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = [x for x in data if x < lower_bound or x > upper_bound]
Z-Score Method
The Z-score measures how many standard deviations a value is from the mean. Values with |Z| > 2 or |Z| > 3 are commonly considered outliers.
# Z-Score = (x - mean) / standard_deviation
# Typically, |Z| > 3 flags a value as an outlier
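The Z-score method can be vectorized with NumPy; the dataset below is a hypothetical example with one clear outlier, using the |Z| > 2 threshold:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 95])  # 95 is a clear outlier
z_scores = (data - data.mean()) / data.std()       # standardize each value
outliers = data[np.abs(z_scores) > 2]
print(outliers)  # [95]
```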
Important
Before removing outliers, always investigate their cause. Outliers due to data entry errors should be corrected or removed. Genuine extreme values may carry important information and should be handled carefully (e.g., using robust statistics or transformations).
3.1.8 Correlation and Pearson's R
Correlation measures the strength and direction of a linear relationship between two variables. It is one of the most fundamental concepts in data analysis.
Types of Correlation
Positive correlation: As one variable increases, the other tends to increase. Example: height and weight.
Negative correlation: As one variable increases, the other tends to decrease. Example: hours of TV watched and exam scores.
No correlation (zero): No linear relationship between the variables. Example: shoe size and IQ.
Pearson's Correlation Coefficient (r)
Pearson's r quantifies the linear relationship between two continuous variables. Its value ranges from -1 to +1.
# Pearson's r formula:
# r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² * Σ(yᵢ - ȳ)²]
#
# Equivalently: r = Cov(X, Y) / (s_x * s_y)
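In practice, Pearson's r is computed with scipy.stats.pearsonr or np.corrcoef rather than by hand (the x and y values below are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
r, p_value = pearsonr(x, y)   # returns the coefficient and its p-value
print(round(r, 3))            # 0.775 — a moderate-to-strong positive correlation
```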
Interpreting Pearson's r
| r Value | Strength | Direction |
|---------|----------|-----------|
| +0.7 to +1.0 | Strong | Positive |
| +0.4 to +0.7 | Moderate | Positive |
| +0.1 to +0.4 | Weak | Positive |
| -0.1 to +0.1 | Very weak / None | None |
| -0.4 to -0.1 | Weak | Negative |
| -0.7 to -0.4 | Moderate | Negative |
| -1.0 to -0.7 | Strong | Negative |
Correlation Does Not Imply Causation
A strong correlation between two variables does not mean one causes the other. There may be a confounding variable, reverse causation, or pure coincidence. Always consider the context and apply domain knowledge.
Key Properties of Pearson's r
Range: Always between -1 and +1.
r = +1: Perfect positive linear relationship.
r = -1: Perfect negative linear relationship.
r = 0: No linear relationship (but nonlinear relationships may still exist).
Unitless: Not affected by changes in units or scale of the variables.
Symmetric: r(X, Y) = r(Y, X).
Limitation: Only measures linear relationships. A perfectly symmetric nonlinear relationship (e.g., y = x² over a range centered at 0) can yield r close to 0.
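The limitation is easy to demonstrate: a perfect quadratic relationship over a symmetric range produces r ≈ 0 because the positive and negative linear contributions cancel.

```python
import numpy as np

x = np.arange(-5, 6)   # symmetric around 0
y = x ** 2             # perfect (but nonlinear) relationship
r = np.corrcoef(x, y)[0, 1]
print(r)  # ≈ 0 — Pearson's r completely misses the quadratic relationship
```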
3.1.9 Interpreting Common Plot Types
Understanding what each type of plot reveals about data is essential for data analysis. Here is a guide to the most common plot types used in descriptive statistics.
Boxplot (Box-and-Whisker Plot)
Shows: Median, Q1, Q3, IQR, whiskers, and outliers.
Reveals: Data spread, skewness (asymmetric box), presence of outliers (dots beyond whiskers).
Use case: Comparing distributions across categories; identifying outliers.
Reading: The box spans Q1 to Q3. The line inside is the median. Whiskers extend to 1.5*IQR. Points beyond are outliers.
Histogram
Shows: Frequency distribution of a single continuous variable.
Reveals: Shape (normal, skewed, bimodal), center, spread, and gaps in data.
Use case: Understanding the overall distribution of a single variable.
Reading: Each bar represents a range (bin). The height shows how many values fall in that range.
Scatterplot
Shows: The relationship between two continuous variables.
Reveals: Direction (positive/negative), strength, and form (linear/nonlinear) of relationships; outliers.
Use case: Bivariate analysis; checking for correlation before computing Pearson's r.
Reading: Each point is one observation. Patterns (upward, downward, cluster, fan shape) indicate relationships.
Line Plot
Shows: Trends over time or a continuous ordered variable.
Reveals: Trends, seasonality, cycles, and sudden changes in data over time.
Use case: Time series analysis; tracking metrics over time.
Reading: X-axis is time/order, Y-axis is the measured value. Look for upward/downward trends and repeating patterns.
Correlation Heatmap
Shows: Correlation coefficients between all pairs of variables in a matrix.
Reveals: Which pairs of variables are strongly correlated (positive or negative); multicollinearity.
Use case: Multivariate analysis; feature selection; identifying redundant variables.
Reading: Colors represent correlation strength. Dark red/blue = strong correlation. White/light = weak correlation. The diagonal is always 1 (a variable with itself).
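The matrix that a heatmap colors is just a pairwise correlation table, which pandas computes with DataFrame.corr(). The column names and values below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 55, 61, 70, 74, 80],
    "hours_tv":      [6, 5, 5, 3, 2, 1],
})
corr = df.corr()       # pairwise Pearson correlations between all numeric columns
print(corr.round(2))   # diagonal is 1.0; this matrix is what a heatmap colors
```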
Plot Selection Guide
Use histograms for distribution shape, boxplots for comparing groups and finding outliers, scatterplots for two-variable relationships, line plots for time trends, and heatmaps for an overview of all pairwise correlations.
| Plot Type | Variables | Primary Purpose |
|-----------|-----------|-----------------|
| Histogram | 1 continuous | Distribution shape |
| Boxplot | 1 continuous (optionally grouped by 1 categorical) | Comparing groups; outlier detection |
| Scatterplot | 2 continuous | Two-variable relationships |
| Line plot | 1 continuous over time/order | Time trends |
| Correlation heatmap | All numeric pairs | Overview of pairwise correlations |
Practice Questions
Q1. Which measure of central tendency is most appropriate for a dataset with significant outliers?
A) Mean
B) Median
C) Mode
D) Standard deviation
Correct: B) Median. The median is robust to outliers because it depends only on the middle value(s) in the sorted data. The mean is pulled toward extreme values, and mode measures frequency rather than central position. Standard deviation is a measure of spread, not central tendency.
Q2. In NumPy, what does np.var(data) compute by default?
A) Population variance (ddof=0)
B) Sample variance (ddof=1)
C) Standard deviation
D) Interquartile range
Correct: A) Population variance (ddof=0). By default, NumPy's np.var() uses ddof=0, which computes the population variance. To compute the sample variance with Bessel's correction, you must pass ddof=1.
Q3. What does a Pearson's r value of -0.85 indicate?
A) A weak positive linear relationship
B) A strong negative linear relationship
C) No linear relationship
D) A strong positive linear relationship
Correct: B) A strong negative linear relationship. Pearson's r ranges from -1 to +1. A value of -0.85 is close to -1, indicating a strong inverse relationship: as one variable increases, the other tends to decrease.
Q4. According to the 68-95-99.7 rule, approximately what percentage of data falls within 2 standard deviations of the mean in a normal distribution?
A) 68%
B) 95%
C) 99.7%
D) 50%
Correct: B) 95%. The 68-95-99.7 rule states that approximately 68% of data falls within 1 SD, 95% within 2 SDs, and 99.7% within 3 SDs of the mean in a normal distribution.
Q5. Which plot type is best suited for identifying outliers in a dataset?
A) Line plot
B) Boxplot
C) Correlation heatmap
D) Pie chart
Correct: B) Boxplot. Boxplots explicitly display outliers as individual points beyond the whiskers (1.5 * IQR from Q1 and Q3). They provide a clear visual summary of the data spread and make outliers immediately visible.
Q6. A dataset has a positive skew. Which statement is true about the relationship between mean and median?
A) Mean = Median
B) Mean > Median
C) Mean < Median
D) Mean and Median are unrelated to skewness
Correct: B) Mean > Median. In a positively (right) skewed distribution, the right tail is longer. The mean is pulled toward the tail, making it greater than the median. The general relationship is: Mean > Median > Mode for positive skew.
Q7. Using the IQR method, a value is considered an outlier if it falls:
A) More than 1 standard deviation from the mean
B) Beyond the range (min to max)
C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
D) Below Q1 - 3*IQR or above Q3 + 3*IQR
Correct: C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The standard IQR method (Tukey's fences) defines outliers as values falling below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Values beyond 3*IQR are sometimes called "extreme outliers."
Q8. What does the scipy.stats.pearsonr() function return?
A) Only the correlation coefficient
B) The correlation coefficient and the p-value
C) The correlation matrix
D) The covariance and standard deviations
Correct: B) The correlation coefficient and the p-value. scipy.stats.pearsonr(x, y) returns (r, p-value), where r is Pearson's correlation coefficient and the p-value indicates the statistical significance of the correlation. Use np.corrcoef() for the full correlation matrix.
Q9. In a correlation heatmap, a cell colored dark blue between variables A and B indicates:
A) A strong positive correlation between A and B
B) A strong negative correlation between A and B
C) No correlation between A and B
D) A and B are identical variables
Correct: B) A strong negative correlation between A and B. In the commonly used "coolwarm" colormap, dark blue represents values close to -1 (strong negative correlation), dark red represents values close to +1 (strong positive correlation), and white/light colors represent values near 0 (weak or no correlation).
Q10. Which statement about standard deviation and variance is FALSE?
A) Standard deviation is the square root of variance
B) Variance is expressed in the same units as the original data
C) A larger standard deviation indicates more spread in the data
D) Sample variance uses n-1 in the denominator (Bessel's correction)
Correct: B) Variance is expressed in the same units as the original data. This statement is FALSE. Variance is expressed in squared units (e.g., meters squared). It is the standard deviation that is expressed in the same units as the original data, making it easier to interpret.