Measures of central tendency describe the center or typical value of a dataset. The three primary measures are mean, median, and mode. Choosing the right measure depends on the data's distribution and the presence of outliers.
Mean (Arithmetic Average)
The mean is the sum of all values divided by the number of values. It is the most commonly used measure of central tendency.
# Formula: Mean = (sum of all values) / (number of values)
# x̄ = (x₁ + x₂ + ... + xₙ) / n = Σxᵢ / n
When to use: Best for symmetric distributions without extreme outliers.
Sensitivity: Highly sensitive to outliers — a single extreme value can pull the mean significantly.
Example: The mean of [10, 20, 30, 40, 100] is 200/5 = 40. The value 100 pulls the mean above most values.
Median
The median is the middle value when data is sorted in ascending order. If the dataset has an even number of values, the median is the average of the two middle values.
# For odd n: Median = value at position (n+1)/2
# For even n: Median = average of values at positions n/2 and n/2 + 1
When to use: Preferred for skewed distributions or data with outliers.
Sensitivity: Robust to outliers — not affected by extreme values.
Example: The median of [10, 20, 30, 40, 100] is 30. The outlier 100 does not influence it.
Mode
The mode is the value that appears most frequently in a dataset. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (more than two modes).
When to use: Best for categorical data or identifying the most common value.
Sensitivity: Not affected by outliers. Can be used with non-numeric data.
Example: The mode of [1, 2, 2, 3, 4, 4, 4, 5] is 4 (appears 3 times).
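The three measures can be computed directly with Python's built-in statistics module, using the example datasets above:

```python
import statistics

values = [10, 20, 30, 40, 100]
print(statistics.mean(values))    # 40 — pulled upward by the outlier 100
print(statistics.median(values))  # 30 — unaffected by the outlier

freq_data = [1, 2, 2, 3, 4, 4, 4, 5]
print(statistics.mode(freq_data))  # 4 — appears most often (3 times)
```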
Exam Tip
For skewed data, the median is generally preferred over the mean. For categorical data, the mode is the only applicable measure. The mean is best when data is roughly symmetric.
| Measure | Best For | Sensitive to Outliers? | Data Types |
|---------|----------|------------------------|------------|
| Mean | Symmetric data | Yes | Numeric only |
| Median | Skewed data | No | Numeric / Ordinal |
| Mode | Categorical data | No | Any type |
3.1.2 Measures of Spread (Dispersion)
Spread measures describe how much variability exists in a dataset. They tell you how far data values are from the center. Key measures include range, variance, standard deviation, and interquartile range (IQR).
Range
The simplest measure of spread. It is the difference between the maximum and minimum values.
# Range = Max - Min
# Example: Range of [3, 7, 8, 15, 22] = 22 - 3 = 19
Easy to calculate but extremely sensitive to outliers.
Uses only two data points, so it ignores the rest of the distribution.
Variance
Variance measures the average squared deviation of each value from the mean. There are two forms:
Population variance: σ² = Σ(xᵢ - μ)² / N — used when the data covers the entire population.
Sample variance: s² = Σ(xᵢ - x̄)² / (n - 1) — used when the data is a sample; dividing by n - 1 applies Bessel's correction.
Standard Deviation
The standard deviation is the square root of the variance. Because it is expressed in the same units as the data, it is easier to interpret than the variance.
Interquartile Range (IQR)
The IQR is the difference between the third and first quartiles: IQR = Q3 - Q1. It spans the middle 50% of the data.
Robust to outliers — only considers the middle 50% of data.
Used in boxplots and for outlier detection (the 1.5*IQR rule).
Key Concept: Bessel's Correction
When computing sample variance, we divide by n - 1 instead of n. This corrects the bias that arises because we are using the sample mean (an estimate) rather than the true population mean. NumPy's np.var() defaults to population variance (ddof=0), while np.std() also defaults to ddof=0. Use ddof=1 for sample statistics.
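The ddof behavior can be verified directly in NumPy (the dataset here is a hypothetical example):

```python
import numpy as np

data = [10, 20, 30, 40, 100]  # mean = 40, sum of squared deviations = 5000
pop_var = np.var(data)            # population variance: 5000 / 5  = 1000.0
samp_var = np.var(data, ddof=1)   # sample variance:     5000 / 4  = 1250.0
print(pop_var, samp_var)  # 1000.0 1250.0
```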
| Measure | Formula | Robust to Outliers? | Units |
|---------|---------|---------------------|-------|
| Range | Max - Min | No | Same as data |
| Variance | Σ(xᵢ - x̄)² / (n-1) | No | Squared |
| Std Deviation | √(Variance) | No | Same as data |
| IQR | Q3 - Q1 | Yes | Same as data |
3.1.3 Statistical Distributions
A statistical distribution describes how data values are spread across the possible range. Understanding distributions is fundamental to choosing the right statistical methods.
Gaussian (Normal) Distribution
The most important distribution in statistics. It is symmetric, bell-shaped, and characterized by its mean (μ) and standard deviation (σ).
Shape: Symmetric, bell-shaped curve centered on the mean.
Properties: Mean = Median = Mode. Tails extend infinitely but approach zero.
68-95-99.7 Rule: 68% of data falls within ±1σ, 95% within ±2σ, 99.7% within ±3σ.
Examples: Heights, weights, test scores, measurement errors often approximate a normal distribution.
Standard Normal: A special case where μ = 0 and σ = 1 (Z-distribution).
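The 68-95-99.7 rule can be checked against the standard normal CDF in SciPy:

```python
from scipy.stats import norm

# Probability mass within ±kσ of the mean of a standard normal distribution
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k}σ: {p:.4f}")
# within ±1σ: 0.6827
# within ±2σ: 0.9545
# within ±3σ: 0.9973
```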
Uniform Distribution
All values in a given range have equal probability. The distribution is flat (rectangular).
Shape: Flat, rectangular — all outcomes are equally likely.
Properties: Mean = (a + b) / 2, where a and b are the bounds. Variance = (b - a)² / 12.
Examples: Rolling a fair die, random number generators, selecting a random time within an interval.
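The mean and variance formulas can be checked with scipy.stats.uniform; note that SciPy parameterizes the distribution by loc = a and scale = b - a (the bounds below are example values):

```python
from scipy.stats import uniform

a, b = 2, 10
dist = uniform(loc=a, scale=b - a)
print(dist.mean())  # (a + b) / 2 = 6.0
print(dist.var())   # (b - a)² / 12 = 64 / 12 ≈ 5.333
```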
Exam Tip
The normal distribution is the foundation for many statistical tests and confidence intervals. If the exam asks about "bell-shaped" or "Gaussian" data, it is referring to the normal distribution. Many real-world datasets approximate it due to the Central Limit Theorem.
3.1.4 Interpreting Distributions in Context
Distributions can be analyzed from different perspectives depending on the number of variables and the nature of the analysis.
Univariate Analysis
Examines a single variable at a time. Focuses on describing the distribution of one variable using histograms, box plots, and summary statistics (mean, median, standard deviation).
Ask: What is the shape? Is it symmetric or skewed? Are there outliers?
Tools: Histograms, density plots, box plots.
Bivariate Analysis
Examines the relationship between two variables. Helps identify associations, correlations, or dependencies.
Ask: Is there a linear relationship? How strong is the correlation?
Tools: Scatterplots, correlation coefficients (e.g., Pearson's r).
Time Series Analysis
When data is collected over time (time series), the distribution may shift, trend, or exhibit seasonality. Line plots and rolling statistics help track how central tendency and spread change.
Trends: Upward or downward shifts in the distribution center over time.
Seasonality: Repeating patterns at regular intervals.
Tools: Line plots, rolling mean/standard deviation.
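A rolling mean can be computed with pandas; this sketch uses a small hypothetical series with an upward trend:

```python
import pandas as pd

# Hypothetical monthly values with an upward trend
s = pd.Series([10, 12, 11, 14, 15, 17, 16, 19, 21, 22])
rolling_mean = s.rolling(window=3).mean()  # 3-period rolling mean smooths noise
print(rolling_mean.tolist())  # first 2 entries are NaN (incomplete window)
```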
3.1.5 Confidence Measures and Data Reliability
Confidence measures quantify how certain we can be about our statistical estimates. They provide a range within which the true population parameter is likely to fall.
Standard Error of the Mean (SEM)
The standard error measures how much the sample mean is expected to vary from the true population mean.
# SEM = s / √n
# where s = sample standard deviation, n = sample size
As sample size increases, SEM decreases — larger samples produce more reliable estimates.
SEM is always smaller than the standard deviation (it divides by √n).
Confidence Intervals
A confidence interval provides a range of values that is likely to contain the population parameter with a specified level of confidence (commonly 95%).
# 95% Confidence Interval = x̄ ± 1.96 * SEM
# (For large samples using the Z-distribution)
A 95% confidence interval means: if we repeated the sampling process many times, about 95% of the intervals would contain the true population parameter.
Wider intervals indicate more uncertainty; narrower intervals indicate more precision.
Increasing sample size narrows the confidence interval.
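Putting SEM and the 95% interval together (the sample below is hypothetical):

```python
import numpy as np

data = np.array([12, 15, 14, 10, 13, 18, 16, 14, 12, 15])
n = len(data)
mean = data.mean()
sem = data.std(ddof=1) / np.sqrt(n)        # sample std with Bessel's correction
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```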
Common Misconception
A 95% confidence interval does NOT mean there is a 95% probability that the true parameter lies within this specific interval. It means the procedure used to build the interval captures the true parameter 95% of the time over many samples.
3.1.6 Skewness and Kurtosis
Skewness and kurtosis describe the shape of a distribution beyond just center and spread.
Skewness
Skewness measures the asymmetry of a distribution.
Positive skew (right-skewed): The right tail is longer. Mean > Median > Mode. Example: income distribution — most people earn moderate amounts, but a few earn very high amounts.
Negative skew (left-skewed): The left tail is longer. Mean < Median < Mode. Example: exam scores on an easy test — most students score high, but a few score very low.
Zero skew: The distribution is symmetric. Mean = Median = Mode.
Kurtosis
Kurtosis measures the "tailedness" of a distribution — how heavy or light the tails are relative to a normal distribution.
Mesokurtic (kurtosis ≈ 3): Similar tails to a normal distribution. Excess kurtosis ≈ 0.
Leptokurtic (kurtosis > 3): Heavier tails and sharper peak. More extreme outliers. Excess kurtosis > 0.
Platykurtic (kurtosis < 3): Lighter tails and flatter peak. Fewer extreme outliers. Excess kurtosis < 0.
Many software packages (including scipy.stats.kurtosis()) report excess kurtosis, which subtracts 3 from the raw kurtosis so that a normal distribution has excess kurtosis of 0.
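Both measures are available in SciPy; this sketch compares simulated symmetric and right-skewed samples:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
normal_data = rng.normal(size=100_000)       # symmetric, bell-shaped
right_skewed = rng.exponential(size=100_000)  # long right tail

print(skew(normal_data))    # ≈ 0 for symmetric data
print(skew(right_skewed))   # > 0: positive (right) skew
print(kurtosis(normal_data))  # ≈ 0 excess kurtosis for a normal sample
```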
| Measure | What It Describes | Normal Value | Python Function |
|---------|-------------------|--------------|-----------------|
| Skewness | Asymmetry (left/right tail) | 0 (symmetric) | scipy.stats.skew() |
| Kurtosis | Tail heaviness | 0 (excess kurtosis) | scipy.stats.kurtosis() |
3.1.7 Identifying Outliers
Outliers are data points that differ significantly from other observations. They can result from measurement error, data entry mistakes, or genuine extreme values. Identifying and understanding outliers is critical for reliable analysis.
Visual Methods
Boxplots: Points beyond the whiskers (1.5 * IQR from Q1 or Q3) are flagged as outliers.
Histograms: Isolated bars far from the main body of data suggest outliers.
Scatterplots: Points far from the general cluster or regression line may be outliers.
Statistical Methods
IQR Method (Tukey's Fences)
A value is considered an outlier if it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
# IQR Method for Outlier Detection
import numpy as np

data = [10, 20, 30, 40, 100]  # example dataset
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = [x for x in data if x < lower_bound or x > upper_bound]
Z-Score Method
The Z-score measures how many standard deviations a value is from the mean. Values with |Z| > 2 or |Z| > 3 are commonly considered outliers.
# Z-Score = (x - mean) / standard_deviation
# Typically, |Z| > 3 flags a value as an outlier
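The Z-score method can be vectorized with NumPy; the dataset below is a hypothetical example with one clear outlier, using the |Z| > 2 threshold:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 14, 95])  # 95 is a clear outlier
z_scores = (data - data.mean()) / data.std()       # standardize each value
outliers = data[np.abs(z_scores) > 2]
print(outliers)  # [95]
```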
Important
Before removing outliers, always investigate their cause. Outliers due to data entry errors should be corrected or removed. Genuine extreme values may carry important information and should be handled carefully (e.g., using robust statistics or transformations).
3.1.8 Correlation and Pearson's R
Correlation measures the strength and direction of a linear relationship between two variables. It is one of the most fundamental concepts in data analysis.
Types of Correlation
Positive correlation: As one variable increases, the other tends to increase. Example: height and weight.
Negative correlation: As one variable increases, the other tends to decrease. Example: hours of TV watched and exam scores.
No correlation (zero): No linear relationship between the variables. Example: shoe size and IQ.
Pearson's Correlation Coefficient (r)
Pearson's r quantifies the linear relationship between two continuous variables. Its value ranges from -1 to +1.
# Pearson's r formula:
# r = Σ[(xᵢ - x̄)(yᵢ - ȳ)] / √[Σ(xᵢ - x̄)² * Σ(yᵢ - ȳ)²]
#
# Equivalently: r = Cov(X, Y) / (s_x * s_y)
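In practice, Pearson's r is computed with scipy.stats.pearsonr or np.corrcoef rather than by hand (the x and y values below are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
r, p_value = pearsonr(x, y)   # returns the coefficient and its p-value
print(round(r, 3))            # 0.775 — a moderate-to-strong positive correlation
```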
Interpreting Pearson's r
| r Value | Strength | Direction |
|---------|----------|-----------|
| +0.7 to +1.0 | Strong | Positive |
| +0.4 to +0.7 | Moderate | Positive |
| +0.1 to +0.4 | Weak | Positive |
| -0.1 to +0.1 | Very weak / None | None |
| -0.4 to -0.1 | Weak | Negative |
| -0.7 to -0.4 | Moderate | Negative |
| -1.0 to -0.7 | Strong | Negative |
Correlation Does Not Imply Causation
A strong correlation between two variables does not mean one causes the other. There may be a confounding variable, reverse causation, or pure coincidence. Always consider the context and apply domain knowledge.
Key Properties of Pearson's r
Range: Always between -1 and +1.
r = +1: Perfect positive linear relationship.
r = -1: Perfect negative linear relationship.
r = 0: No linear relationship (but nonlinear relationships may still exist).
Unitless: Not affected by changes in units or scale of the variables.
Symmetric: r(X, Y) = r(Y, X).
Limitation: Only measures linear relationships. A perfectly symmetric nonlinear relationship (e.g., y = x² over a range centered at 0) can yield r close to 0.
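The limitation is easy to demonstrate: a perfect quadratic relationship over a symmetric range produces r ≈ 0 because the positive and negative linear contributions cancel.

```python
import numpy as np

x = np.arange(-5, 6)   # symmetric around 0
y = x ** 2             # perfect (but nonlinear) relationship
r = np.corrcoef(x, y)[0, 1]
print(r)  # ≈ 0 — Pearson's r completely misses the quadratic relationship
```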
3.1.9 Interpreting Common Plot Types
Understanding what each type of plot reveals about data is essential for data analysis. Here is a guide to the most common plot types used in descriptive statistics.
Boxplot (Box-and-Whisker Plot)
Shows: Median, Q1, Q3, IQR, whiskers, and outliers.
Reveals: Data spread, skewness (asymmetric box), presence of outliers (dots beyond whiskers).
Use case: Comparing distributions across categories; identifying outliers.
Reading: The box spans Q1 to Q3. The line inside is the median. Whiskers extend to 1.5*IQR. Points beyond are outliers.
Histogram
Shows: Frequency distribution of a single continuous variable.
Reveals: Shape (normal, skewed, bimodal), center, spread, and gaps in data.
Use case: Understanding the overall distribution of a single variable.
Reading: Each bar represents a range (bin). The height shows how many values fall in that range.
Scatterplot
Shows: The relationship between two continuous variables.
Reveals: Direction (positive/negative), strength, and form (linear/nonlinear) of relationships; outliers.
Use case: Bivariate analysis; checking for correlation before computing Pearson's r.
Reading: Each point is one observation. Patterns (upward, downward, cluster, fan shape) indicate relationships.
Line Plot
Shows: Trends over time or a continuous ordered variable.
Reveals: Trends, seasonality, cycles, and sudden changes in data over time.
Use case: Time series analysis; tracking metrics over time.
Reading: X-axis is time/order, Y-axis is the measured value. Look for upward/downward trends and repeating patterns.
Correlation Heatmap
Shows: Correlation coefficients between all pairs of variables in a matrix.
Reveals: Which pairs of variables are strongly correlated (positive or negative); multicollinearity.
Use case: Multivariate analysis; feature selection; identifying redundant variables.
Reading: Colors represent correlation strength. Dark red/blue = strong correlation. White/light = weak correlation. The diagonal is always 1 (a variable with itself).
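The matrix that a heatmap colors is just a pairwise correlation table, which pandas computes with DataFrame.corr(). The column names and values below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 55, 61, 70, 74, 80],
    "hours_tv":      [6, 5, 5, 3, 2, 1],
})
corr = df.corr()       # pairwise Pearson correlations between all numeric columns
print(corr.round(2))   # diagonal is 1.0; this matrix is what a heatmap colors
```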
Plot Selection Guide
Use histograms for distribution shape, boxplots for comparing groups and finding outliers, scatterplots for two-variable relationships, line plots for time trends, and heatmaps for an overview of all pairwise correlations.
| Plot Type | Variables | Primary Purpose |
|-----------|-----------|-----------------|
| Histogram | 1 continuous | Distribution shape |
| Boxplot | 1 continuous (optionally grouped by 1 categorical) | Comparing groups; outlier detection |
| Scatterplot | 2 continuous | Two-variable relationships |
| Line plot | 1 continuous over time/order | Time trends |
| Correlation heatmap | All numeric pairs | Overview of pairwise correlations |
Practice Questions
Q1. Which measure of central tendency is most appropriate for a dataset with significant outliers?
A) Mean
B) Median
C) Mode
D) Standard deviation
Correct: B) Median. The median is robust to outliers because it depends only on the middle value(s) in the sorted data. The mean is pulled toward extreme values, and mode measures frequency rather than central position. Standard deviation is a measure of spread, not central tendency.
Q2. In NumPy, what does np.var(data) compute by default?
A) Population variance (ddof=0)
B) Sample variance (ddof=1)
C) Standard deviation
D) Interquartile range
Correct: A) Population variance (ddof=0). By default, NumPy's np.var() uses ddof=0, which computes the population variance. To compute the sample variance with Bessel's correction, you must pass ddof=1.
Q3. What does a Pearson's r value of -0.85 indicate?
A) A weak positive linear relationship
B) A strong negative linear relationship
C) No linear relationship
D) A strong positive linear relationship
Correct: B) A strong negative linear relationship. Pearson's r ranges from -1 to +1. A value of -0.85 is close to -1, indicating a strong inverse relationship: as one variable increases, the other tends to decrease.
Q4. According to the 68-95-99.7 rule, approximately what percentage of data falls within 2 standard deviations of the mean in a normal distribution?
A) 68%
B) 95%
C) 99.7%
D) 50%
Correct: B) 95%. The 68-95-99.7 rule states that approximately 68% of data falls within 1 SD, 95% within 2 SDs, and 99.7% within 3 SDs of the mean in a normal distribution.
Q5. Which plot type is best suited for identifying outliers in a dataset?
A) Line plot
B) Boxplot
C) Correlation heatmap
D) Pie chart
Correct: B) Boxplot. Boxplots explicitly display outliers as individual points beyond the whiskers (1.5 * IQR from Q1 and Q3). They provide a clear visual summary of the data spread and make outliers immediately visible.
Q6. A dataset has a positive skew. Which statement is true about the relationship between mean and median?
A) Mean = Median
B) Mean > Median
C) Mean < Median
D) Mean and Median are unrelated to skewness
Correct: B) Mean > Median. In a positively (right) skewed distribution, the right tail is longer. The mean is pulled toward the tail, making it greater than the median. The general relationship is: Mean > Median > Mode for positive skew.
Q7. Using the IQR method, a value is considered an outlier if it falls:
A) More than 1 standard deviation from the mean
B) Beyond the range (min to max)
C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
D) Below Q1 - 3*IQR or above Q3 + 3*IQR
Correct: C) Below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The standard IQR method (Tukey's fences) defines outliers as values falling below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Values beyond 3*IQR are sometimes called "extreme outliers."
Q8. What does the scipy.stats.pearsonr() function return?
A) Only the correlation coefficient
B) The correlation coefficient and the p-value
C) The correlation matrix
D) The covariance and standard deviations
Correct: B) The correlation coefficient and the p-value. scipy.stats.pearsonr(x, y) returns (r, p-value), where r is Pearson's correlation coefficient and the p-value indicates the statistical significance of the correlation. Use np.corrcoef() for the full correlation matrix.
Q9. In a correlation heatmap, a cell colored dark blue between variables A and B indicates:
A) A strong positive correlation between A and B
B) A strong negative correlation between A and B
C) No correlation between A and B
D) A and B are identical variables
Correct: B) A strong negative correlation between A and B. In the commonly used "coolwarm" colormap, dark blue represents values close to -1 (strong negative correlation), dark red represents values close to +1 (strong positive correlation), and white/light colors represent values near 0 (weak or no correlation).
Q10. Which statement about standard deviation and variance is FALSE?
A) Standard deviation is the square root of variance
B) Variance is expressed in the same units as the original data
C) A larger standard deviation indicates more spread in the data
D) Sample variance uses n-1 in the denominator (Bessel's correction)
Correct: B) Variance is expressed in the same units as the original data. This statement is FALSE. Variance is expressed in squared units (e.g., meters squared). It is the standard deviation that is expressed in the same units as the original data, making it easier to interpret.