Master Matplotlib and Seaborn to create, customize, and interpret data visualizations for analysis and communication.
Objectives: 3 | Weight: 10.4% of exam
5.1.1 Matplotlib and Seaborn Proficiency
Matplotlib is the foundational plotting library in Python. Seaborn is built on top of Matplotlib and provides a higher-level interface with polished default styling and built-in support for statistical visualizations. Together they cover most of the static charts a data analyst needs.
Exam Focus
Know the correct function calls for each chart type, understand what each visualization reveals about the data, and be comfortable with both the Matplotlib and Seaborn syntax for creating them.
Creating Boxplots
Boxplots (box-and-whisker plots) display the five-number summary of a dataset: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. They are ideal for detecting outliers and comparing distributions across categories.
Box: spans from Q1 to Q3 (the interquartile range, IQR)
Line inside box: the median
Whiskers: extend to the most extreme data points that lie within 1.5 × IQR of Q1 and Q3
Points beyond whiskers: outliers
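The arithmetic behind these components can be checked by hand. The sketch below is illustrative only: the ten values are made up, and it assumes the 1.5 × IQR rule (Matplotlib's default whis=1.5) together with NumPy's default linear-interpolation percentiles.

```python
import numpy as np

# Hypothetical sample: nine typical scores and one extreme value
values = np.array([12, 15, 17, 19, 20, 22, 24, 25, 27, 60])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers.tolist()}")
```

Here the upper whisker ends at 27, the largest value inside the fence, not at the fence itself (35.625), and the value 60 is plotted as an outlier.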
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# Generate sample data
np.random.seed(42)
data_a = np.random.normal(50, 10, 200)
data_b = np.random.normal(60, 15, 200)
data_c = np.random.normal(45, 8, 200)
# --- Matplotlib boxplot ---
fig, ax = plt.subplots(figsize=(8, 5))
bp = ax.boxplot([data_a, data_b, data_c],
                labels=['Group A', 'Group B', 'Group C'],  # renamed tick_labels in Matplotlib 3.9+
                patch_artist=True,   # fill boxes with color
                notch=True,          # add confidence notch
                medianprops={'color': 'black', 'linewidth': 2})
colors = ['#dbeafe', '#d1fae5', '#ede9fe']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
ax.set_title('Score Distribution by Group', fontweight='bold')
ax.set_ylabel('Score')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# --- Seaborn boxplot ---
df = pd.DataFrame({
    'Score': np.concatenate([data_a, data_b, data_c]),
    'Group': ['A']*200 + ['B']*200 + ['C']*200
})
# Seaborn 0.13+ warns when palette is given without hue; there, add hue='Group', legend=False
sns.boxplot(x='Group', y='Score', data=df, palette='pastel')
plt.title('Seaborn Boxplot')
plt.show()
When to Use a Boxplot
Use boxplots to compare distributions across categories, identify outliers, and quickly assess whether groups differ in central tendency or spread. For continuous data they are more informative than a bar chart of group means, which reduces each group to a single number and hides spread and outliers.
Creating Histograms
Histograms show the distribution shape of a single numeric variable by grouping values into bins and counting frequencies. They reveal skewness, modality (unimodal, bimodal), and spread.
Bin selection matters: too few bins over-smooth; too many create noise
KDE (Kernel Density Estimate): a smooth curve overlay that estimates the probability density function
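To see how the bin count changes the picture without drawing anything, the counts can be computed directly with np.histogram. This is a sketch on simulated data (the seed and distribution parameters are arbitrary); bins='auto' is NumPy's built-in bin-selection rule, which plt.hist also accepts.

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed for reproducibility
sample = rng.normal(50, 10, 200)         # simulated scores

for bins in (5, 20, 80, "auto"):
    counts, edges = np.histogram(sample, bins=bins)
    width = edges[1] - edges[0]
    print(f"bins={bins!s:>4}: {len(counts):2d} bars, "
          f"bin width {width:5.2f}, tallest bar {counts.max()}")
```

With only 5 bins most of the shape is lost in a few wide bars; with 80 bins many bars hold just a handful of observations, so random variation dominates.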
# --- Histogram with KDE overlay ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Matplotlib histogram
axes[0].hist(data_a, bins=20, color='#93c5fd', edgecolor='white', alpha=0.8)
axes[0].set_title('Matplotlib Histogram (20 bins)')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')
# Seaborn histplot with KDE
sns.histplot(data_a, bins=25, kde=True, color='#7c3aed', ax=axes[1])
axes[1].set_title('Seaborn histplot with KDE')
axes[1].set_xlabel('Value')
plt.tight_layout()
plt.show()
# Comparing multiple distributions on one histogram
plt.hist(data_a, bins=20, alpha=0.5, label='Group A', color='#2563eb')
plt.hist(data_b, bins=20, alpha=0.5, label='Group B', color='#059669')
plt.legend()
plt.title('Overlapping Histograms')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Exam Pitfall: Histograms vs Bar Charts
Histograms are for continuous numeric data and have no gaps between bars. Bar charts are for categorical data and have gaps between bars. Do not confuse them on the exam.
Creating Scatterplots
Scatterplots show the relationship between two continuous variables. Each point represents an observation. They are essential for identifying correlations, clusters, and outliers.
# --- Scatterplot with color mapping and size ---
np.random.seed(42)
n = 100
df_scatter = pd.DataFrame({
    'Hours_Studied': np.random.uniform(1, 10, n),
    'Exam_Score': np.random.uniform(40, 100, n),   # placeholder, overwritten below
    'Attendance': np.random.uniform(50, 100, n),
    'Grade': np.random.choice(['A', 'B', 'C'], n)
})
# Replace the placeholder so Exam_Score depends linearly on Hours_Studied plus noise
df_scatter['Exam_Score'] = df_scatter['Hours_Studied'] * 7 + np.random.normal(0, 8, n)
# Matplotlib scatter with color and size
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(
    df_scatter['Hours_Studied'],
    df_scatter['Exam_Score'],
    c=df_scatter['Attendance'],   # color mapped to attendance
    s=df_scatter['Attendance'],   # size mapped to attendance
    cmap='viridis',
    alpha=0.7,
    edgecolors='white'
)
plt.colorbar(scatter, label='Attendance %')
ax.set_xlabel('Hours Studied')
ax.set_ylabel('Exam Score')
ax.set_title('Study Hours vs Exam Score', fontweight='bold')
plt.tight_layout()
plt.show()
# Seaborn scatterplot with hue
sns.scatterplot(x='Hours_Studied', y='Exam_Score',
                hue='Grade', size='Attendance',
                data=df_scatter, palette='Set2', alpha=0.8)
plt.title('Seaborn Scatterplot with Hue and Size')
plt.show()
Creating Line Plots
Line plots display trends over time or ordered sequences. They connect data points with lines, making it easy to see increases, decreases, and patterns like seasonality.
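A minimal line plot might look like the following. The monthly figures are invented sample data, and the names months, product_a, and product_b are placeholders chosen for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly unit sales for two products (illustrative values only)
months = list(range(1, 13))
product_a = [110, 120, 130, 150, 165, 180, 195, 210, 225, 240, 250, 245]
product_b = [100, 104, 110, 125, 130, 128, 140, 155, 150, 165, 170, 182]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(months, product_a, marker='o', label='Product A')
ax.plot(months, product_b, marker='s', label='Product B')
ax.set_xticks(months)                    # one tick per month
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
ax.set_title('Monthly Sales Trend')
ax.legend()
fig.tight_layout()
```

Call plt.show() to display the figure, or fig.savefig() to write it to a file.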
Creating Heatmaps
Heatmaps use color intensity to represent values in a matrix. The most common analytical use is displaying a correlation matrix, which shows pairwise correlation coefficients between all numeric columns.
Key Parameters for sns.heatmap()
annot=True displays the numeric value in each cell. fmt='.2f' controls the number of decimal places. cmap='coolwarm' uses a diverging color scale (blue for negative, red for positive). center=0 places the color midpoint at zero correlation.
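A correlation heatmap built with sns.heatmap() could be sketched as follows. The DataFrame is simulated and its column names are hypothetical; vmin=-1 and vmax=1 are an addition here so that the color scale spans the full correlation range.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data with hypothetical column names
rng = np.random.default_rng(0)
hours = rng.uniform(1, 10, 100)
df_corr = pd.DataFrame({
    'Hours_Studied': hours,
    'Exam_Score': hours * 7 + rng.normal(0, 8, 100),
    'Absences': rng.integers(0, 10, 100),
})

corr = df_corr.corr()                    # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, vmin=-1, vmax=1, ax=ax)
ax.set_title('Correlation Matrix')
fig.tight_layout()
```

The diagonal is always 1.00 because every variable correlates perfectly with itself, and the matrix is symmetric, so each off-diagonal value appears twice.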
Figure, Axes, and Subplot Layouts
Matplotlib uses a Figure (the entire canvas) and Axes (individual plots) model. The function plt.subplots() creates both at once and is the recommended approach for multi-panel figures.
fig, ax = plt.subplots()
This is the standard pattern. fig controls the overall figure (size, title, saving). ax (or axes for multi-plot) controls individual charts. When you have a 2D array of axes, access them with axes[row, col].
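As an illustration of the axes[row, col] pattern, here is a small sketch of a 2 × 3 grid. The plotted sine and cosine curves are arbitrary filler chosen only to give the panels some content.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# 2 rows x 3 columns: axes is a 2D NumPy array of Axes objects
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)

axes[0, 0].plot(x, np.sin(x))            # top-left panel
axes[0, 0].set_title('sin(x)')
axes[1, 2].plot(x, np.cos(x))            # bottom-right panel
axes[1, 2].set_title('cos(x)')

# axes.flat iterates over every panel regardless of grid shape
for ax in axes.flat:
    ax.grid(alpha=0.3)

fig.suptitle('2 x 3 subplot grid')       # figure-level title
fig.tight_layout()
```

With a single row or column, such as plt.subplots(1, 2), the returned array is one-dimensional and is indexed as axes[0] and axes[1]. Passing squeeze=False always returns a 2D array.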
5.1.2 Pros and Cons of Data Representations
Choosing the right visualization is just as important as creating one. The wrong chart can mislead, confuse, or hide key insights. This section covers when to use each chart type and how to critically evaluate visualization effectiveness.
Choosing the Right Chart
Bar Charts vs Histograms
Feature | Bar Chart | Histogram
Data type | Categorical | Continuous / numeric
X-axis | Discrete categories | Numeric ranges (bins)
Bar spacing | Gaps between bars | No gaps (adjacent bins)
Purpose | Compare counts/values across categories | Show distribution shape
Example | Sales by region | Distribution of student ages
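The contrast between the two chart types is easiest to see by drawing them side by side. In this sketch the region names, sales figures, and ages are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical categorical data: one value per discrete category
regions = ['North', 'South', 'East', 'West']
sales = [240, 180, 310, 205]

# Hypothetical continuous data: simulated student ages
rng = np.random.default_rng(1)
ages = rng.normal(21, 2.5, 300)

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(11, 4))

ax_bar.bar(regions, sales, width=0.6)    # width < 1 leaves gaps between bars
ax_bar.set_title('Bar chart: sales by region')
ax_bar.set_ylabel('Sales')

ax_hist.hist(ages, bins=10, edgecolor='white')   # adjacent bins, no gaps
ax_hist.set_title('Histogram: distribution of ages')
ax_hist.set_xlabel('Age')
ax_hist.set_ylabel('Frequency')

fig.tight_layout()
```

Reordering the bars in the bar chart changes nothing about its meaning, whereas the histogram's bins have a fixed numeric order along the x-axis.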
Scatter Plots vs Line Plots
Feature | Scatter Plot | Line Plot
Best for | Relationship between two variables | Trends over time or ordered data
X-axis | Any continuous variable | Typically time or sequential order
Data points | Individual points, not connected | Connected by lines
Reveals | Correlations, clusters, outliers | Trends, seasonality, rate of change
Example | Height vs weight | Stock price over months
Comprehensive Chart Selection Guide
Data Scenario | Best Chart | Why
Distribution of one variable | Histogram / KDE plot | Shows shape, central tendency, spread
Comparing groups | Boxplot / Violin plot | Shows median, quartiles, outliers per group
Relationship between two numeric variables | Scatterplot | Shows correlation, clusters, outliers
Trend over time | Line plot | Connects points to show change over time
Comparing categorical counts | Bar chart | Clear comparison of discrete categories
Pairwise correlations | Heatmap | Color-coded matrix of all variable pairs
Part-of-whole (few categories) | Stacked bar chart (preferred over pie) | Easier to compare proportions
Pie Charts: Limitations and Alternatives
Pie charts show proportions of a whole, but they have significant limitations:
Hard to compare: humans are poor at judging angles and areas accurately
Too many slices: more than 5-6 categories makes pie charts nearly unreadable
No precise comparison: it is difficult to compare slices of similar size
3D pie charts: distort proportions further and should always be avoided
Better alternatives:
Horizontal bar chart: easy to compare exact values
Stacked bar chart: shows part-of-whole while allowing comparison across groups
Treemap: good for hierarchical part-of-whole data
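As one way to apply the first alternative, the sketch below draws similar-sized shares as a sorted horizontal bar chart. The brand names and percentages are hypothetical, and ax.bar_label() assumes Matplotlib 3.4 or newer.

```python
import matplotlib.pyplot as plt

# Hypothetical market-share percentages (sum to 100)
shares = {'Brand A': 23, 'Brand B': 21, 'Brand C': 20, 'Brand D': 19, 'Brand E': 17}

# Sort ascending so the largest bar ends up at the top of a barh chart
labels, values = zip(*sorted(shares.items(), key=lambda item: item[1]))

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.barh(labels, values, color='#2563eb')
ax.bar_label(bars, fmt='%d%%', padding=3)   # print the exact value next to each bar
ax.set_xlabel('Market share (%)')
ax.set_title('Market Share by Brand')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
fig.tight_layout()
```

In a pie chart these five slices would look nearly identical; as bars, the ranking and the small differences between brands are immediately readable.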
Exam Tip
The exam may ask you to identify which chart type is most appropriate for a given scenario. Remember: pie charts are rarely the best answer. Bar charts, histograms, boxplots, and scatterplots each serve a specific purpose — match the chart to the data type and the question you are trying to answer.
Critically Analyzing Visualization Effectiveness
When evaluating a visualization, ask:
Does the chart type match the data type (categorical, continuous, time-series)?
Does the y-axis start at zero where it should, especially for bar charts? (Truncated axes can exaggerate differences.)
Are axes labeled and units included?
Is the color scheme accessible (colorblind-friendly)?
Does the visualization answer the intended question clearly?
Is there unnecessary "chart junk" (3D effects, excessive gridlines, decorative elements)?
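The effect of a truncated y-axis can be demonstrated with two bars that differ by only about 4%. The figures are hypothetical, and the axis limits are chosen purely to make the distortion visible.

```python
import matplotlib.pyplot as plt

# Hypothetical values that differ by only about 4%
quarters = ['Q1', 'Q2']
revenue = [98, 102]

fig, (ax_trunc, ax_zero) = plt.subplots(1, 2, figsize=(9, 4))

ax_trunc.bar(quarters, revenue)
ax_trunc.set_ylim(96, 103)               # truncated axis: Q2 bar appears three times taller
ax_trunc.set_title('Truncated y-axis (misleading)')

ax_zero.bar(quarters, revenue)
ax_zero.set_ylim(0, 110)                 # zero baseline: bar heights reflect the true ratio
ax_zero.set_title('Zero-based y-axis')

for ax in (ax_trunc, ax_zero):
    ax.set_ylabel('Revenue (thousands USD)')

fig.tight_layout()
```

Both panels plot the same data. In the left panel the visible bar heights are 2 and 6 units, which suggests a threefold difference that does not exist.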
5.1.3 Labeling, Annotating, and Refining
A visualization without proper labels, annotations, and styling is incomplete. This section covers how to transform raw plots into publication-quality, professional visualizations that communicate data insights effectively.
Labels and Titles
Every chart should have at minimum: a title, x-axis label, and y-axis label. These provide essential context so the viewer can interpret the chart without additional explanation.
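A minimal sketch of the three essential labels, using the Axes methods set_title(), set_xlabel(), and set_ylabel(). The data values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements for illustration
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 63, 71, 78, 84]

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(hours, scores, marker='o')

ax.set_title('Exam Score vs Hours Studied', fontsize=14, fontweight='bold')
ax.set_xlabel('Hours studied per week')  # include units where they apply
ax.set_ylabel('Exam score (%)')
fig.tight_layout()
```

The pyplot equivalents plt.title(), plt.xlabel(), and plt.ylabel() act on the current Axes and produce the same result in single-plot scripts.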
Annotations
Annotations highlight specific data points with text and arrows, drawing the viewer's attention to important findings.
# --- Annotations ---
# Illustrative sample data (hypothetical monthly unit sales)
months = list(range(1, 13))
product_a = [110, 120, 130, 150, 165, 180, 195, 210, 225, 240, 250, 245]
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(months, product_a, marker='o', linewidth=2, color='#2563eb')
# ax.annotate() - adds text with an arrow pointing to data
ax.annotate('Peak Sales',
            xy=(11, 250),        # point to annotate (x, y)
            xytext=(8, 260),     # text position
            fontsize=11,
            fontweight='bold',
            arrowprops=dict(
                arrowstyle='->',
                color='#dc2626',
                linewidth=2
            ),
            color='#dc2626')
# ax.text() - adds text at a specific position (no arrow)
ax.text(3, 145, 'Growth Phase', fontsize=10,
        fontstyle='italic', color='gray',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='#fef3c7', alpha=0.8))
ax.set_title('Sales Trend with Annotations', fontweight='bold')
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
plt.tight_layout()
plt.show()
annotate() vs text()
plt.annotate() (or ax.annotate()) supports arrowprops to draw an arrow from the text position (xytext) to the data point (xy). plt.text() (or ax.text()) places text at the given coordinates but has no arrow. Use annotate() when you need to point at a specific data value.
Colors, Legends, and Full Customization
Color Customization
Colors convey meaning and improve readability. Matplotlib supports named colors, hex codes, RGB tuples, and colormaps.
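The different color formats are interchangeable, which can be checked with the converters in matplotlib.colors. This is a small sketch; the specific colors are arbitrary examples.

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

# The same red written three ways
named = mcolors.to_rgb('red')                # named color
hex_code = mcolors.to_rgb('#ff0000')         # hex code
rgb_tuple = mcolors.to_rgb((1.0, 0.0, 0.0))  # RGB tuple, each channel in the range 0-1
print(named == hex_code == rgb_tuple)

# Convert an RGB tuple back to a hex string
print(mcolors.to_hex((0.0, 0.5, 1.0)))

# A colormap maps numbers in [0, 1] to RGBA colors
cmap = plt.get_cmap('viridis')
low, mid, high = cmap(0.0), cmap(0.5), cmap(1.0)
print(low, mid, high)
```

Any of these forms can be passed to a color= argument, while a colormap name is passed to cmap= when colors should be derived from data values.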
Legend Manipulation
Legends identify data series. You can control their position, font size, background, and border.
Axes and Tick Formatting
Customize axis limits, tick positions, tick labels, and gridlines for clarity.
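# --- Full customization example ---
import matplotlib.ticker as mticker
# Illustrative sample data (hypothetical monthly unit sales)
months = list(range(1, 13))
product_a = [110, 120, 130, 150, 165, 180, 195, 210, 225, 240, 250, 245]
product_b = [100, 104, 110, 125, 130, 128, 140, 155, 150, 165, 170, 182]
fig, ax = plt.subplots(figsize=(10, 6))
# Plot data with custom colors
ax.plot(months, product_a, marker='o', linewidth=2.5,
        color='#2563eb', markersize=8, label='Product A',
        markerfacecolor='white', markeredgewidth=2)
ax.plot(months, product_b, marker='s', linewidth=2.5,
        color='#059669', markersize=8, label='Product B',
        markerfacecolor='white', markeredgewidth=2)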
# Title and labels
ax.set_title('2024 Monthly Sales Comparison', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Units Sold', fontsize=12)
# Legend customization
ax.legend(
    loc='upper left',        # position
    fontsize=11,             # font size
    framealpha=0.9,          # background transparency
    edgecolor='#e2e8f0',     # border color
    facecolor='white',       # background color
    shadow=True,             # drop shadow
    title='Products',        # legend title
    title_fontsize=12
)
# Gridlines
ax.grid(True, linestyle='--', alpha=0.4, color='gray')
ax.set_axisbelow(True) # grid behind data
# Axes formatting
ax.set_ylim(0, 300)
ax.yaxis.set_major_locator(mticker.MultipleLocator(50))
ax.tick_params(axis='both', labelsize=10)
# Annotation on peak
ax.annotate('Record High: 250 units',
            xy=(11, 250), xytext=(7, 270),
            fontsize=10, fontweight='bold', color='#dc2626',
            arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.5))
# Remove top and right spines for cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
Seaborn Styles and Palettes
Seaborn provides built-in style presets and color palettes that instantly improve the appearance of all Matplotlib and Seaborn plots.
# --- Seaborn styles and palettes ---
# Set the overall style
sns.set_style('whitegrid') # Options: white, dark, whitegrid, darkgrid, ticks
# Set a color palette globally
sns.set_palette('Set2') # Options: deep, muted, bright, pastel, dark, Set1, Set2...
# Use a specific palette in a single plot
# (Seaborn 0.13+ warns unless hue is also set: add hue='Group', legend=False)
sns.boxplot(x='Group', y='Score', data=df, palette='husl')
# Custom color palette
custom_palette = ['#2563eb', '#059669', '#7c3aed', '#d97706']
sns.set_palette(custom_palette)
# Available built-in styles:
# 'white' - clean white background, no grid
# 'dark' - gray background, no grid
# 'whitegrid' - white background with grid lines
# 'darkgrid' - gray background with white grid lines (default for sns.set_theme())
# 'ticks' - white background with tick marks on the axes
# Reset to defaults
sns.reset_defaults()
Style | Best For
whitegrid | Quantitative data that benefits from gridlines for reading values
darkgrid | Default style of sns.set_theme(); good for presentations with many plot elements
white | Minimalist look for publication-quality figures
ticks | Clean look with axis ticks for precise reading
dark | Gray background without gridlines; rarely used in print
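Styles can also be inspected, or applied to a single figure only, with sns.axes_style(). The sketch below is a small illustration with placeholder data; it relies on axes_style() returning a dictionary of rc parameters that also works as a context manager.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# axes_style() returns the rc parameters a style would set, without applying them
print(sns.axes_style('whitegrid')['axes.grid'])   # grid enabled
print(sns.axes_style('white')['axes.grid'])       # grid disabled

# color_palette() returns a list of RGB tuples that can be passed as color=
first_three = sns.color_palette('Set2', 3)

# Used as a context manager, the style applies only to axes created inside the block
with sns.axes_style('ticks'):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3], [2, 4, 3], color=first_three[0])
    ax.set_title("Temporary 'ticks' style")
```

This is useful when one figure in a report needs a different look without changing the global setting made by sns.set_style().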
Visual Exploration and Data-Driven Decisions
Visualizations are not just for presenting results — they are a critical tool during exploratory data analysis (EDA). Before running statistical tests or building models, use plots to:
Generate hypotheses: a scatterplot may reveal a correlation you did not expect, leading to a new research question
Detect data quality issues: histograms can reveal unexpected gaps or spikes, boxplots expose outliers
Understand distributions: check normality assumptions before applying parametric tests
Guide feature selection: heatmaps highlight which variables are strongly correlated
Communicate findings: the right visualization makes complex data accessible to non-technical stakeholders
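The feature-selection step in the list above can be backed by numbers as well as by a heatmap. The following sketch scans a correlation matrix for strongly correlated pairs; the dataset, the column names, and the 0.8 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; column names and values are illustrative only
x = np.arange(50)
wiggle = np.tile([1, -1], 25)            # small alternating pattern
df_eda = pd.DataFrame({
    'hours': x,
    'score': 2 * x + wiggle,             # nearly a linear function of hours
    'noise': wiggle,                     # essentially unrelated to hours
})

corr = df_eda.corr()

# Visit each pair once (upper triangle) and flag strong correlations
threshold = 0.8
strong_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
print(strong_pairs)
```

Pairs flagged this way are candidates for closer inspection with a scatterplot, since a correlation coefficient alone can hide nonlinear patterns and outliers.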
Making Data-Driven Decisions
Always let the data guide your conclusions, not the other way around. A well-constructed visualization helps you see patterns objectively. When presenting, pair each visualization with a clear interpretation statement: "This chart shows that X increased by Y% during Z period, suggesting..."
Practice Quiz: Data Visualization Techniques
Q1. Which component of a boxplot represents the interquartile range (IQR)?
A) The whiskers extending from the box
B) The box itself, spanning from Q1 to Q3
C) The line inside the box
D) The individual points beyond the whiskers
Answer: B
The box in a boxplot spans from the first quartile (Q1) to the third quartile (Q3). The distance between Q1 and Q3 is the interquartile range (IQR). The line inside the box is the median, the whiskers extend up to 1.5 × IQR, and points beyond the whiskers are outliers.
Q2. What is the primary difference between a histogram and a bar chart?
A) Histograms use vertical bars; bar charts use horizontal bars
B) Histograms have colors; bar charts are grayscale
C) Histograms display continuous numeric data with adjacent bins; bar charts display categorical data with gaps between bars
D) There is no meaningful difference; the terms are interchangeable
Answer: C
Histograms are used for continuous numeric data. Values are grouped into bins and bars touch each other because the data is continuous. Bar charts represent discrete categorical data, and the bars have gaps between them to emphasize that categories are separate.
Q3. Which Seaborn function is used to create a correlation heatmap from a DataFrame?
A) sns.corrplot(df)
B) sns.heatmap(df.corr())
C) sns.correlation(df)
D) sns.pairplot(df.corr())
Answer: B
To create a correlation heatmap, first compute the correlation matrix with df.corr(), then pass it to sns.heatmap(). Common parameters include annot=True to show values and cmap='coolwarm' for a diverging color scheme.
Q4. You want to visualize the trend of monthly revenue over two years. Which chart type is most appropriate?
A) Pie chart
B) Boxplot
C) Line plot
D) Histogram
Answer: C
Line plots are the best choice for showing trends over time. They connect data points chronologically, making it easy to see increases, decreases, and seasonal patterns. A pie chart shows proportions, a boxplot shows distribution summaries, and a histogram shows frequency distributions.
Q5. What does the annot=True parameter do in sns.heatmap()?
A) Adds a title annotation above the heatmap
B) Adds axis labels automatically
C) Displays the numeric value inside each cell of the heatmap
D) Annotates outlier values with arrows
Answer: C
Setting annot=True in sns.heatmap() writes the numeric value of each cell directly on the heatmap. This makes it easier to read exact correlation coefficients. Use fmt='.2f' to control the number of decimal places displayed.
Q6. Which function adds text with an arrow pointing to a specific data point on a Matplotlib plot?
A) plt.text()
B) plt.annotate()
C) plt.label()
D) plt.arrow()
Answer: B
plt.annotate() (or ax.annotate()) is used to add text with an arrow pointing to a specific data point. It uses the xy parameter for the point location and xytext for the text position. plt.text() places text at coordinates but does not support arrows via arrowprops.
Q7. What is the correct way to create a 2-row, 3-column grid of subplots in Matplotlib?
A) fig, ax = plt.subplot(2, 3)
B) fig, axes = plt.subplots(2, 3)
C) fig = plt.figure(2, 3)
D) fig, axes = plt.grid(2, 3)
Answer: B
plt.subplots(nrows, ncols) creates a Figure and an array of Axes objects. For a 2×3 grid, fig, axes = plt.subplots(2, 3) returns axes as a 2D numpy array where you access individual plots with axes[row, col]. Note the plural "subplots" (not "subplot").
Q8. Which of the following is a major limitation of pie charts?
A) They cannot display more than two categories
B) Humans find it difficult to accurately compare angles and areas, especially for similar-sized slices
C) They can only display continuous numerical data
D) They require more computational resources than other chart types
Answer: B
The primary limitation of pie charts is that humans are poor at judging and comparing angles and areas. When slices are similar in size, it becomes nearly impossible to determine which is larger. This is why bar charts are generally preferred — the human visual system is much better at comparing lengths.
Q9. What does sns.set_style('whitegrid') do?
A) Sets the background color of all plots to white
B) Applies a Seaborn theme with a white background and visible grid lines to all subsequent plots
C) Adds a white grid overlay on top of the current plot only
D) Resets all Seaborn settings to default
Answer: B
sns.set_style('whitegrid') sets a global aesthetic theme for all subsequent plots. The 'whitegrid' style provides a clean white background with horizontal and vertical grid lines, which is excellent for quantitative data where precise value reading is important.
Q10. You want to explore the relationship between two numeric variables and also encode a third categorical variable using color. Which approach is best?
A) plt.hist(x, color=category)
B) sns.scatterplot(x='var1', y='var2', hue='category', data=df)
C) sns.heatmap(df[['var1', 'var2']])
D) plt.pie(df['var1'], labels=df['category'])
Answer: B
A scatterplot shows the relationship between two numeric variables, and the hue parameter in sns.scatterplot() maps a categorical variable to point color. This lets you see both the bivariate relationship and how it differs across groups — all in a single chart.