Topic 5.1: Data Visualization Techniques

Objectives Overview

5.1.1 — Matplotlib and Seaborn Proficiency
5.1.2 — Pros and Cons of Data Representations
5.1.3 — Labeling, Annotating, and Refining

Master Matplotlib and Seaborn to create, customize, and interpret data visualizations for analysis and communication.

Objectives: 3 | Weight: 10.4% of exam

5.1.1 Matplotlib and Seaborn Proficiency

Matplotlib is the foundational plotting library in Python. Seaborn is built on top of Matplotlib and provides a higher-level, more aesthetically pleasing interface with built-in support for statistical visualizations. Together they cover virtually every visualization need a data analyst encounters.

Exam Focus
Know the correct function calls for each chart type, understand what each visualization reveals about the data, and be comfortable with both the Matplotlib and Seaborn syntax for creating them.

Creating Boxplots

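Boxplots (box-and-whisker plots) display the five-number summary of a dataset: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. With the default 1.5 × IQR whisker rule, the whiskers end at the most extreme values that are not flagged as outliers, so the true minimum or maximum may appear as an individual outlier point. Boxplots are ideal for detecting outliers and comparing distributions across categories.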

Box: spans from Q1 to Q3 (the interquartile range, IQR)
Line inside box: the median
Whiskers: extend to the most extreme data points within 1.5 × IQR below Q1 and above Q3
Points beyond whiskers: outliers
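The components listed above can be computed by hand, which helps when reading a boxplot. The following is a minimal sketch using NumPy only; the sample data and the two planted extreme values are hypothetical and chosen just to produce visible outliers.

```python
import numpy as np

# Hypothetical sample: 200 normally distributed scores plus two planted extremes
rng = np.random.default_rng(42)
scores = np.append(rng.normal(50, 10, 200), [5.0, 110.0])

# Box: Q1 to Q3, with the median inside
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Fences: values outside these limits are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
inside = scores[(scores >= lower_fence) & (scores <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}, IQR={iqr:.1f}")
print(f"Whiskers: {whisker_low:.1f} to {whisker_high:.1f}")
print(f"Outliers: {np.sort(outliers).round(1)}")
```

Note that the whisker ends are actual data values, so they usually fall slightly inside the fences rather than exactly on them.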

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Generate sample data
np.random.seed(42)
data_a = np.random.normal(50, 10, 200)
data_b = np.random.normal(60, 15, 200)
data_c = np.random.normal(45, 8, 200)

# --- Matplotlib boxplot ---
fig, ax = plt.subplots(figsize=(8, 5))
# Note: Matplotlib 3.9+ renames `labels` to `tick_labels`; `labels` still works but warns
bp = ax.boxplot([data_a, data_b, data_c],
                labels=['Group A', 'Group B', 'Group C'],
                patch_artist=True,   # fill boxes with color
                notch=True,          # add confidence notch
                medianprops={'color': 'black', 'linewidth': 2})

colors = ['#dbeafe', '#d1fae5', '#ede9fe']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

ax.set_title('Score Distribution by Group', fontweight='bold')
ax.set_ylabel('Score')
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

# --- Seaborn boxplot ---
df = pd.DataFrame({
    'Score': np.concatenate([data_a, data_b, data_c]),
    'Group': ['A']*200 + ['B']*200 + ['C']*200
})
# Note: seaborn 0.13+ warns when `palette` is passed without `hue`
sns.boxplot(x='Group', y='Score', data=df, palette='pastel')
plt.title('Seaborn Boxplot')
plt.show()

When to Use a Boxplot
Use boxplots to compare distributions across categories, identify outliers, and quickly assess whether groups differ in central tendency or spread. They are more informative than bar charts for continuous data.

Creating Histograms

Histograms show the distribution shape of a single numeric variable by grouping values into bins and counting frequencies. They reveal skewness, modality (unimodal, bimodal), and spread.

Bin selection matters: too few bins over-smooth; too many create noise
KDE (Kernel Density Estimate): a smooth curve overlay that estimates the probability density function
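To see why bin selection matters, it helps to bin the same data several ways and compare the results numerically. This is a small sketch with NumPy; the sample and the three bin counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(50, 10, 200)  # hypothetical sample

# Same data, three bin counts: every histogram accounts for all 200 values,
# but bars get shorter and noisier as the bins get narrower
for n_bins in (5, 20, 80):
    counts, edges = np.histogram(values, bins=n_bins)
    width = edges[1] - edges[0]
    print(f"{n_bins:>2} bins: width={width:.2f}, tallest bar={counts.max()}")

# NumPy can also suggest bin edges from the data (here the Freedman-Diaconis rule)
fd_edges = np.histogram_bin_edges(values, bins='fd')
print(f"Freedman-Diaconis suggests {len(fd_edges) - 1} bins")
```

Matplotlib's hist() accepts the same string rules for its bins argument (for example bins='fd' or bins='auto'), which can be a reasonable starting point before tuning by eye.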

# --- Histogram with KDE overlay ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Matplotlib histogram
axes[0].hist(data_a, bins=20, color='#93c5fd', edgecolor='white', alpha=0.8)
axes[0].set_title('Matplotlib Histogram (20 bins)')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

# Seaborn histplot with KDE
sns.histplot(data_a, bins=25, kde=True, color='#7c3aed', ax=axes[1])
axes[1].set_title('Seaborn histplot with KDE')
axes[1].set_xlabel('Value')

plt.tight_layout()
plt.show()

# Comparing multiple distributions on one histogram
plt.hist(data_a, bins=20, alpha=0.5, label='Group A', color='#2563eb')
plt.hist(data_b, bins=20, alpha=0.5, label='Group B', color='#059669')
plt.legend()
plt.title('Overlapping Histograms')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Exam Pitfall: Histograms vs Bar Charts
Histograms are for continuous numeric data and have no gaps between bars. Bar charts are for categorical data and have gaps between bars. Do not confuse them on the exam.

Creating Scatterplots

Scatterplots show the relationship between two continuous variables. Each point represents an observation. They are essential for identifying correlations, clusters, and outliers.

# --- Scatterplot with color mapping and size ---
np.random.seed(42)
n = 100
df_scatter = pd.DataFrame({
    'Hours_Studied': np.random.uniform(1, 10, n),
    'Exam_Score': np.random.uniform(40, 100, n),
    'Attendance': np.random.uniform(50, 100, n),
    'Grade': np.random.choice(['A', 'B', 'C'], n)
})
# Overwrite Exam_Score so it depends on Hours_Studied (plus noise)
df_scatter['Exam_Score'] = df_scatter['Hours_Studied'] * 7 + np.random.normal(0, 8, n)

# Matplotlib scatter with color and size
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(
    df_scatter['Hours_Studied'],
    df_scatter['Exam_Score'],
    c=df_scatter['Attendance'],   # color mapped to attendance
    s=df_scatter['Attendance'],   # size mapped to attendance
    cmap='viridis',
    alpha=0.7,
    edgecolors='white'
)
plt.colorbar(scatter, label='Attendance %')
ax.set_xlabel('Hours Studied')
ax.set_ylabel('Exam Score')
ax.set_title('Study Hours vs Exam Score', fontweight='bold')
plt.tight_layout()
plt.show()

# Seaborn scatterplot with hue
sns.scatterplot(x='Hours_Studied', y='Exam_Score', hue='Grade', size='Attendance',
                data=df_scatter, palette='Set2', alpha=0.8)
plt.title('Seaborn Scatterplot with Hue and Size')
plt.show()

Creating Line Plots

Line plots display trends over time or ordered sequences. They connect data points with lines, making it easy to see increases, decreases, and patterns like seasonality.

# --- Line plot with multiple series ---
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
product_a = [120, 135, 148, 162, 155, 170, 190, 210, 198, 220, 235, 250]
product_b = [80, 95, 110, 105, 115, 130, 125, 140, 155, 160, 170, 185]

# Matplotlib line plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(months, product_a, marker='o', linewidth=2, label='Product A', color='#2563eb')
ax.plot(months, product_b, marker='s', linewidth=2, label='Product B', color='#059669')
ax.set_xlabel('Month')
ax.set_ylabel('Sales (units)')
ax.set_title('Monthly Sales Comparison', fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Seaborn lineplot (long-form data)
df_sales = pd.DataFrame({
    'Month': months * 2,
    'Sales': product_a + product_b,
    'Product': ['A']*12 + ['B']*12
})
sns.lineplot(x='Month', y='Sales', hue='Product', data=df_sales, marker='o')
plt.title('Seaborn Line Plot')
plt.show()

Correlation Heatmaps

Heatmaps use color intensity to represent values in a matrix. The most common analytical use is displaying a correlation matrix, which shows pairwise correlation coefficients between all numeric columns.

+1: perfect positive correlation
0: no linear correlation
-1: perfect negative correlation
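These reference values are easier to remember after computing a few coefficients directly. The sketch below uses np.corrcoef on hypothetical synthetic variables; the variable names and noise levels are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 500)        # hypothetical variable
noise = rng.normal(0, 1, 500)    # independent of x

pairs = {
    'strong positive': 2 * x + 0.3 * noise,
    'no linear relation': noise,
    'strong negative': -2 * x + 0.3 * noise,
    'perfect positive': 3 * x + 1,   # exact linear function of x
}

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_values = {name: np.corrcoef(x, y)[0, 1] for name, y in pairs.items()}

for name, r in r_values.items():
    print(f"{name:<20} r = {r:+.3f}")
```

A coefficient near 0 only rules out a linear relationship; a curved pattern can still be present, which is one reason to look at the scatterplot as well as the number.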

# --- Correlation Heatmap ---
df_corr = pd.DataFrame({
    'Math': np.random.normal(75, 10, 200),
    'Science': np.random.normal(70, 12, 200),
    'English': np.random.normal(68, 15, 200),
    'Hours_Study': np.random.normal(5, 2, 200),
})

# Add realistic correlations
df_corr['Science'] = df_corr['Math'] * 0.8 + np.random.normal(0, 8, 200)
df_corr['Hours_Study'] = df_corr['Math'] * 0.05 + np.random.normal(0, 2, 200)

# Compute correlation matrix
corr_matrix = df_corr.corr()

# Plot heatmap with annotations
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr_matrix,
            annot=True,          # show correlation values
            fmt='.2f',           # format to 2 decimal places
            cmap='coolwarm',     # diverging colormap
            center=0,            # center colormap at 0
            vmin=-1, vmax=1,     # full correlation range
            square=True,         # square cells
            linewidths=0.5,
            ax=ax)
ax.set_title('Correlation Heatmap', fontweight='bold', pad=15)
plt.tight_layout()
plt.show()

Key Parameters for sns.heatmap()
annot=True displays numeric values in each cell. fmt='.2f' controls decimal places. cmap='coolwarm' uses a diverging color scale (blue for negative, red for positive). center=0 ensures the color midpoint is at zero correlation.
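To check which end of a colormap corresponds to low or high values, you can sample it directly. This is a small sketch, assuming the same vmin=-1 and vmax=1 limits as the heatmap call above; the helper color_for is hypothetical and only rescales a correlation to the colormap's 0 to 1 input range.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so this runs without a display
import matplotlib.pyplot as plt

cmap = plt.get_cmap('coolwarm')
vmin, vmax = -1.0, 1.0  # same limits as the heatmap call above

def color_for(r):
    """Return the RGBA color the colormap assigns to correlation r."""
    position = (r - vmin) / (vmax - vmin)  # rescale [-1, 1] to [0, 1]
    return cmap(position)

low, mid, high = color_for(-1.0), color_for(0.0), color_for(1.0)

for label, (red, green, blue, _alpha) in [('r = -1', low), ('r =  0', mid), ('r = +1', high)]:
    print(f"{label}: R={red:.2f} G={green:.2f} B={blue:.2f}")
```

The blue channel dominates at r = -1 and the red channel at r = +1, with a near-neutral gray at the center.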

Figure, Axes, and Subplot Layouts

Matplotlib uses a Figure (the entire canvas) and Axes (individual plots) model. The function plt.subplots() creates both at once and is the recommended approach for multi-panel figures.

# --- Subplot layouts ---
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top-left: Histogram
axes[0, 0].hist(data_a, bins=20, color='#93c5fd', edgecolor='white')
axes[0, 0].set_title('Distribution of Scores')
axes[0, 0].set_xlabel('Score')
axes[0, 0].set_ylabel('Frequency')

# Top-right: Boxplot
axes[0, 1].boxplot([data_a, data_b], labels=['A', 'B'], patch_artist=True)
axes[0, 1].set_title('Group Comparison')

# Bottom-left: Scatter
axes[1, 0].scatter(df_scatter['Hours_Studied'], df_scatter['Exam_Score'],
                   alpha=0.6, color='#7c3aed')
axes[1, 0].set_title('Study Hours vs Score')
axes[1, 0].set_xlabel('Hours')
axes[1, 0].set_ylabel('Score')

# Bottom-right: Line plot
axes[1, 1].plot(months, product_a, marker='o', label='A')
axes[1, 1].plot(months, product_b, marker='s', label='B')
axes[1, 1].set_title('Monthly Trend')
axes[1, 1].legend()
axes[1, 1].tick_params(axis='x', rotation=45)

# Add an overall title, then adjust spacing between subplots
fig.suptitle('Dashboard: Multiple Visualization Types', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

fig, ax = plt.subplots()
This is the standard pattern. fig controls the overall figure (size, title, saving). ax (or axes for multi-plot) controls individual charts. When you have a 2D array of axes, access them with axes[row, col].
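A common source of indexing errors is that the second value returned by plt.subplots() changes shape with the grid size. The sketch below shows the cases side by side; the variable names are illustrative, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; omit in a notebook
import matplotlib.pyplot as plt

fig1, ax = plt.subplots()                            # a single Axes object, not an array
fig2, row = plt.subplots(1, 3)                       # 1D array: index with row[i]
fig3, grid = plt.subplots(2, 3)                      # 2D array: index with grid[r, c]
fig4, always2d = plt.subplots(1, 3, squeeze=False)   # stays 2D even for one row

print(row.shape, grid.shape, always2d.shape)

# .flat walks a grid of any shape in row-major order
for i, panel in enumerate(grid.flat):
    panel.set_title(f'Panel {i}')

plt.close('all')  # release the figures created above
```

Passing squeeze=False is a convenient option when the number of rows or columns is computed at runtime, since axes[row, col] then works for every grid size.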

5.1.2 Pros and Cons of Data Representations

Choosing the right visualization is just as important as creating one. The wrong chart can mislead, confuse, or hide key insights. This section covers when to use each chart type and how to critically evaluate visualization effectiveness.

Choosing the Right Chart

Bar Charts vs Histograms

Feature	Bar Chart	Histogram
Data type	Categorical	Continuous / numeric
X-axis	Discrete categories	Numeric ranges (bins)
Bar spacing	Gaps between bars	No gaps (adjacent bins)
Purpose	Compare counts/values across categories	Show distribution shape
Example	Sales by region	Distribution of student ages
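The two example rows of the table can be drawn side by side to make the difference concrete. This is a minimal sketch with made-up data: regions, sales, and ages are hypothetical values, not taken from any dataset.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; remove and call plt.show() to view
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data
regions = ['North', 'South', 'East', 'West']            # categorical
sales = [240, 180, 310, 205]
ages = np.random.default_rng(7).normal(21, 2.5, 300)    # continuous

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: one bar per category; the default width of 0.8 leaves gaps
bars = ax_bar.bar(regions, sales, color='#2563eb')
ax_bar.set_title('Bar chart: sales by region')
ax_bar.set_ylabel('Sales')

# Histogram: values are grouped into adjacent numeric bins
counts, edges, _ = ax_hist.hist(ages, bins=12, color='#93c5fd', edgecolor='white')
ax_hist.set_title('Histogram: distribution of student ages')
ax_hist.set_xlabel('Age')
ax_hist.set_ylabel('Frequency')

fig.tight_layout()
```

Reordering the bars of the bar chart changes nothing about its meaning, whereas the histogram's bins have a fixed numeric order.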

Scatter Plots vs Line Plots

Feature	Scatter Plot	Line Plot
Best for	Relationship between two variables	Trends over time or ordered data
X-axis	Any continuous variable	Typically time or sequential order
Data points	Individual points, not connected	Connected by lines
Reveals	Correlations, clusters, outliers	Trends, seasonality, rate of change
Example	Height vs weight	Stock price over months

Comprehensive Chart Selection Guide

Data Scenario	Best Chart	Why
Distribution of one variable	Histogram / KDE plot	Shows shape, central tendency, spread
Comparing groups	Boxplot / Violin plot	Shows median, quartiles, outliers per group
Relationship between two numeric variables	Scatterplot	Shows correlation, clusters, outliers
Trend over time	Line plot	Connects points to show change over time
Comparing categorical counts	Bar chart	Clear comparison of discrete categories
Pairwise correlations	Heatmap	Color-coded matrix of all variable pairs
Part-of-whole (few categories)	Stacked bar chart (preferred over pie)	Easier to compare proportions

Pie Charts: Limitations and Alternatives

Pie charts show proportions of a whole, but they have significant limitations:

Hard to compare: humans are poor at judging angles and areas accurately
Too many slices: more than 5-6 categories makes pie charts nearly unreadable
No precise comparison: it is difficult to compare slices of similar size
3D pie charts: distort proportions further and should always be avoided

Better alternatives:

Horizontal bar chart: easy to compare exact values
Stacked bar chart: shows part-of-whole while allowing comparison across groups
Treemap: good for hierarchical part-of-whole data

Exam Tip
The exam may ask you to identify which chart type is most appropriate for a given scenario. Remember: pie charts are rarely the best answer. Bar charts, histograms, boxplots, and scatterplots each serve a specific purpose — match the chart to the data type and the question you are trying to answer.

Critically Analyzing Visualization Effectiveness

When evaluating a visualization, ask:

Does the chart type match the data type (categorical, continuous, time-series)?
Is the y-axis starting from zero? (Truncated axes can exaggerate differences.)
Are axes labeled and units included?
Is the color scheme accessible (colorblind-friendly)?
Does the visualization answer the intended question clearly?
Is there unnecessary "chart junk" (3D effects, excessive gridlines, decorative elements)?
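The effect of a truncated y-axis can be quantified with simple arithmetic: the drawn height of a bar is its value minus the axis minimum. The sketch below uses a hypothetical helper, apparent_ratio, and made-up revenue figures.

```python
def apparent_ratio(a, b, axis_min=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at axis_min."""
    return (b - axis_min) / (a - axis_min)

q1_revenue, q2_revenue = 95.0, 100.0   # hypothetical values

true_ratio = apparent_ratio(q1_revenue, q2_revenue)                # axis starts at 0
truncated = apparent_ratio(q1_revenue, q2_revenue, axis_min=90)    # axis starts at 90

print(f"Axis from 0:  second bar is {true_ratio:.2f}x the height of the first")
print(f"Axis from 90: second bar is {truncated:.2f}x the height of the first")
```

A real difference of about 5% is drawn as a bar twice as tall once the axis starts at 90, which is why a zero baseline matters most for bar charts.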

5.1.3 Labeling, Annotating, and Refining

A visualization without proper labels, annotations, and styling is incomplete. This section covers how to transform raw plots into publication-quality, professional visualizations that communicate data insights effectively.

Labels and Titles

Every chart should have at minimum: a title, x-axis label, and y-axis label. These provide essential context so the viewer can interpret the chart without additional explanation.

# --- Basic labeling ---
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(months, product_a, marker='o')

# Title and labels
ax.set_title('Monthly Sales for Product A', fontsize=14, fontweight='bold', pad=15)
ax.set_xlabel('Month', fontsize=12, labelpad=10)
ax.set_ylabel('Units Sold', fontsize=12, labelpad=10)

# Alternative: plt-level functions (for single plot)
# plt.title('Monthly Sales'), plt.xlabel('Month'), plt.ylabel('Units Sold')

plt.tight_layout()
plt.show()

Annotations

Annotations highlight specific data points with text and arrows, drawing the viewer's attention to important findings.

# --- Annotations ---
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(months, product_a, marker='o', linewidth=2, color='#2563eb')

# plt.annotate() - adds text with an arrow pointing to data
ax.annotate('Peak Sales',
            xy=(11, 250),        # point to annotate (x, y)
            xytext=(8, 260),     # text position
            fontsize=11,
            fontweight='bold',
            arrowprops=dict(
                arrowstyle='->',
                color='#dc2626',
                linewidth=2
            ),
            color='#dc2626')

# plt.text() - adds text at a specific position (no arrow)
ax.text(3, 145, 'Growth Phase',
        fontsize=10, fontstyle='italic', color='gray',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='#fef3c7', alpha=0.8))

ax.set_title('Sales Trend with Annotations', fontweight='bold')
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
plt.tight_layout()
plt.show()

annotate() vs text()
plt.annotate() supports arrowprops to draw an arrow from the text to the data point. plt.text() places text at coordinates but has no arrow. Use annotate() when you need to point at a specific data value.
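Hard-coded coordinates such as xy=(11, 250) break as soon as the data changes. One possible approach, sketched below, is to locate the point with np.argmax and build the annotation from it; the text offsets are arbitrary choices for this dataset.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; remove and call plt.show() to view
import matplotlib.pyplot as plt
import numpy as np

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
product_a = [120, 135, 148, 162, 155, 170, 190, 210, 198, 220, 235, 250]

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(months, product_a, marker='o')

# String x-values are plotted at integer positions 0, 1, 2, ...
peak_idx = int(np.argmax(product_a))
peak_val = product_a[peak_idx]

# annotate(): text plus an arrow from xytext to xy
note = ax.annotate(f'Peak: {peak_val} ({months[peak_idx]})',
                   xy=(peak_idx, peak_val),
                   xytext=(peak_idx - 3, peak_val + 15),
                   arrowprops=dict(arrowstyle='->'))

# text(): label only, no arrow
label = ax.text(0, peak_val, 'Target level')

ax.set_ylim(top=peak_val + 30)  # leave room for the annotation text
fig.tight_layout()
```

Because the x-axis is categorical here, the annotation uses the integer position of the month rather than its name.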

Colors, Legends, and Full Customization

Color Customization

Colors convey meaning and improve readability. Matplotlib supports named colors, hex codes, RGB tuples, and colormaps.
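The color formats mentioned above are interchangeable, and matplotlib.colors can convert between them. This short sketch shows one example of each; the specific colors are arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend so pyplot imports without a display
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

# Three ways to specify a color
specs = {
    'named': 'red',
    'hex': '#2563eb',
    'rgb tuple': (0.0, 0.4, 1.0),   # each channel in the range 0 to 1
}

# Every spec resolves to the same internal form: an RGBA tuple of floats
as_rgba = {kind: mcolors.to_rgba(spec) for kind, spec in specs.items()}
for kind, rgba in as_rgba.items():
    print(f"{kind:<10} -> {tuple(round(c, 3) for c in rgba)}")

# Convert back to a hex string
red_hex = mcolors.to_hex('red')

# A colormap maps a number in [0, 1] to an RGBA color
viridis_mid = plt.get_cmap('viridis')(0.5)
print(red_hex, viridis_mid)
```

Any of these forms can be passed to a color= argument, so the choice is mostly about readability and consistency within a project.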

Legend Manipulation

Legends identify data series. You can control their position, font size, background, and border.

Axes and Tick Formatting

Customize axis limits, tick positions, tick labels, and gridlines for clarity.

# --- Full customization example ---
import matplotlib.ticker as mticker

fig, ax = plt.subplots(figsize=(10, 6))

# Plot data with custom colors
ax.plot(months, product_a, marker='o', linewidth=2.5, color='#2563eb',
        markersize=8, label='Product A',
        markerfacecolor='white', markeredgewidth=2)
ax.plot(months, product_b, marker='s', linewidth=2.5, color='#059669',
        markersize=8, label='Product B',
        markerfacecolor='white', markeredgewidth=2)

# Title and labels
ax.set_title('2024 Monthly Sales Comparison', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Units Sold', fontsize=12)

# Legend customization
ax.legend(
    loc='upper left',        # position
    fontsize=11,             # font size
    framealpha=0.9,          # background transparency
    edgecolor='#e2e8f0',     # border color
    facecolor='white',       # background color
    shadow=True,             # drop shadow
    title='Products',        # legend title
    title_fontsize=12
)

# Gridlines
ax.grid(True, linestyle='--', alpha=0.4, color='gray')
ax.set_axisbelow(True)       # grid behind data

# Axes formatting
ax.set_ylim(0, 300)
ax.yaxis.set_major_locator(mticker.MultipleLocator(50))
ax.tick_params(axis='both', labelsize=10)

# Annotation on peak
ax.annotate('Record High: 250 units',
            xy=(11, 250), xytext=(7, 270),
            fontsize=10, fontweight='bold', color='#dc2626',
            arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.5))

# Remove top and right spines for cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
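The example above controls tick positions with MultipleLocator. Tick label text can be customized in a similar way with a formatter. The sketch below uses mticker.FuncFormatter with a hypothetical helper, thousands, on made-up revenue figures.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; remove and call plt.show() to view
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Hypothetical data
years = ['2020', '2021', '2022', '2023', '2024']
revenue = [12500, 14800, 13900, 17200, 19650]

def thousands(value, pos):
    """Tick formatter: 15000 -> '15,000'. pos is the tick index (unused here)."""
    return f'{value:,.0f}'

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(years, revenue, marker='o')

ax.set_ylim(0, 25000)                                        # axis limits
ax.yaxis.set_major_locator(mticker.MultipleLocator(5000))    # tick positions
formatter = mticker.FuncFormatter(thousands)
ax.yaxis.set_major_formatter(formatter)                      # tick label text
ax.tick_params(axis='x', rotation=45)                        # rotate x tick labels
ax.set_ylabel('Revenue')
fig.tight_layout()
```

The locator decides where ticks go and the formatter decides how they read, so the two can be changed independently.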

Seaborn Styles and Palettes

Seaborn provides built-in style presets and color palettes that instantly improve the appearance of all Matplotlib and Seaborn plots.

# --- Seaborn styles and palettes ---

# Set the overall style
sns.set_style('whitegrid')   # Options: white, dark, whitegrid, darkgrid, ticks

# Set a color palette globally
sns.set_palette('Set2')      # Options: deep, muted, bright, pastel, dark, Set1, Set2...

# Use a specific palette in a single plot
sns.boxplot(x='Group', y='Score', data=df, palette='husl')

# Custom color palette
custom_palette = ['#2563eb', '#059669', '#7c3aed', '#d97706']
sns.set_palette(custom_palette)

# Available built-in styles:
# 'white'     - clean white background, no grid
# 'dark'      - dark gray background
# 'whitegrid' - white background with grid lines
# 'darkgrid'  - gray background with grid lines (default)
# 'ticks'     - white background with ticks on axes

# Reset to defaults
sns.reset_defaults()

Style	Best For
`whitegrid`	Quantitative data that benefits from gridlines for reading values
`darkgrid`	Default style; good for presentations with many plot elements
`white`	Minimalist look for publication-quality figures
`ticks`	Clean look with axis ticks for precise reading
`dark`	Dark background; rarely used in print
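Because sns.set_style() changes settings globally, it can be useful to inspect a style or apply it to one figure only. The sketch below uses sns.axes_style() for both purposes and sns.color_palette() to retrieve palette colors; the plotted numbers are placeholders.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend; omit in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# axes_style() returns the settings a style would apply, without applying them
whitegrid = sns.axes_style('whitegrid')
white = sns.axes_style('white')
print('grid on in whitegrid:', whitegrid['axes.grid'])
print('grid on in white:    ', white['axes.grid'])

# Used as a context manager, the style applies only to plots created inside the block
with sns.axes_style('ticks'):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])
plt.close(fig)

# color_palette() returns a list of RGB tuples that any plotting call can use
palette = sns.color_palette('Set2', 4)
print(palette.as_hex())
```

The context-manager form is a reasonable choice in shared notebooks, since it leaves the global style untouched for later cells.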

Visual Exploration and Data-Driven Decisions

Visualizations are not just for presenting results — they are a critical tool during exploratory data analysis (EDA). Before running statistical tests or building models, use plots to:

Generate hypotheses: a scatterplot may reveal a correlation you did not expect, leading to a new research question
Detect data quality issues: histograms can reveal unexpected gaps or spikes, boxplots expose outliers
Understand distributions: check normality assumptions before applying parametric tests
Guide feature selection: heatmaps highlight which variables are strongly correlated
Communicate findings: the right visualization makes complex data accessible to non-technical stakeholders
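Plots are usually paired with a few numeric checks during EDA. The following sketch, on hypothetical synthetic columns, counts missing values and computes skewness with pandas; it is meant as a complement to the histograms and boxplots described above, not a replacement for a formal normality test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df_eda = pd.DataFrame({
    'symmetric': rng.normal(50, 10, 1000),       # hypothetical, roughly normal
    'right_skewed': rng.exponential(10, 1000),   # hypothetical, long right tail
})
df_eda.iloc[::100, 0] = np.nan   # plant 10 missing values in the first column

# Data quality: how many values are missing per column?
missing = df_eda.isna().sum()

# Distribution shape: skewness near 0 suggests symmetry, large positive = right tail
skew = df_eda.skew()

print(missing)
print(skew.round(2))
print(df_eda.describe().round(1))
```

A column with large skewness is a candidate for a closer look with a histogram before applying methods that assume normality.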

Making Data-Driven Decisions
Always let the data guide your conclusions, not the other way around. A well-constructed visualization helps you see patterns objectively. When presenting, pair each visualization with a clear interpretation statement: "This chart shows that X increased by Y% during Z period, suggesting..."
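An interpretation statement is more convincing when its numbers come straight from the data. This is a minimal sketch using the Product A sales figures from the line plot example; the wording of the sentence is only a template.

```python
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
product_a = [120, 135, 148, 162, 155, 170, 190, 210, 198, 220, 235, 250]

start, end = product_a[0], product_a[-1]
pct_change = (end - start) / start * 100   # percent change from first to last month

statement = (
    f"Product A sales rose from {start} to {end} units between "
    f"{months[0]} and {months[-1]}, an increase of {pct_change:.1f}%."
)
print(statement)
```

Computing the figure in code keeps the caption in sync with the chart if the underlying data is updated.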

Practice Quiz: Data Visualization Techniques

Q1. Which component of a boxplot represents the interquartile range (IQR)?

A) The whiskers extending from the box

B) The box itself, spanning from Q1 to Q3

C) The line inside the box

D) The individual points beyond the whiskers

Answer: B. The box in a boxplot spans from the first quartile (Q1) to the third quartile (Q3). The distance between Q1 and Q3 is the interquartile range (IQR). The line inside the box is the median, the whiskers extend up to 1.5 × IQR beyond the box, and points beyond the whiskers are outliers.

Q2. What is the primary difference between a histogram and a bar chart?

A) Histograms use vertical bars; bar charts use horizontal bars

B) Histograms have colors; bar charts are grayscale

C) Histograms display continuous numeric data with adjacent bins; bar charts display categorical data with gaps between bars

D) There is no meaningful difference; the terms are interchangeable

Answer: C. Histograms are used for continuous numeric data. Values are grouped into bins, and the bars touch each other because the data is continuous. Bar charts represent discrete categorical data, and the bars have gaps between them to emphasize that categories are separate.

Q3. Which Seaborn function is used to create a correlation heatmap from a DataFrame?

A) sns.corrplot(df)

B) sns.heatmap(df.corr())

C) sns.correlation(df)

D) sns.pairplot(df.corr())

Answer: B. To create a correlation heatmap, first compute the correlation matrix with df.corr(), then pass it to sns.heatmap(). Common parameters include annot=True to show values and cmap='coolwarm' for a diverging color scheme.

Q4. You want to visualize the trend of monthly revenue over two years. Which chart type is most appropriate?

A) Pie chart

B) Boxplot

C) Line plot

D) Histogram

Answer: C. Line plots are the best choice for showing trends over time. They connect data points chronologically, making it easy to see increases, decreases, and seasonal patterns. A pie chart shows proportions, a boxplot shows distribution summaries, and a histogram shows frequency distributions.

Q5. What does the annot=True parameter do in sns.heatmap()?

A) Adds a title annotation above the heatmap

B) Adds axis labels automatically

C) Displays the numeric value inside each cell of the heatmap

D) Annotates outlier values with arrows

Answer: C. Setting annot=True in sns.heatmap() writes the numeric value of each cell directly on the heatmap. This makes it easier to read exact correlation coefficients. Use fmt='.2f' to control the number of decimal places displayed.

Q6. Which function adds text with an arrow pointing to a specific data point on a Matplotlib plot?

A) plt.text()

B) plt.annotate()

C) plt.label()

D) plt.arrow()

Answer: B. plt.annotate() (or ax.annotate()) is used to add text with an arrow pointing to a specific data point. It uses the xy parameter for the point location and xytext for the text position. plt.text() places text at coordinates but does not support arrows via arrowprops.

Q7. What is the correct way to create a 2-row, 3-column grid of subplots in Matplotlib?

A) fig, ax = plt.subplot(2, 3)

B) fig, axes = plt.subplots(2, 3)

C) fig = plt.figure(2, 3)

D) fig, axes = plt.grid(2, 3)

Answer: B. plt.subplots(nrows, ncols) creates a Figure and an array of Axes objects. For a 2×3 grid, fig, axes = plt.subplots(2, 3) returns axes as a 2D NumPy array where you access individual plots with axes[row, col]. Note the plural "subplots" (not "subplot").

Q8. Which of the following is a major limitation of pie charts?

A) They cannot display more than two categories

B) Humans find it difficult to accurately compare angles and areas, especially for similar-sized slices

C) They can only display continuous numerical data

D) They require more computational resources than other chart types

Answer: B. The primary limitation of pie charts is that humans are poor at judging and comparing angles and areas. When slices are similar in size, it becomes nearly impossible to determine which is larger. This is why bar charts are generally preferred: the human visual system is much better at comparing lengths.

Q9. What does sns.set_style('whitegrid') do?

A) Sets the background color of all plots to white

B) Applies a Seaborn theme with a white background and visible grid lines to all subsequent plots

C) Adds a white grid overlay on top of the current plot only

D) Resets all Seaborn settings to default

Answer: B. sns.set_style('whitegrid') sets a global aesthetic theme for all subsequent plots. The 'whitegrid' style provides a clean white background with horizontal and vertical grid lines, which is excellent for quantitative data where precise value reading is important.

Q10. You want to explore the relationship between two numeric variables and also encode a third categorical variable using color. Which approach is best?

A) plt.hist(x, color=category)

B) sns.scatterplot(x='var1', y='var2', hue='category', data=df)

C) sns.heatmap(df[['var1', 'var2']])

D) plt.pie(df['var1'], labels=df['category'])

Answer: B. A scatterplot shows the relationship between two numeric variables, and the hue parameter in sns.scatterplot() maps a categorical variable to point color. This lets you see both the bivariate relationship and how it differs across groups in a single chart.
