Data cleaning is the foundational step in any data analysis workflow. Real-world datasets are rarely clean; they contain missing values, duplicates, inconsistent formatting, and erroneous entries. Pandas provides a rich set of tools for handling missing data, replacing values, sorting, and filtering.
Handling Missing Values
Missing values appear as NaN (Not a Number) in Pandas. Two core methods let you handle them: dropna() to remove rows or columns with missing data, and fillna() to replace missing values with a specified value or strategy.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, np.nan, 30, 22, np.nan],
'Score': [88.5, 92.0, np.nan, 76.0, 95.5],
'City': ['NY', 'LA', 'NY', np.nan, 'LA']
})
# Detect missing values
print(df.isnull()) # Boolean mask of NaN values
print(df.isnull().sum()) # Count NaN per column
# dropna() - Remove rows with any NaN
df_clean = df.dropna()
print(df_clean)
# Output: Only Alice remains (row 0); every other row has at least one NaN
# dropna() with subset - Only consider specific columns
df_age_clean = df.dropna(subset=['Age'])
print(df_age_clean)
# Output: Drops rows where Age is NaN (Bob and Eve)
# fillna() - Replace NaN with a specific value
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])
# NaN replaced with mean of [25, 30, 22] = 25.666...
# Forward fill - propagate the last valid value forward
# (fillna(method='ffill') in older pandas; the method argument is deprecated)
df['City'] = df['City'].ffill()
print(df['City'])
# Diana's City filled with Charlie's City ('NY')
Sorting Data
The sort_values() method sorts a DataFrame by one or more columns. You can control sort direction and how NaN values are positioned.
# Sort by a single column
df_sorted = df.sort_values('Score', ascending=False)
print(df_sorted)
# Sort by multiple columns
df_multi = df.sort_values(['City', 'Score'], ascending=[True, False])
print(df_multi)
# Control NaN placement ('last' is the default; use na_position='first' to put NaN rows first)
df_nan_last = df.sort_values('Score', na_position='last')
print(df_nan_last)
# sort_index() - Sort by index
df_by_index = df.sort_index(ascending=False)
print(df_by_index)
Replacing Values
The replace() method substitutes values across a DataFrame or Series. It is essential for fixing inconsistent entries like typos or standardizing categories.
# Replace specific values
df['City'] = df['City'].replace('NY', 'New York')
df['City'] = df['City'].replace('LA', 'Los Angeles')
# Replace multiple values at once using a dictionary
df['City'] = df['City'].replace({
'New York': 'NYC',
'Los Angeles': 'LA'
})
# Replace across the entire DataFrame
df = df.replace('unknown', np.nan)
Filtering Data
Boolean conditions allow you to filter rows based on criteria. You can chain multiple conditions with & (and) and | (or).
# Filter rows where Score is above 85
high_scorers = df[df['Score'] > 85]
print(high_scorers)
# Multiple conditions (use parentheses!)
result = df[(df['Score'] > 80) & (df['Age'] < 28)]
print(result)
# Filter with isin()
nyc_la = df[df['City'].isin(['NYC', 'LA'])]
print(nyc_la)
# Filter with string methods
a_names = df[df['Name'].str.startswith('A')]
print(a_names)
Exam Tip:
Remember that dropna() and fillna() return new DataFrames by default. Use inplace=True or assign the result back to modify the original. Also note that ffill() propagates the last valid observation forward, while bfill() uses the next valid observation; the older fillna(method='ffill') / fillna(method='bfill') spelling is deprecated in recent pandas releases.
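A minimal sketch of both behaviors, using throwaway data so the running example is untouched:
# Assigning the result vs. modifying in place
tmp = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
kept = tmp.dropna()          # returns a new DataFrame; tmp is unchanged
tmp.dropna(inplace=True)     # modifies tmp itself and returns None
# Forward fill vs. backward fill
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill().tolist())    # [1.0, 1.0, 1.0, 4.0] - last valid value carried forward
print(s.bfill().tolist())    # [1.0, 4.0, 4.0, 4.0] - next valid value carried backward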
4.1.2 Merge and Reshape Datasets
Combining datasets from multiple sources is a common operation. Pandas offers merge() for SQL-style joins, concat() for stacking DataFrames, and pivot()/melt() for reshaping between wide and long formats.
pd.merge() - SQL-Style Joins
The merge() function joins two DataFrames on a common column or index, supporting inner, outer, left, and right join types.
# Create two DataFrames
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 40],
'dept_name': ['Engineering', 'Marketing', 'Sales']
})
# INNER JOIN - Only matching rows from both
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print(inner)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice       10  Engineering
# 1       2      Bob       20    Marketing
# 2       3  Charlie       10  Engineering
# (row order of inner-join results may vary across pandas versions)
# Diana (dept 30) and Sales (dept 40) excluded
# LEFT JOIN - All rows from left, matching from right
left = pd.merge(employees, departments, on='dept_id', how='left')
print(left)
# Diana included with dept_name = NaN
# RIGHT JOIN - All rows from right, matching from left
right = pd.merge(employees, departments, on='dept_id', how='right')
print(right)
# Sales included with name = NaN
# OUTER JOIN - All rows from both DataFrames
outer = pd.merge(employees, departments, on='dept_id', how='outer')
print(outer)
# Both Diana and Sales included with NaN where no match
Join Type | Keeps From Left | Keeps From Right | Description
inner     | Matching only   | Matching only    | Intersection of keys
left      | All rows        | Matching only    | All left + matched right
right     | Matching only   | All rows         | Matched left + all right
outer     | All rows        | All rows         | Union of keys
pd.concat() - Stacking DataFrames
Use concat() to stack DataFrames vertically (rows) or horizontally (columns).
# Vertical stacking (axis=0, default)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8
# Horizontal stacking (axis=1)
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
# A B A B
# 0 1 3 5 7
# 1 2 4 6 8
# Handling mismatched columns
df3 = pd.DataFrame({'A': [9], 'C': [10]})
result = pd.concat([df1, df3], ignore_index=True)
print(result)
#    A    B     C
# 0  1  3.0   NaN
# 1  2  4.0   NaN
# 2  9  NaN  10.0
# (column A stays int64 since it has no NaN; B and C become float64)
pivot() and melt() - Reshaping Data
pivot() converts long-format data to wide-format, while melt() does the reverse. pivot_table() adds aggregation capability when duplicate entries exist.
# Long-format data
sales = pd.DataFrame({
'Date': ['2024-01', '2024-01', '2024-02', '2024-02'],
'Product': ['A', 'B', 'A', 'B'],
'Revenue': [100, 150, 120, 160]
})
# pivot() - Long to wide
wide = sales.pivot(index='Date', columns='Product', values='Revenue')
print(wide)
# Product A B
# Date
# 2024-01 100 150
# 2024-02 120 160
# melt() - Wide to long
long = wide.reset_index().melt(
id_vars='Date',
value_vars=['A', 'B'],
var_name='Product',
value_name='Revenue'
)
print(long)
# pivot_table() - With aggregation (handles duplicates)
sales_dup = pd.DataFrame({
'Date': ['2024-01', '2024-01', '2024-01', '2024-02'],
'Product': ['A', 'A', 'B', 'A'],
'Revenue': [100, 110, 150, 120]
})
pt = sales_dup.pivot_table(
index='Date',
columns='Product',
values='Revenue',
aggfunc='mean' # Aggregates duplicate entries
)
print(pt)
# Product A B
# Date
# 2024-01 105.0 150.0 (mean of 100 and 110)
# 2024-02 120.0 NaN
Key Difference: pivot() raises a ValueError if there are duplicate index-column pairs. Use pivot_table() with an aggfunc parameter when duplicates exist. Common aggfunc values: 'mean', 'sum', 'count', 'min', 'max'.
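A quick demonstration of this failure mode, reusing sales_dup from above (a minimal sketch):
try:
    sales_dup.pivot(index='Date', columns='Product', values='Revenue')
except ValueError as err:
    print(f"pivot() failed: {err}")
# pivot_table() aggregates the duplicates instead of raising
print(sales_dup.pivot_table(index='Date', columns='Product',
                            values='Revenue', aggfunc='sum'))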
4.1.3 Series vs DataFrames
Understanding the relationship and differences between Pandas Series and DataFrames is fundamental. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure made up of multiple Series sharing the same index.
Conceptual Differences
Feature          | Series              | DataFrame
Dimensions       | 1-D (single column) | 2-D (rows and columns)
Structure        | Index + Values      | Index + Columns (each column is a Series)
Creation         | pd.Series(data)     | pd.DataFrame(data)
Accessing column | N/A                 | Returns a Series
Has .columns     | No (has .name)      | Yes
Note: a Series can hold mixed types but is most efficient with a single dtype.
# Creating a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')
print(s)
# a 10
# b 20
# c 30
# Name: values, dtype: int64
print(type(s)) # <class 'pandas.core.series.Series'>
print(s.shape) # (3,)
print(s.ndim) # 1
# Creating a DataFrame
df = pd.DataFrame({
'x': [1, 2, 3],
'y': [4, 5, 6]
})
print(type(df)) # <class 'pandas.core.frame.DataFrame'>
print(df.shape) # (3, 2)
print(df.ndim) # 2
# A DataFrame column is a Series
col = df['x']
print(type(col)) # <class 'pandas.core.series.Series'>
Both Series and DataFrames support vectorized operations, meaning operations are applied element-wise without explicit loops. This is far more efficient than iterating over rows.
# Vectorized arithmetic on Series
s = pd.Series([10, 20, 30])
print(s * 2) # Series([20, 40, 60])
print(s + 5) # Series([15, 25, 35])
print(s ** 2) # Series([100, 400, 900])
# Vectorized operations on DataFrames
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df * 10)
# a b
# 0 10 30
# 1 20 40
# Applying functions
print(s.apply(lambda x: x ** 2 + 1)) # Series([101, 401, 901])
print(df.apply(np.sum, axis=0)) # Sum per column: a=3, b=7
print(df.apply(np.sum, axis=1)) # Sum per row: 0=4, 1=6
# String vectorized operations on Series
names = pd.Series(['alice', 'bob', 'charlie'])
print(names.str.upper()) # Series(['ALICE', 'BOB', 'CHARLIE'])
print(names.str.len()) # Series([5, 3, 7])
Remember:
When you select a single column from a DataFrame with df['col'], you get a Series. When you select multiple columns with df[['col1', 'col2']] (double brackets), you get a DataFrame. This distinction frequently appears on exams.
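A quick way to verify this, reusing the df with columns 'x' and 'y' from above:
col_series = df['x']      # single brackets -> Series
col_frame = df[['x']]     # double brackets -> one-column DataFrame
print(type(col_series))   # <class 'pandas.core.series.Series'>
print(type(col_frame))    # <class 'pandas.core.frame.DataFrame'>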
4.1.4 Data Access with .loc and .iloc
Pandas provides two primary accessors for selecting data: .loc for label-based indexing and .iloc for integer-based (positional) indexing. Understanding the difference is critical for the exam.
.loc - Label-Based Access
.loc selects data by row and column labels. Slicing with .loc is inclusive on both ends.
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'score': [88, 92, 85, 95]
}, index=['r1', 'r2', 'r3', 'r4'])
# Single row by label
print(df.loc['r2'])
# name Bob
# age 30
# score 92
# Single value
print(df.loc['r2', 'name']) # 'Bob'
# Row slice (INCLUSIVE on both ends)
print(df.loc['r1':'r3'])
# Returns rows r1, r2, AND r3
# Specific rows and columns
print(df.loc[['r1', 'r4'], ['name', 'score']])
# name score
# r1 Alice 88
# r4 Diana 95
# Setting values
df.loc['r2', 'score'] = 99
.iloc - Integer-Based Access
.iloc selects data by integer position (0-based). Slicing with .iloc is exclusive on the end (like Python lists).
# Single row by position
print(df.iloc[0])
# name Alice
# age 25
# score 88
# Single value
print(df.iloc[0, 1]) # 25 (row 0, column 1)
# Row slice (EXCLUSIVE on end, like Python)
print(df.iloc[0:3])
# Returns rows at position 0, 1, 2 (NOT 3)
# Specific rows and columns by position
print(df.iloc[[0, 3], [0, 2]])
# name score
# r1 Alice 88
# r4 Diana 95
# Last row
print(df.iloc[-1]) # Diana's row
Boolean Indexing with .loc
# Conditional selection with .loc
print(df.loc[df['age'] > 27])
# Returns Bob (30), Charlie (35), Diana (28)
# Conditional selection with specific columns
print(df.loc[df['score'] >= 90, ['name', 'score']])
# name score
# r2 Bob 99
# r4 Diana 95
# Complex boolean conditions
mask = (df['age'] < 30) & (df['score'] > 85)
print(df.loc[mask])
Critical Exam Point: .loc slicing is inclusive on both ends: df.loc['a':'c'] includes 'c'. .iloc slicing is exclusive on the end: df.iloc[0:3] returns positions 0, 1, 2 only. Confusing these is one of the most common exam mistakes.
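The trap is easiest to see on a default integer index, where the same slice bounds return different row counts; a minimal sketch with a hypothetical one-column frame:
di = pd.DataFrame({'v': [10, 20, 30, 40, 50]})  # default integer index 0..4
print(di.loc[0:3])   # 4 rows - labels 0, 1, 2, 3 (inclusive end)
print(di.iloc[0:3])  # 3 rows - positions 0, 1, 2 (exclusive end)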
Feature          | .loc               | .iloc
Based on         | Labels (names)     | Integer positions (0-based)
Slice end        | Inclusive          | Exclusive
Accepts booleans | Yes                | Yes
Use case         | Named rows/columns | Position-based selection
4.1.5 NumPy Array Operations
NumPy is the foundation of numerical computing in Python. Its core data structure, the ndarray, provides fast element-wise operations, broadcasting, and a rich set of mathematical functions. Pandas is built on top of NumPy.
Element-Wise Arithmetic
NumPy arrays support arithmetic operations that are applied element-by-element, unlike Python lists.
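For example, a minimal sketch (the (2, 1)-column case at the end shows broadcasting across different but compatible shapes):
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b)    # [11 22 33 44] - element-wise addition
print(a * b)    # [ 10  40  90 160] - element-wise multiplication
print(a ** 2)   # [ 1  4  9 16] - element-wise power
print(a * 2)    # [2 4 6 8] - the scalar is broadcast to every element
# Broadcasting with arrays of different shapes:
# a (4,) row combined with a (2, 1) column yields a (2, 4) result
col = np.array([[100], [200]])
print(col + a)
# [[101 102 103 104]
#  [201 202 203 204]]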
# Performance comparison
import time
size = 1_000_000
py_list = list(range(size))
np_array = np.arange(size)
# Python list: sum using a loop
start = time.time()
total = sum(py_list)
print(f"List sum: {time.time() - start:.4f}s")
# NumPy: vectorized sum
start = time.time()
total = np.sum(np_array)
print(f"NumPy sum: {time.time() - start:.4f}s")
# NumPy is typically 10-100x faster
Performance Insight:
NumPy arrays are stored in contiguous memory blocks and operations are implemented in optimized C code. This makes NumPy 10-100x faster than Python lists for numerical operations. Always prefer NumPy arrays over lists when performing mathematical computations.
4.1.6 GroupBy and Insights
The groupby() method is one of the most powerful tools in Pandas. It follows a split-apply-combine pattern: split data into groups, apply a function to each group, and combine the results back into a single structure.
Basic GroupBy Operations
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'HR', 'HR'],
'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'salary': [60000, 65000, 80000, 85000, 55000, 58000],
'experience': [3, 5, 7, 9, 2, 4]
})
# Group by department and calculate mean salary
print(df.groupby('department')['salary'].mean())
# department
# Engineering 82500.0
# HR 56500.0
# Sales 62500.0
# Multiple aggregations
print(df.groupby('department')['salary'].describe())
# Count per group
print(df.groupby('department').size())
# Sum per group
print(df.groupby('department')['salary'].sum())
The agg() Method
agg() lets you apply multiple aggregation functions at once, and even different functions to different columns.
# Multiple aggregations on one column
result = df.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count'])
print(result)
# mean min max count
# department
# Engineering 82500.0 80000 85000 2
# HR 56500.0 55000 58000 2
# Sales 62500.0 60000 65000 2
# Different functions per column
result = df.groupby('department').agg({
'salary': ['mean', 'sum'],
'experience': ['mean', 'max']
})
print(result)
# Named aggregations (cleaner output)
result = df.groupby('department').agg(
avg_salary=('salary', 'mean'),
total_salary=('salary', 'sum'),
max_experience=('experience', 'max'),
headcount=('employee', 'count')
)
print(result)
# avg_salary total_salary max_experience headcount
# department
# Engineering 82500.0 165000 9 2
# HR 56500.0 113000 4 2
# Sales 62500.0 125000 5 2
Pivot Tables
orders = pd.DataFrame({
'region': ['East', 'East', 'West', 'West', 'East', 'West'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
'sales': [200, 150, 300, 250, 180, 320]
})
# Pivot table with aggregation
pt = pd.pivot_table(
orders,
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print(pt)
# product Gadget Widget
# region
# East 150 380
# West 250 620
# Pivot table with margins (totals)
pt_margins = pd.pivot_table(
orders,
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print(pt_margins)
# product Gadget Widget Total
# region
# East 150 380 530
# West 250 620 870
# Total 400 1000 1400
Cross-Tabulation with pd.crosstab()
pd.crosstab() computes a frequency table of two or more factors. It is similar to pivot_table() but defaults to counting occurrences.
survey = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
'preference': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A']
})
# Basic cross-tabulation (frequency counts)
ct = pd.crosstab(survey['gender'], survey['preference'])
print(ct)
# preference A B
# gender
# F 2 2
# M 3 1
# Cross-tab with margins
ct_margins = pd.crosstab(
survey['gender'],
survey['preference'],
margins=True
)
print(ct_margins)
# preference A B All
# gender
# F 2 2 4
# M 3 1 4
# All 5 3 8
# Normalize to get proportions
ct_norm = pd.crosstab(
survey['gender'],
survey['preference'],
normalize='index' # Normalize across rows
)
print(ct_norm)
# preference A B
# gender
# F 0.50 0.50
# M 0.75 0.25
Extracting Insights Pattern:
A typical data analysis workflow combines groupby with multiple techniques:
df.groupby('category')['value'].agg(['mean', 'count']).sort_values('mean', ascending=False).
This groups the data, computes aggregations, and sorts to reveal the top-performing category.
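Applied to the salary DataFrame from the start of this section, the full pattern runs like this:
insights = (
    df.groupby('department')['salary']
      .agg(['mean', 'count'])
      .sort_values('mean', ascending=False)
)
print(insights)
#                 mean  count
# department
# Engineering  82500.0      2
# Sales        62500.0      2
# HR           56500.0      2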
Practice Quiz: Data Analysis with Pandas and NumPy
Test your understanding with these 10 questions. The correct answer and an explanation follow each question's options.
Q1: What does df.dropna(subset=['Age']) do?
A. Drops the 'Age' column from the DataFrame
B. Drops rows where the 'Age' column has NaN values
C. Fills NaN values in the 'Age' column with 0
D. Drops all rows that contain any NaN value
Answer: B. The subset parameter in dropna() specifies which columns to check for NaN values. Only rows where 'Age' is NaN are removed. Without subset, rows with NaN in any column would be dropped.
Q2: Given df with index labels ['a', 'b', 'c', 'd'], what does df.loc['a':'c'] return?
A. Rows 'a' and 'b' only
B. Rows 'a', 'b', and 'c'
C. Rows 'a', 'b', 'c', and 'd'
D. A KeyError because slicing is not supported with .loc
Answer: B. .loc uses label-based slicing, which is inclusive on both ends, so 'a':'c' includes rows labeled 'a', 'b', and 'c'. This is different from .iloc, which uses exclusive-end slicing like standard Python.
Q3: What is the result of np.array([1, 2, 3]) * 2?
A. [1, 2, 3, 1, 2, 3]
B. [2, 2, 2]
C. array([2, 4, 6])
D. A TypeError
Answer: C. NumPy arrays perform element-wise operations. Multiplying by a scalar multiplies each element individually, giving array([2, 4, 6]). A Python list would repeat instead: [1, 2, 3] * 2 gives [1, 2, 3, 1, 2, 3].
Q4: Which method is used to combine two DataFrames with an SQL-style join?
A. pd.concat()
B. pd.merge()
C. df.append()
D. df.join_sql()
Answer: B. pd.merge() performs SQL-style joins (inner, outer, left, right) on one or more keys. pd.concat() is used for stacking DataFrames along an axis, not for key-based joining.
Q5: What does df.iloc[0:3] return if the DataFrame has 5 rows?
A. The first 3 rows (positions 0, 1, 2)
B. The first 4 rows (positions 0, 1, 2, 3)
C. Only the row at position 3
D. All 5 rows
Answer: A. .iloc uses integer-based indexing with Python-style slicing, where the end index is exclusive, so iloc[0:3] returns rows at positions 0, 1, and 2 (three rows total).
Q6: What happens when you call pivot() on data that has duplicate index-column pairs?
A. It silently takes the first value
B. It averages the duplicate values
C. It raises a ValueError
D. It concatenates the duplicate values into a list
Answer: C. pivot() cannot handle duplicate entries and raises a ValueError. Use pivot_table() with an aggfunc parameter (e.g., 'mean', 'sum') to aggregate duplicate values.
Q7: What does df.groupby('dept')['salary'].agg(['mean', 'count']) return?
A. A Series with the mean salary per department
B. A DataFrame with mean and count columns grouped by department
C. A single float value representing the overall mean
D. A GroupBy object that needs further iteration
Answer: B. When you pass a list of aggregation functions to agg(), the result is a DataFrame with one column per function. The index contains the group labels (department names), and the columns are 'mean' and 'count'.
Q8: Which of the following correctly describes NumPy broadcasting?
A. It copies the smaller array multiple times in memory to match the larger array
B. It stretches the smaller array virtually to match dimensions, without extra memory copies
C. It only works when both arrays have identical shapes
D. It always raises an error when array shapes differ
Answer: B. Broadcasting is a memory-efficient mechanism where NumPy virtually stretches the smaller array to match the larger one without actually copying data. It works when dimensions are compatible (either equal or one of them is 1).
Q9: What is the output of type(df['column_name']) where df is a Pandas DataFrame?
A. <class 'pandas.core.frame.DataFrame'>
B. <class 'pandas.core.series.Series'>
C. <class 'numpy.ndarray'>
D. <class 'list'>
Answer: B. Selecting a single column from a DataFrame using bracket notation (df['col']) returns a Pandas Series. To get a DataFrame instead, use double brackets: df[['col']].
Q10: Which function converts a wide-format DataFrame to long format?
A. df.pivot()
B. pd.merge()
C. df.melt()
D. df.stack()
Answer: C. melt() unpivots a wide-format DataFrame into long format. It takes id_vars (columns to keep), value_vars (columns to unpivot), var_name, and value_name as parameters. pivot() does the opposite: long to wide.