Data cleaning is the foundational step in any data analysis workflow. Real-world datasets are rarely clean; they contain missing values, duplicates, inconsistent formatting, and erroneous entries. Pandas provides a rich set of tools for handling missing data, replacing values, sorting, and filtering.
Handling Missing Values
Missing values appear as NaN (Not a Number) in Pandas. Two core methods let you handle them: dropna() to remove rows or columns with missing data, and fillna() to replace missing values with a specified value or strategy.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'Age': [25, np.nan, 30, 22, np.nan],
'Score': [88.5, 92.0, np.nan, 76.0, 95.5],
'City': ['NY', 'LA', 'NY', np.nan, 'LA']
})
# Detect missing values
print(df.isnull()) # Boolean mask of NaN values
print(df.isnull().sum()) # Count NaN per column
# dropna() - Remove rows with any NaN
df_clean = df.dropna()
print(df_clean)
# Output: Only Alice remains (row 0); every other row has at least one NaN
# dropna() with subset - Only consider specific columns
df_age_clean = df.dropna(subset=['Age'])
print(df_age_clean)
# Output: Drops rows where Age is NaN (Bob and Eve)
# fillna() - Replace NaN with a specific value
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'])
# NaN replaced with mean of [25, 30, 22] = 25.666...
# Forward fill - propagate the last valid value forward
# (fillna(method='ffill') in older pandas; the method argument is deprecated)
df['City'] = df['City'].ffill()
print(df['City'])
# Diana's City filled with Charlie's City ('NY')
Sorting Data
The sort_values() method sorts a DataFrame by one or more columns. You can control sort direction and how NaN values are positioned.
# Sort by a single column
df_sorted = df.sort_values('Score', ascending=False)
print(df_sorted)
# Sort by multiple columns
df_multi = df.sort_values(['City', 'Score'], ascending=[True, False])
print(df_multi)
# Control NaN placement ('last' is the default; use na_position='first' to put NaN rows first)
df_nan_last = df.sort_values('Score', na_position='last')
print(df_nan_last)
# sort_index() - Sort by index
df_by_index = df.sort_index(ascending=False)
print(df_by_index)
Replacing Values
The replace() method substitutes values across a DataFrame or Series. It is essential for fixing inconsistent entries like typos or standardizing categories.
# Replace specific values
df['City'] = df['City'].replace('NY', 'New York')
df['City'] = df['City'].replace('LA', 'Los Angeles')
# Replace multiple values at once using a dictionary
df['City'] = df['City'].replace({
'New York': 'NYC',
'Los Angeles': 'LA'
})
# Replace across the entire DataFrame
df = df.replace('unknown', np.nan)
Filtering Data
Boolean conditions allow you to filter rows based on criteria. You can chain multiple conditions with & (and) and | (or).
# Filter rows where Score is above 85
high_scorers = df[df['Score'] > 85]
print(high_scorers)
# Multiple conditions (use parentheses!)
result = df[(df['Score'] > 80) & (df['Age'] < 28)]
print(result)
# Filter with isin()
nyc_la = df[df['City'].isin(['NYC', 'LA'])]
print(nyc_la)
# Filter with string methods
a_names = df[df['Name'].str.startswith('A')]
print(a_names)
Exam Tip:
Remember that dropna() and fillna() return new DataFrames by default. Use inplace=True or assign the result back to modify the original. Also note that ffill() propagates the last valid observation forward, while bfill() uses the next valid observation; the older fillna(method='ffill') / fillna(method='bfill') spelling is deprecated in recent pandas releases.
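A minimal sketch of both behaviors, using throwaway data so the running example is untouched:
# Assigning the result vs. modifying in place
tmp = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
kept = tmp.dropna()          # returns a new DataFrame; tmp is unchanged
tmp.dropna(inplace=True)     # modifies tmp itself and returns None
# Forward fill vs. backward fill
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.ffill().tolist())    # [1.0, 1.0, 1.0, 4.0] - last valid value carried forward
print(s.bfill().tolist())    # [1.0, 4.0, 4.0, 4.0] - next valid value carried backward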
4.1.2 Merge and Reshape Datasets
Combining datasets from multiple sources is a common operation. Pandas offers merge() for SQL-style joins, concat() for stacking DataFrames, and pivot()/melt() for reshaping between wide and long formats.
pd.merge() - SQL-Style Joins
The merge() function joins two DataFrames on a common column or index, supporting inner, outer, left, and right join types.
# Create two DataFrames
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'dept_id': [10, 20, 10, 30]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 40],
'dept_name': ['Engineering', 'Marketing', 'Sales']
})
# INNER JOIN - Only matching rows from both
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print(inner)
#    emp_id     name  dept_id    dept_name
# 0       1    Alice       10  Engineering
# 1       2      Bob       20    Marketing
# 2       3  Charlie       10  Engineering
# (row order of inner-join results may vary across pandas versions)
# Diana (dept 30) and Sales (dept 40) excluded
# LEFT JOIN - All rows from left, matching from right
left = pd.merge(employees, departments, on='dept_id', how='left')
print(left)
# Diana included with dept_name = NaN
# RIGHT JOIN - All rows from right, matching from left
right = pd.merge(employees, departments, on='dept_id', how='right')
print(right)
# Sales included with name = NaN
# OUTER JOIN - All rows from both DataFrames
outer = pd.merge(employees, departments, on='dept_id', how='outer')
print(outer)
# Both Diana and Sales included with NaN where no match
Join Type | Keeps From Left | Keeps From Right | Description
inner     | Matching only   | Matching only    | Intersection of keys
left      | All rows        | Matching only    | All left + matched right
right     | Matching only   | All rows         | Matched left + all right
outer     | All rows        | All rows         | Union of keys
pd.concat() - Stacking DataFrames
Use concat() to stack DataFrames vertically (rows) or horizontally (columns).
# Vertical stacking (axis=0, default)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8
# Horizontal stacking (axis=1)
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
# A B A B
# 0 1 3 5 7
# 1 2 4 6 8
# Handling mismatched columns
df3 = pd.DataFrame({'A': [9], 'C': [10]})
result = pd.concat([df1, df3], ignore_index=True)
print(result)
#    A    B     C
# 0  1  3.0   NaN
# 1  2  4.0   NaN
# 2  9  NaN  10.0
# (column A stays int64 since it has no NaN; B and C become float64)
pivot() and melt() - Reshaping Data
pivot() converts long-format data to wide-format, while melt() does the reverse. pivot_table() adds aggregation capability when duplicate entries exist.
# Long-format data
sales = pd.DataFrame({
'Date': ['2024-01', '2024-01', '2024-02', '2024-02'],
'Product': ['A', 'B', 'A', 'B'],
'Revenue': [100, 150, 120, 160]
})
# pivot() - Long to wide
wide = sales.pivot(index='Date', columns='Product', values='Revenue')
print(wide)
# Product A B
# Date
# 2024-01 100 150
# 2024-02 120 160
# melt() - Wide to long
long = wide.reset_index().melt(
id_vars='Date',
value_vars=['A', 'B'],
var_name='Product',
value_name='Revenue'
)
print(long)
# pivot_table() - With aggregation (handles duplicates)
sales_dup = pd.DataFrame({
'Date': ['2024-01', '2024-01', '2024-01', '2024-02'],
'Product': ['A', 'A', 'B', 'A'],
'Revenue': [100, 110, 150, 120]
})
pt = sales_dup.pivot_table(
index='Date',
columns='Product',
values='Revenue',
aggfunc='mean' # Aggregates duplicate entries
)
print(pt)
# Product A B
# Date
# 2024-01 105.0 150.0 (mean of 100 and 110)
# 2024-02 120.0 NaN
Key Difference: pivot() raises a ValueError if there are duplicate index-column pairs. Use pivot_table() with an aggfunc parameter when duplicates exist. Common aggfunc values: 'mean', 'sum', 'count', 'min', 'max'.
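A quick demonstration of this failure mode, reusing sales_dup from above (a minimal sketch):
try:
    sales_dup.pivot(index='Date', columns='Product', values='Revenue')
except ValueError as err:
    print(f"pivot() failed: {err}")
# pivot_table() aggregates the duplicates instead of raising
print(sales_dup.pivot_table(index='Date', columns='Product',
                            values='Revenue', aggfunc='sum'))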
4.1.3 Series vs DataFrames
Understanding the relationship and differences between Pandas Series and DataFrames is fundamental. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure made up of multiple Series sharing the same index.
Conceptual Differences
Feature          | Series              | DataFrame
Dimensions       | 1-D (single column) | 2-D (rows and columns)
Structure        | Index + Values      | Index + Columns (each column is a Series)
Creation         | pd.Series(data)     | pd.DataFrame(data)
Accessing column | N/A                 | Returns a Series
Has .columns     | No (has .name)      | Yes
Note: a Series can hold mixed types but is most efficient with a single dtype.
# Creating a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')
print(s)
# a 10
# b 20
# c 30
# Name: values, dtype: int64
print(type(s)) # <class 'pandas.core.series.Series'>
print(s.shape) # (3,)
print(s.ndim) # 1
# Creating a DataFrame
df = pd.DataFrame({
'x': [1, 2, 3],
'y': [4, 5, 6]
})
print(type(df)) # <class 'pandas.core.frame.DataFrame'>
print(df.shape) # (3, 2)
print(df.ndim) # 2
# A DataFrame column is a Series
col = df['x']
print(type(col)) # <class 'pandas.core.series.Series'>
Both Series and DataFrames support vectorized operations, meaning operations are applied element-wise without explicit loops. This is far more efficient than iterating over rows.
# Vectorized arithmetic on Series
s = pd.Series([10, 20, 30])
print(s * 2) # Series([20, 40, 60])
print(s + 5) # Series([15, 25, 35])
print(s ** 2) # Series([100, 400, 900])
# Vectorized operations on DataFrames
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df * 10)
# a b
# 0 10 30
# 1 20 40
# Applying functions
print(s.apply(lambda x: x ** 2 + 1)) # Series([101, 401, 901])
print(df.apply(np.sum, axis=0)) # Sum per column: a=3, b=7
print(df.apply(np.sum, axis=1)) # Sum per row: 0=4, 1=6
# String vectorized operations on Series
names = pd.Series(['alice', 'bob', 'charlie'])
print(names.str.upper()) # Series(['ALICE', 'BOB', 'CHARLIE'])
print(names.str.len()) # Series([5, 3, 7])
Remember:
When you select a single column from a DataFrame with df['col'], you get a Series. When you select multiple columns with df[['col1', 'col2']] (double brackets), you get a DataFrame. This distinction frequently appears on exams.
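A quick way to verify this, reusing the df with columns 'x' and 'y' from above:
col_series = df['x']      # single brackets -> Series
col_frame = df[['x']]     # double brackets -> one-column DataFrame
print(type(col_series))   # <class 'pandas.core.series.Series'>
print(type(col_frame))    # <class 'pandas.core.frame.DataFrame'>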
4.1.4 Data Access with .loc and .iloc
Pandas provides two primary accessors for selecting data: .loc for label-based indexing and .iloc for integer-based (positional) indexing. Understanding the difference is critical for the exam.
.loc - Label-Based Access
.loc selects data by row and column labels. Slicing with .loc is inclusive on both ends.
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'score': [88, 92, 85, 95]
}, index=['r1', 'r2', 'r3', 'r4'])
# Single row by label
print(df.loc['r2'])
# name Bob
# age 30
# score 92
# Single value
print(df.loc['r2', 'name']) # 'Bob'
# Row slice (INCLUSIVE on both ends)
print(df.loc['r1':'r3'])
# Returns rows r1, r2, AND r3
# Specific rows and columns
print(df.loc[['r1', 'r4'], ['name', 'score']])
# name score
# r1 Alice 88
# r4 Diana 95
# Setting values
df.loc['r2', 'score'] = 99
.iloc - Integer-Based Access
.iloc selects data by integer position (0-based). Slicing with .iloc is exclusive on the end (like Python lists).
# Single row by position
print(df.iloc[0])
# name Alice
# age 25
# score 88
# Single value
print(df.iloc[0, 1]) # 25 (row 0, column 1)
# Row slice (EXCLUSIVE on end, like Python)
print(df.iloc[0:3])
# Returns rows at position 0, 1, 2 (NOT 3)
# Specific rows and columns by position
print(df.iloc[[0, 3], [0, 2]])
# name score
# r1 Alice 88
# r4 Diana 95
# Last row
print(df.iloc[-1]) # Diana's row
Boolean Indexing with .loc
# Conditional selection with .loc
print(df.loc[df['age'] > 27])
# Returns Bob (30), Charlie (35), Diana (28)
# Conditional selection with specific columns
print(df.loc[df['score'] >= 90, ['name', 'score']])
# name score
# r2 Bob 99
# r4 Diana 95
# Complex boolean conditions
mask = (df['age'] < 30) & (df['score'] > 85)
print(df.loc[mask])
Critical Exam Point: .loc slicing is inclusive on both ends: df.loc['a':'c'] includes 'c'. .iloc slicing is exclusive on the end: df.iloc[0:3] returns positions 0, 1, 2 only. Confusing these is one of the most common exam mistakes.
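The trap is easiest to see on a default integer index, where the same slice bounds return different row counts; a minimal sketch with a hypothetical one-column frame:
di = pd.DataFrame({'v': [10, 20, 30, 40, 50]})  # default integer index 0..4
print(di.loc[0:3])   # 4 rows - labels 0, 1, 2, 3 (inclusive end)
print(di.iloc[0:3])  # 3 rows - positions 0, 1, 2 (exclusive end)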
Feature          | .loc               | .iloc
Based on         | Labels (names)     | Integer positions (0-based)
Slice end        | Inclusive          | Exclusive
Accepts booleans | Yes                | Yes
Use case         | Named rows/columns | Position-based selection
4.1.5 NumPy Array Operations
NumPy is the foundation of numerical computing in Python. Its core data structure, the ndarray, provides fast element-wise operations, broadcasting, and a rich set of mathematical functions. Pandas is built on top of NumPy.
Element-Wise Arithmetic
NumPy arrays support arithmetic operations that are applied element-by-element, unlike Python lists.
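For example, a minimal sketch (the (2, 1)-column case at the end shows broadcasting across different but compatible shapes):
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b)    # [11 22 33 44] - element-wise addition
print(a * b)    # [ 10  40  90 160] - element-wise multiplication
print(a ** 2)   # [ 1  4  9 16] - element-wise power
print(a * 2)    # [2 4 6 8] - the scalar is broadcast to every element
# Broadcasting with arrays of different shapes:
# a (4,) row combined with a (2, 1) column yields a (2, 4) result
col = np.array([[100], [200]])
print(col + a)
# [[101 102 103 104]
#  [201 202 203 204]]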
# Performance comparison
import time
size = 1_000_000
py_list = list(range(size))
np_array = np.arange(size)
# Python list: sum using a loop
start = time.time()
total = sum(py_list)
print(f"List sum: {time.time() - start:.4f}s")
# NumPy: vectorized sum
start = time.time()
total = np.sum(np_array)
print(f"NumPy sum: {time.time() - start:.4f}s")
# NumPy is typically 10-100x faster
Performance Insight:
NumPy arrays are stored in contiguous memory blocks and operations are implemented in optimized C code. This makes NumPy 10-100x faster than Python lists for numerical operations. Always prefer NumPy arrays over lists when performing mathematical computations.
4.1.6 GroupBy and Insights
The groupby() method is one of the most powerful tools in Pandas. It follows a split-apply-combine pattern: split data into groups, apply a function to each group, and combine the results back into a single structure.
Basic GroupBy Operations
df = pd.DataFrame({
'department': ['Sales', 'Sales', 'Engineering', 'Engineering', 'HR', 'HR'],
'employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
'salary': [60000, 65000, 80000, 85000, 55000, 58000],
'experience': [3, 5, 7, 9, 2, 4]
})
# Group by department and calculate mean salary
print(df.groupby('department')['salary'].mean())
# department
# Engineering 82500.0
# HR 56500.0
# Sales 62500.0
# Multiple aggregations
print(df.groupby('department')['salary'].describe())
# Count per group
print(df.groupby('department').size())
# Sum per group
print(df.groupby('department')['salary'].sum())
The agg() Method
agg() lets you apply multiple aggregation functions at once, and even different functions to different columns.
# Multiple aggregations on one column
result = df.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count'])
print(result)
# mean min max count
# department
# Engineering 82500.0 80000 85000 2
# HR 56500.0 55000 58000 2
# Sales 62500.0 60000 65000 2
# Different functions per column
result = df.groupby('department').agg({
'salary': ['mean', 'sum'],
'experience': ['mean', 'max']
})
print(result)
# Named aggregations (cleaner output)
result = df.groupby('department').agg(
avg_salary=('salary', 'mean'),
total_salary=('salary', 'sum'),
max_experience=('experience', 'max'),
headcount=('employee', 'count')
)
print(result)
# avg_salary total_salary max_experience headcount
# department
# Engineering 82500.0 165000 9 2
# HR 56500.0 113000 4 2
# Sales 62500.0 125000 5 2
Pivot Tables
orders = pd.DataFrame({
'region': ['East', 'East', 'West', 'West', 'East', 'West'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Widget'],
'sales': [200, 150, 300, 250, 180, 320]
})
# Pivot table with aggregation
pt = pd.pivot_table(
orders,
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print(pt)
# product Gadget Widget
# region
# East 150 380
# West 250 620
# Pivot table with margins (totals)
pt_margins = pd.pivot_table(
orders,
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print(pt_margins)
# product Gadget Widget Total
# region
# East 150 380 530
# West 250 620 870
# Total 400 1000 1400
Cross-Tabulation with pd.crosstab()
pd.crosstab() computes a frequency table of two or more factors. It is similar to pivot_table() but defaults to counting occurrences.
survey = pd.DataFrame({
'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
'preference': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A']
})
# Basic cross-tabulation (frequency counts)
ct = pd.crosstab(survey['gender'], survey['preference'])
print(ct)
# preference A B
# gender
# F 2 2
# M 3 1
# Cross-tab with margins
ct_margins = pd.crosstab(
survey['gender'],
survey['preference'],
margins=True
)
print(ct_margins)
# preference A B All
# gender
# F 2 2 4
# M 3 1 4
# All 5 3 8
# Normalize to get proportions
ct_norm = pd.crosstab(
survey['gender'],
survey['preference'],
normalize='index' # Normalize across rows
)
print(ct_norm)
# preference A B
# gender
# F 0.50 0.50
# M 0.75 0.25
Extracting Insights Pattern:
A typical data analysis workflow combines groupby with multiple techniques:
df.groupby('category')['value'].agg(['mean', 'count']).sort_values('mean', ascending=False).
This groups the data, computes aggregations, and sorts to reveal the top-performing category.
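Applied to the salary DataFrame from the start of this section, the full pattern runs like this:
insights = (
    df.groupby('department')['salary']
      .agg(['mean', 'count'])
      .sort_values('mean', ascending=False)
)
print(insights)
#                 mean  count
# department
# Engineering  82500.0      2
# Sales        62500.0      2
# HR           56500.0      2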
Practice Quiz: Data Analysis with Pandas and NumPy
Test your understanding with these 10 questions. The correct answer and an explanation follow each question's options.
Q1: What does df.dropna(subset=['Age']) do?
A. Drops the 'Age' column from the DataFrame
B. Drops rows where the 'Age' column has NaN values
C. Fills NaN values in the 'Age' column with 0
D. Drops all rows that contain any NaN value
Answer: B. The subset parameter in dropna() specifies which columns to check for NaN values. Only rows where 'Age' is NaN are removed. Without subset, rows with NaN in any column would be dropped.
Q2: Given df with index labels ['a', 'b', 'c', 'd'], what does df.loc['a':'c'] return?
A. Rows 'a' and 'b' only
B. Rows 'a', 'b', and 'c'
C. Rows 'a', 'b', 'c', and 'd'
D. A KeyError because slicing is not supported with .loc
Answer: B. .loc uses label-based slicing, which is inclusive on both ends, so 'a':'c' includes rows labeled 'a', 'b', and 'c'. This is different from .iloc, which uses exclusive-end slicing like standard Python.
Q3: What is the result of np.array([1, 2, 3]) * 2?
A. [1, 2, 3, 1, 2, 3]
B. [2, 2, 2]
C. array([2, 4, 6])
D. A TypeError
Answer: C. NumPy arrays perform element-wise operations. Multiplying by a scalar multiplies each element individually, giving array([2, 4, 6]). A Python list would repeat instead: [1, 2, 3] * 2 gives [1, 2, 3, 1, 2, 3].
Q4: Which method is used to combine two DataFrames with an SQL-style join?
A. pd.concat()
B. pd.merge()
C. df.append()
D. df.join_sql()
Answer: B. pd.merge() performs SQL-style joins (inner, outer, left, right) on one or more keys. pd.concat() is used for stacking DataFrames along an axis, not for key-based joining.
Q5: What does df.iloc[0:3] return if the DataFrame has 5 rows?
A. The first 3 rows (positions 0, 1, 2)
B. The first 4 rows (positions 0, 1, 2, 3)
C. Only the row at position 3
D. All 5 rows
Answer: A. .iloc uses integer-based indexing with Python-style slicing, where the end index is exclusive, so iloc[0:3] returns rows at positions 0, 1, and 2 (three rows total).
Q6: What happens when you call pivot() on data that has duplicate index-column pairs?
A. It silently takes the first value
B. It averages the duplicate values
C. It raises a ValueError
D. It concatenates the duplicate values into a list
Answer: C. pivot() cannot handle duplicate entries and raises a ValueError. Use pivot_table() with an aggfunc parameter (e.g., 'mean', 'sum') to aggregate duplicate values.
Q7: What does df.groupby('dept')['salary'].agg(['mean', 'count']) return?
A. A Series with the mean salary per department
B. A DataFrame with mean and count columns grouped by department
C. A single float value representing the overall mean
D. A GroupBy object that needs further iteration
Answer: B. When you pass a list of aggregation functions to agg(), the result is a DataFrame with one column per function. The index contains the group labels (department names), and the columns are 'mean' and 'count'.
Q8: Which of the following correctly describes NumPy broadcasting?
A. It copies the smaller array multiple times in memory to match the larger array
B. It stretches the smaller array virtually to match dimensions, without extra memory copies
C. It only works when both arrays have identical shapes
D. It always raises an error when array shapes differ
Answer: B. Broadcasting is a memory-efficient mechanism where NumPy virtually stretches the smaller array to match the larger one without actually copying data. It works when dimensions are compatible (either equal or one of them is 1).
Q9: What is the output of type(df['column_name']) where df is a Pandas DataFrame?
A. <class 'pandas.core.frame.DataFrame'>
B. <class 'pandas.core.series.Series'>
C. <class 'numpy.ndarray'>
D. <class 'list'>
Answer: B. Selecting a single column from a DataFrame using bracket notation (df['col']) returns a Pandas Series. To get a DataFrame instead, use double brackets: df[['col']].
Q10: Which function converts a wide-format DataFrame to long format?
A. df.pivot()
B. pd.merge()
C. df.melt()
D. df.stack()
Answer: C. melt() unpivots a wide-format DataFrame into long format. It takes id_vars (columns to keep), value_vars (columns to unpivot), var_name, and value_name as parameters. pivot() does the opposite: long to wide.