Data collection is the systematic process of gathering information from relevant sources to answer research questions, test hypotheses, or inform business decisions. The quality of any data analysis is fundamentally limited by the quality of the data collected. As the saying goes: "Garbage in, garbage out."
Data analysts must understand the full spectrum of collection methods, their trade-offs, and the ethical boundaries that govern them. In the exam, you are expected to know the differences between primary and secondary data collection, the advantages and limitations of each method, and how the choice of method affects downstream analysis.
Surveys, Interviews, and Web Scraping
Surveys
Surveys are one of the most widely used data collection instruments. They allow researchers to gather data from a large number of respondents efficiently. Surveys can be administered online (Google Forms, SurveyMonkey), by phone, by mail, or in person.
Closed-ended questions: Provide predefined answer options (e.g., Likert scales, multiple choice). Easy to quantify and analyze statistically.
Open-ended questions: Allow respondents to answer in their own words. Produce qualitative data that requires text analysis or manual coding.
Advantages: Cost-effective for large samples, standardized responses, easy to analyze quantitatively.
Limitations: Response bias, low response rates, limited depth of understanding, cannot capture context or body language.
Interviews
Interviews involve direct conversation between the researcher and the respondent. They provide richer, more nuanced data than surveys but are more time-consuming and expensive.
Structured interviews: Follow a rigid script of predetermined questions asked in a fixed order. Produce consistent, comparable data across respondents.
Semi-structured interviews: Have a guide of key topics but allow the interviewer to probe deeper based on responses. Balance between consistency and flexibility.
Unstructured interviews: Open-ended conversations with minimal predefined questions. Best for exploratory research where the topic is not well understood.
Advantages: Rich qualitative data, ability to clarify ambiguous responses, captures non-verbal cues (in person).
Limitations: Time-intensive, expensive, potential interviewer bias, difficult to scale, harder to quantify.
Web Scraping
Web scraping is the automated extraction of data from websites. It is particularly useful when data is publicly available on the web but not provided through a formal API or downloadable dataset.
Common tools: Python's BeautifulSoup (for parsing HTML), requests (for making HTTP requests), Scrapy (a full scraping framework), and Selenium (for JavaScript-rendered pages).
Typical use cases: Extracting product prices from e-commerce sites, gathering news articles, collecting job listings, monitoring social media mentions.
Advantages: Access to large amounts of publicly available data, automatable, can be scheduled to run periodically.
Limitations: Website structure changes can break scrapers, legal and ethical concerns (Terms of Service), rate limiting, potential for IP blocking.
Study Tip
This course expects you to know that BeautifulSoup parses HTML/XML content and that the requests library is used to fetch web pages. Know the difference between scraping and using an API.
Representative Sampling and Challenges in Data Collection
A sample is a subset of a larger population that is selected for study. The goal of sampling is to obtain a group that accurately represents the population so that findings can be generalized.
Types of Sampling
Simple Random: Every member of the population has an equal chance of being selected. Pros: eliminates selection bias. Cons: requires a complete list of the population.
Stratified: The population is divided into subgroups (strata), and random samples are drawn from each stratum. Pros: ensures representation of all subgroups. Cons: requires knowledge of population characteristics.
Cluster: The population is divided into clusters (often geographic), and entire clusters are randomly selected. Pros: practical for geographically dispersed populations. Cons: higher sampling error than simple random sampling.
Convenience: Individuals who are easiest to reach are selected. Pros: fast and inexpensive. Cons: high risk of bias; results are not generalizable.
Systematic: Every k-th individual is selected from an ordered list. Pros: simple to implement. Cons: can introduce bias if there is a pattern in the list.
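To make these methods concrete, here is a minimal pandas sketch of simple random, stratified, and systematic sampling; the population DataFrame and its region column are hypothetical placeholders.

import pandas as pd

# Hypothetical population frame with a stratification column
population = pd.DataFrame({
    "person_id": range(1, 1001),
    "region": ["East", "West", "North", "South"] * 250
})

# Simple random sample: every member has an equal chance of selection
simple_random = population.sample(n=100, random_state=42)

# Stratified sample: draw 10% from each region (stratum)
stratified = population.groupby("region").sample(frac=0.10, random_state=42)

# Systematic sample: every k-th individual from the ordered list
k = 10
systematic = population.iloc[::k]

print(len(simple_random), len(stratified), len(systematic))  # 100 100 100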
Common Challenges in Data Collection
Non-response bias: When the people who do not respond differ systematically from those who do. For example, a health survey might under-represent very busy or very ill individuals.
Selection bias: When the sample does not represent the population. For instance, conducting an online survey excludes people without internet access.
Measurement error: Inaccurate data due to faulty instruments, poorly worded questions, or respondent misunderstanding.
Social desirability bias: Respondents give answers they think are socially acceptable rather than truthful.
Missing data: Incomplete responses that can skew results if not handled properly (deletion, imputation, etc.); see the sketch after this list.
Data decay: Information becomes outdated over time, reducing its relevance and accuracy.
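To illustrate the missing-data challenge above, here is a small pandas sketch showing deletion and simple mean imputation; the survey_df DataFrame and its columns are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical survey responses with missing values
survey_df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "satisfaction": [4, 5, np.nan, 3, 4]
})

# Inspect how much data is missing per column
print(survey_df.isnull().sum())

# Option 1: deletion (drop incomplete responses)
complete_cases = survey_df.dropna()

# Option 2: imputation (fill missing values, here with the column mean)
imputed = survey_df.fillna(survey_df.mean())

print(len(complete_cases), len(imputed))  # 2 5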
Common Pitfall
Convenience sampling is the most common form of biased sampling. While it is fast, the results cannot be generalized to the broader population. For exam questions, always consider whether the sampling method introduces bias.
Qualitative vs. Quantitative Research
Understanding the distinction between qualitative and quantitative research is essential for choosing the right data collection approach.
Data Type: Quantitative research uses numerical data (counts, measurements, ratings); qualitative research uses text, images, audio, video, and observations.
Goal: Quantitative research tests hypotheses, measures relationships, and generalizes findings; qualitative research explores phenomena in depth, understands context and meaning, and generates hypotheses.
Researcher Role: Quantitative research strives for objectivity, keeping the researcher detached; qualitative research acknowledges subjectivity, with the researcher as part of the process.
Note
Many real-world data analysis projects use a mixed-methods approach that combines quantitative data (for statistical rigor) with qualitative data (for contextual understanding). For example, a customer satisfaction study might use survey scores (quantitative) supplemented by interview transcripts (qualitative).
Legal and Ethical Considerations
Data collection is governed by laws and ethical principles that protect individuals' privacy and rights. As a data analyst, understanding these regulations is not optional — it is a professional requirement.
GDPR (General Data Protection Regulation)
The GDPR is a European Union regulation that governs the collection, storage, and processing of personal data of EU residents. It applies to any organization that processes EU residents' data, regardless of where the organization is based.
Consent: Data subjects must give explicit, informed consent before their data is collected.
Right to access: Individuals can request a copy of all data an organization holds about them.
Right to erasure ("Right to be forgotten"): Individuals can request deletion of their personal data.
Data portability: Data subjects can request their data in a machine-readable format to transfer to another service.
Data minimization: Organizations should collect only the data that is strictly necessary for the stated purpose.
Breach notification: Organizations must report data breaches to the supervisory authority within 72 hours.
Penalties: Fines up to 20 million euros or 4% of global annual revenue, whichever is higher.
HIPAA (Health Insurance Portability and Accountability Act)
HIPAA is a U.S. federal law that protects sensitive patient health information (PHI) from being disclosed without the patient's consent or knowledge.
Protected Health Information (PHI): Any individually identifiable health information, including names, dates, medical records, and insurance details.
Covered entities: Healthcare providers, health plans, and healthcare clearinghouses.
Business associates: Third parties that handle PHI on behalf of covered entities must also comply.
Minimum necessary rule: Only the minimum amount of PHI needed for a particular purpose should be disclosed.
De-identification: HIPAA allows the use of health data that has been stripped of all 18 types of identifiers.
Other Key Regulations
CCPA (California Consumer Privacy Act): Gives California residents the right to know what personal data is collected, request deletion, and opt out of data sales.
FERPA (Family Educational Rights and Privacy Act): Protects the privacy of student education records in the U.S.
COPPA (Children's Online Privacy Protection Act): Regulates data collection from children under 13 in the U.S.
Important for Exam
Know that GDPR requires explicit consent and applies to EU residents regardless of where the data processor is located. HIPAA specifically protects health information in the U.S. These are the two most commonly tested regulations on the exam.
Data Anonymization and Personally Identifiable Information (PII)
Personally Identifiable Information (PII) is any data that can be used to identify a specific individual, either on its own or when combined with other data.
Examples of PII
Direct identifiers: Full name, Social Security number, email address, phone number, passport number, driver's license number.
Indirect identifiers (quasi-identifiers): Date of birth, ZIP code, gender, ethnicity, job title. Individually these may not identify a person, but in combination they often can.
Anonymization Techniques
Data Masking: Replace sensitive data with fictional but realistic data. Example: "John Smith" becomes "User_4821".
Generalization: Replace specific values with broader categories. Example: exact age "34" becomes age range "30-39".
Pseudonymization: Replace identifiers with pseudonyms; a mapping key exists but is stored separately. Example: patient "A. Jones" becomes "P-00742".
Data Swapping: Shuffle values between records so individual links are broken. Example: swap ZIP codes between records in a dataset.
Aggregation: Report data at the group level rather than the individual level. Example: report average salary by department instead of individual salaries.
K-anonymity: Ensure each record is indistinguishable from at least k-1 other records. Example: each combination of quasi-identifiers appears at least k times.
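The techniques above can be made concrete with a short pandas sketch of generalization and pseudonymization; the patients DataFrame, its columns, and the ID scheme are hypothetical, and the mapping key would be stored separately in practice.

import pandas as pd

# Hypothetical patient records containing PII
patients = pd.DataFrame({
    "name": ["A. Jones", "B. Lee", "C. Patel"],
    "age": [34, 57, 42],
    "zip_code": ["90210", "10001", "60614"]
})

# Generalization: replace exact ages with 10-year brackets
patients["age_range"] = pd.cut(
    patients["age"],
    bins=[0, 29, 39, 49, 59, 120],
    labels=["<30", "30-39", "40-49", "50-59", "60+"]
)

# Pseudonymization: replace names with coded IDs; keep the mapping key apart from the data
patients["patient_code"] = [f"P-{i:05d}" for i in range(1, len(patients) + 1)]
mapping_key = patients[["patient_code", "name"]]  # store securely and separately

# Release version: drop the direct identifier and the exact quasi-identifier
released = patients.drop(columns=["name", "age"])
print(released)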
Key Distinction
Anonymization is irreversible — you cannot re-identify individuals. Pseudonymization is reversible if you have the mapping key. GDPR considers pseudonymized data as still personal data, while truly anonymized data falls outside GDPR's scope.
Impact on Business Strategy and Decision-Making
Data collection is not an academic exercise — it directly drives business outcomes. The methods and quality of data collection shape the insights that analysts can extract, which in turn inform strategic decisions.
How Data Collection Impacts Business
Product development: Customer surveys and usage data reveal which features users want, guiding product roadmaps.
Market research: Collecting competitive intelligence and market trends helps organizations identify opportunities and threats.
Customer segmentation: Behavioral data enables businesses to group customers by purchasing patterns, demographics, or preferences for targeted marketing.
Risk assessment: Financial institutions collect transaction data to detect fraud and assess credit risk.
Operational efficiency: IoT sensors and log data help optimize supply chains, manufacturing processes, and resource allocation.
Compliance and reporting: Regulatory requirements often mandate specific data collection practices, especially in healthcare, finance, and education.
Data-Driven vs. Data-Informed Decision-Making
Data-driven: Decisions are made primarily based on data analysis, with minimal reliance on intuition. Suitable for repeatable, well-defined problems.
Data-informed: Data is one input among many (including domain expertise, stakeholder input, and strategic context). More appropriate for complex, ambiguous situations.
Key Point
Poor data collection methods lead to poor business decisions. If a survey has selection bias, the product features built based on its results may not serve the actual target audience. Always assess data quality before drawing conclusions.
Survey Design, Audience Selection, and Structured Interviews
Designing an effective data collection instrument requires careful planning. A poorly designed survey or interview can produce misleading results regardless of sample size.
Principles of Good Survey Design
Define objectives clearly: What specific questions does the survey need to answer? Every question should map to a research objective.
Keep it concise: Long surveys lead to respondent fatigue and lower completion rates. Aim for under 15 minutes.
Use clear, unambiguous language: Avoid jargon, double-barreled questions ("Do you like the price and quality?"), and leading questions ("Don't you agree that...?").
Order questions logically: Start with easy, non-sensitive questions. Group related questions together. Place demographic questions at the end.
Provide balanced response options: Likert scales should have an equal number of positive and negative options (e.g., "Strongly Agree" to "Strongly Disagree").
Pilot test: Run the survey with a small group first to identify confusing questions, technical issues, or unexpected response patterns.
Include a mix of question types: Use closed-ended questions for quantitative analysis and a few open-ended questions for deeper insights.
Audience Selection
Target population: Define who you want to study (e.g., "customers who purchased in the last 90 days").
Sampling frame: The list from which you will draw your sample (e.g., a customer database, email list).
Inclusion/exclusion criteria: Set clear rules for who qualifies (e.g., age 18+, English-speaking, active users).
Sample size considerations: Larger samples reduce sampling error but cost more. Statistical power analysis can determine the minimum needed sample size.
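As a brief illustration of the power-analysis point, here is a minimal sketch using statsmodels; the effect size, significance level, and power values are illustrative assumptions.

import math
from statsmodels.stats.power import TTestIndPower

# Solve for the minimum sample size per group in a two-sample comparison
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # expected standardized difference (Cohen's d)
    alpha=0.05,       # significance level
    power=0.8         # desired probability of detecting the effect
)
print("Minimum sample size per group:", math.ceil(n_per_group))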
Structured Interview Design
Prepare a fixed set of questions in advance, asked in the same order to every participant.
Use standardized probes (follow-up questions) to ensure consistency.
Train interviewers to minimize interviewer bias (tone, body language, interpretation).
Record interviews (with consent) for accurate transcription and analysis.
Define a coding scheme before analysis to categorize responses systematically.
Q1: Which data collection method is most appropriate for gathering in-depth, nuanced understanding of customer experiences?
A) Online survey with closed-ended questions
B) Semi-structured interviews
C) Web scraping of review websites
D) Automated log file analysis
Correct: B) Semi-structured interviews allow the interviewer to explore topics deeply with follow-up questions while maintaining a consistent framework. They produce rich qualitative data that captures nuances, emotions, and context that closed-ended surveys and automated methods cannot provide.
Q2: Under GDPR, which of the following is NOT a right granted to data subjects?
A) Right to access their personal data
B) Right to erasure (right to be forgotten)
C) Right to sell their personal data to third parties
D) Right to data portability
Correct: C) GDPR grants individuals the right to access, rectify, erase, restrict processing, data portability, and object to processing. It does NOT grant individuals the right to sell their own data. GDPR is about protecting personal data, not commercializing it.
Q3: Which Python library is used to parse HTML content when web scraping?
A) requests
B) BeautifulSoup
C) pandas
D) json
Correct: B) BeautifulSoup (from the bs4 package) is used to parse and navigate HTML/XML documents. The requests library is used to fetch the web page (HTTP request), but it does not parse the HTML structure. After fetching with requests, you pass the content to BeautifulSoup for parsing.
Q4: A researcher collects data only from people who voluntarily visit a shopping mall on a Saturday. This is an example of:
A) Simple random sampling
B) Stratified sampling
C) Convenience sampling
D) Cluster sampling
Correct: C) Convenience sampling selects participants based on ease of access rather than a systematic or random method. Surveying only mall visitors on a Saturday excludes people who do not visit malls, work on Saturdays, or live far away — making the sample non-representative of the general population.
Q5: Which of the following is an example of pseudonymization rather than true anonymization?
A) Removing all names and replacing them with nothing
B) Replacing patient names with coded IDs while keeping a separate mapping file
C) Reporting only aggregate statistics (averages, totals) without individual records
D) Generalizing exact ages into 10-year age brackets
Correct: B) Pseudonymization replaces identifiers with pseudonyms (coded IDs) while maintaining a separate key that can re-link the data to individuals. This makes re-identification possible with the key. True anonymization is irreversible — there is no way to trace data back to an individual. Under GDPR, pseudonymized data is still considered personal data.
1.1.2 Data Aggregation and Integration
In practice, data rarely lives in a single source. Data analysts must combine information from databases, APIs, spreadsheets, web pages, and other sources into a unified dataset for analysis. This process is called data integration, and the act of summarizing or combining records is called data aggregation.
Combining Data from Multiple Sources
Organizations typically store data across many systems. A retail company might have:
Relational databases (e.g., PostgreSQL, MySQL) for transactional records (orders, inventory).
APIs for real-time data from third-party services (payment processors, shipping providers, social media platforms).
File-based storage like CSV exports from legacy systems, Excel spreadsheets maintained by departments, or JSON outputs from web services.
Data warehouses that consolidate data from multiple operational databases for analytics.
Log files from web servers and applications.
ETL (Extract, Transform, Load)
The standard approach to data integration is the ETL pipeline:
Extract: Pull raw data from source systems (databases, APIs, files).
Transform: Clean, reformat, validate, and standardize the data. This includes type conversions, handling missing values, deduplication, and aligning schemas.
Load: Write the transformed data into a target system (data warehouse, data lake, or analytical database).
ELT vs. ETL
An increasingly popular alternative is ELT (Extract, Load, Transform), where raw data is loaded first into a data lake or cloud warehouse (e.g., BigQuery, Snowflake) and transformed later using SQL or processing frameworks. ELT is preferred when storage is cheap and transformation logic is complex or evolving.
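A compact sketch of the three ETL stages using pandas and SQLite follows; the file names, column names, and cleaning rules are hypothetical placeholders rather than a prescribed pipeline.

import sqlite3
import pandas as pd

# Extract: pull raw data from a source file (could equally be a database or API)
raw = pd.read_csv("raw_orders.csv")

# Transform: deduplicate, standardize types, and validate
transformed = raw.drop_duplicates()
transformed["order_date"] = pd.to_datetime(transformed["order_date"], errors="coerce")
transformed["amount"] = pd.to_numeric(transformed["amount"], errors="coerce")
transformed = transformed.dropna(subset=["customer_id", "order_date"])

# Load: write the cleaned data into an analytical target (here, a local SQLite database)
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("orders_clean", conn, if_exists="replace", index=False)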
Common Integration Methods in Python
pd.merge() — SQL-style joins on one or more keys (inner, left, right, outer joins).
pd.concat() — Stacking DataFrames vertically (row-wise) or horizontally (column-wise).
DataFrame.join() — Join on index by default; convenient for combining on row labels.
DataFrame.append() — Shortcut for concatenating rows (deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat()).
Challenges in Data Integration
Combining data from multiple sources introduces several challenges that must be addressed to ensure reliable analysis:
Format Disparities
Date formats: One system stores dates as "MM/DD/YYYY", another as "YYYY-MM-DD", and a third as Unix timestamps. Standardization is required before merging.
Encoding differences: CSV files might use UTF-8, while legacy systems use Latin-1. Python's pandas.read_csv(encoding='latin-1') handles this.
Data types: A "price" column might be a float in one source, a string with currency symbols in another ("$12.99"), and an integer (cents) in a third.
Naming conventions: The same field might be called customer_id, CustomerID, cust_no, or client_id across different systems.
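The format disparities above can be handled with a few standardization steps before merging; this is a minimal sketch, and the file names, encodings, and column names are hypothetical.

import pandas as pd

# Source A: UTF-8 CSV with ISO dates; Source B: Latin-1 CSV with US dates and currency strings
df_a = pd.read_csv("source_a.csv", encoding="utf-8")
df_b = pd.read_csv("source_b.csv", encoding="latin-1")

# Standardize date formats into a single datetime type
df_a["order_date"] = pd.to_datetime(df_a["order_date"], format="%Y-%m-%d")
df_b["order_date"] = pd.to_datetime(df_b["order_date"], format="%m/%d/%Y")

# Standardize a price column stored as "$12.99" strings into floats
df_b["price"] = (
    df_b["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Align naming conventions so both sources share the same key column
df_b = df_b.rename(columns={"CustomerID": "customer_id"})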
Alignment Issues
Schema mismatches: Two datasets about the same entity may have different columns. One customer table might include phone while another does not.
Granularity differences: One dataset has daily sales records while another has monthly aggregates. Alignment requires either aggregating the daily data or distributing the monthly data.
Duplicate records: The same entity may appear in multiple sources with slight variations (e.g., "John Smith" vs. "J. Smith" vs. "JOHN SMITH").
Temporal alignment: Data collected at different times may reflect different states of reality. A product catalog from January may not match orders from March if products were added or removed.
Key Resolution (Entity Matching)
When combining datasets, you need a reliable join key. Common strategies include:
Using unique identifiers (customer ID, product SKU, employee number).
Fuzzy matching on names or addresses when exact keys are unavailable.
Creating composite keys from multiple fields (e.g., first name + last name + date of birth).
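A small sketch of key resolution follows: normalizing name fields and building a composite key when no unique identifier is shared. The columns are hypothetical, and fuzzy matching would typically rely on a dedicated library such as rapidfuzz.

import pandas as pd

crm = pd.DataFrame({
    "first_name": [" John", "JANE"],
    "last_name": ["SMITH", "Doe "],
    "dob": ["1990-01-15", "1985-07-02"]
})

# Normalize text fields so "JOHN SMITH" and "John Smith" compare equal
for col in ["first_name", "last_name"]:
    crm[col] = crm[col].str.strip().str.lower()

# Composite key from multiple fields when no unique identifier exists
crm["match_key"] = crm["first_name"] + "|" + crm["last_name"] + "|" + crm["dob"]
print(crm["match_key"].tolist())  # ['john|smith|1990-01-15', 'jane|doe|1985-07-02']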
Data Consistency and Accuracy in Aggregated Datasets
After integrating data, you must verify that the combined dataset is consistent and accurate.
Consistency Checks
Row count verification: After a merge, check that the result has the expected number of rows. An unexpected increase suggests duplicate keys; an unexpected decrease suggests unmatched records.
Null analysis: Check for NaN values introduced by outer joins — these indicate records that existed in one source but not the other.
Cross-field validation: Ensure that related fields agree. For example, order_total should equal unit_price * quantity.
Referential integrity: Every foreign key in one dataset should have a corresponding primary key in the related dataset.
Accuracy Assurance
Source verification: Compare aggregated values against known totals from source systems.
Spot checks: Manually verify a random sample of records against the original sources.
Statistical profiling: Run summary statistics (min, max, mean, standard deviation) on key columns and compare with expectations.
Data lineage: Document where each piece of data came from and what transformations were applied, enabling traceability and debugging.
Best Practice
Always validate your merged dataset immediately after integration. Use df.info(), df.describe(), and df.isnull().sum() to quickly spot issues. Check the shape of the DataFrame before and after each merge operation.
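The checks above can be wrapped in a small helper; this is a sketch under the assumption of two DataFrames sharing a key column, not a required pattern.

import pandas as pd

def validate_merge(left, right, key, how="left"):
    """Merge two DataFrames and print basic consistency checks."""
    merged = pd.merge(left, right, on=key, how=how, indicator=True)

    print("Rows before merge:", len(left), "| after merge:", len(merged))
    print("Match breakdown:\n", merged["_merge"].value_counts())
    print("Nulls per column:\n", merged.isnull().sum())
    return merged.drop(columns="_merge")

pandas can also enforce the expected key relationship directly: passing validate="one_to_many" to pd.merge() raises an error if, for example, the left key is unexpectedly duplicated.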
Q1: A data analyst needs to combine customer records from a CRM database with order data from a separate transaction database. The datasets share a customer_id field. Which pandas method is most appropriate?
A) pd.merge()
B) pd.concat()
C) df.append()
D) df.update()
Correct: A) pd.merge() performs SQL-style joins on one or more common columns (keys). Since both datasets share customer_id, a merge (e.g., inner, left, or outer join) is the correct approach. pd.concat() is used for stacking DataFrames vertically or horizontally, not for key-based joins.
Q2: An ETL pipeline has three stages. Which of the following correctly describes the "Transform" stage?
A) Pulling raw data from source databases and APIs
B) Cleaning, reformatting, validating, and standardizing the extracted data
C) Writing the final data into the target data warehouse
D) Backing up the source systems before extraction
Correct: B) In an ETL pipeline: Extract pulls raw data from sources, Transform cleans, reformats, validates, and standardizes the data (handling missing values, type conversions, deduplication, schema alignment), and Load writes the processed data into the target system.
Q3: After performing a left merge of a customers DataFrame (500 rows) with an orders DataFrame on customer_id, the resulting DataFrame has 650 rows. What is the most likely explanation?
A) There are 150 customers with no orders
B) Some customers have multiple orders, creating duplicate customer rows
C) The merge operation added 150 empty rows
D) The orders DataFrame has 150 columns more than the customers DataFrame
Correct: B) In a left merge, every row from the left DataFrame is preserved. If a customer has 3 orders, that customer's row will appear 3 times in the result (once for each matching order). The increase from 500 to 650 rows means 150 additional rows were created by customers who had more than one order. This is a one-to-many relationship.
1.1.3 Data Storage Solutions
Choosing the right storage solution depends on data volume, access patterns, query complexity, budget, and team expertise. Data analysts must understand the options available and their trade-offs.
Data Warehouses vs. Data Lakes
Data Warehouses
A data warehouse is a centralized repository designed for query and analysis rather than transaction processing. Data is typically cleaned, transformed, and structured (schema-on-write) before being loaded.
Structure: Highly structured with predefined schemas (star schema, snowflake schema).
Data type: Primarily structured data (relational tables with defined columns and types).
Users: Business analysts, data analysts, BI teams who write SQL queries and build dashboards.
Examples: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse Analytics.
Strengths: Fast query performance, data quality enforcement, strong governance.
Weaknesses: Expensive to scale, inflexible schema changes, does not handle unstructured data well.
Data Lakes
A data lake is a centralized repository that stores data in its raw, native format. Data structure is applied at read time (schema-on-read) rather than write time.
Structure: Schema-on-read; data is stored as-is without predefined structure.
Data type: Structured, semi-structured (JSON, XML), and unstructured (images, videos, logs, PDFs).
Optimization: Optimized for storage volume and data variety rather than query speed.
Users: Data scientists, data engineers, ML engineers who need access to raw data.
Examples: Amazon S3 + Athena, Azure Data Lake Storage, Google Cloud Storage + BigQuery, Hadoop HDFS.
Strengths: Handles any data type, very cost-effective for large volumes, flexible.
Weaknesses: Can become a "data swamp" without governance, slower queries on unprocessed data, requires more technical skill.
Schema: warehouses use schema-on-write (predefined); lakes use schema-on-read (flexible).
Data Types: warehouses hold structured data only; lakes hold structured, semi-structured, and unstructured data.
Cost: warehouses are higher (compute and storage tightly coupled); lakes are lower (cheap object storage).
Query Speed: warehouses are fast (optimized indexes, partitions); lakes are slower unless data is processed.
Best For: warehouses suit BI dashboards, reporting, and SQL analytics; lakes suit ML, data exploration, and raw data archival.
Governance: warehouses have strong built-in governance; lakes require additional tooling.
Data Lakehouse
A modern hybrid approach called a data lakehouse combines the flexibility of a data lake with the performance and governance of a data warehouse. Technologies like Delta Lake, Apache Iceberg, and Apache Hudi enable ACID transactions and schema enforcement on top of data lake storage.
File-Based Storage (CSV, Excel, JSON)
For smaller datasets and quick data exchange, file-based formats remain ubiquitous. Every data analyst works with these formats daily.
CSV (Comma-Separated Values)
Plain text format where each line is a row and values are separated by commas (or other delimiters like tabs or semicolons).
Advantages: Universal compatibility, human-readable, small file size, supported by every programming language and data tool.
Limitations: No data type information (everything is text), no support for hierarchical data, no standard for encoding or escaping, large files can be slow to parse.
Python: pd.read_csv() and df.to_csv()
Excel (XLSX)
Microsoft's spreadsheet format supporting multiple sheets, formulas, formatting, and charts.
Advantages: Widely used in business, supports multiple sheets, preserves some data types and formatting.
Limitations: Binary format (not human-readable raw), row limit of ~1 million rows, larger file size than CSV, formulas can obscure actual data values.
Python: pd.read_excel() and df.to_excel() (requires openpyxl for .xlsx files, or xlrd for legacy .xls files)
JSON (JavaScript Object Notation)
A lightweight, human-readable format for representing structured and semi-structured data as key-value pairs and arrays.
Advantages: Supports nested/hierarchical data, widely used in APIs and web services, language-agnostic.
Limitations: Larger file size than CSV for tabular data, parsing nested JSON into flat DataFrames requires additional work, no native support for dates or binary data.
Python: pd.read_json(), json.load(), json.loads()
Other Formats
Parquet: Columnar storage format optimized for analytics. Very efficient compression and fast query performance. Standard in big data ecosystems. pd.read_parquet()
XML (eXtensible Markup Language): Hierarchical, tag-based format common in legacy systems and configuration files.
SQLite: A file-based relational database. Useful for local analytics without setting up a server. sqlite3 module in Python.
HDF5: Hierarchical Data Format for large numerical datasets, common in scientific computing.
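As a quick illustration of two of these formats, the sketch below round-trips a DataFrame through Parquet and SQLite; Parquet support assumes pyarrow (or fastparquet) is installed, and the file names are placeholders.

import sqlite3
import pandas as pd

df = pd.DataFrame({"product": ["A", "B"], "units": [120, 80]})

# Parquet: columnar, compressed, preserves dtypes (requires pyarrow or fastparquet)
df.to_parquet("sales.parquet")
df_back = pd.read_parquet("sales.parquet")

# SQLite: a serverless, file-based relational database
with sqlite3.connect("local_analytics.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)
    df_sql = pd.read_sql("SELECT * FROM sales", conn)

print(df_back.equals(df), len(df_sql))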
Cloud Storage Solutions
Cloud storage has transformed how organizations store and access data. Instead of maintaining physical servers, data is stored on remote infrastructure managed by cloud providers.
Major Cloud Storage Services
AWS: object storage is Amazon S3; data warehouse is Amazon Redshift; data lake is S3 + Athena / Lake Formation.
Google Cloud: object storage is Cloud Storage; data warehouse is BigQuery; data lake is Cloud Storage + BigQuery.
Microsoft Azure: object storage is Blob Storage; data warehouse is Azure Synapse Analytics; data lake is Azure Data Lake Storage.
Benefits of Cloud Storage
Scalability: Storage grows and shrinks on demand. No need to provision hardware in advance.
Cost efficiency: Pay only for what you use. Tiered pricing (hot, warm, cold, archive) lets you optimize costs based on access frequency.
Durability and availability: Major providers offer 99.999999999% (11 nines) durability through data replication across regions.
Accessibility: Data can be accessed from anywhere with an internet connection, enabling remote teams and global collaboration.
Integration: Cloud storage services integrate natively with analytics tools, ML platforms, and data pipelines.
Security: Encryption at rest and in transit, IAM (Identity and Access Management), audit logging.
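As a small illustration of the accessibility and integration points above, pandas can read directly from cloud object storage when the corresponding filesystem package is installed (s3fs for S3, gcsfs for Google Cloud Storage); the bucket and object names below are hypothetical.

import pandas as pd

# Requires the s3fs package and configured AWS credentials
df_s3 = pd.read_csv("s3://example-bucket/sales/2024/orders.csv")

# Requires the gcsfs package (and pyarrow for Parquet) plus Google Cloud credentials
df_gcs = pd.read_parquet("gs://example-bucket/events/clicks.parquet")

print(df_s3.shape, df_gcs.shape)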
Study Tip
For the exam, understand the conceptual differences between data warehouses and data lakes, and know that CSV and JSON are the most common file-based formats you will work with in Python. You are not expected to be a cloud architect, but you should understand the role of cloud storage in modern data workflows.
Q1: What is the key difference between a data warehouse and a data lake?
A) Data warehouses are cloud-based; data lakes are on-premises only
B) Data lakes can only store structured data
C) Data warehouses use schema-on-write; data lakes use schema-on-read
D) Data lakes require SQL for all queries
Correct: C) Data warehouses enforce structure at write time (schema-on-write) — data is cleaned and transformed before loading. Data lakes store raw data in any format and apply structure when reading (schema-on-read). Both can be cloud-based or on-premises. Data lakes store structured, semi-structured, and unstructured data.
Q2: When using pd.read_csv(), which parameter would you use to handle a file that uses semicolons instead of commas as delimiters?
A) delimiter=","
B) sep=";"
C) encoding="semicolon"
D) header=";"
Correct: B) The sep parameter (or its alias delimiter) specifies the character used to separate fields in the CSV file. By default it is a comma, but many European CSV files use semicolons. Use pd.read_csv("file.csv", sep=";") to handle this.
Python Code Examples
The following examples demonstrate practical implementations of the data collection, integration, and storage concepts discussed above.
Example 1: Web Scraping with BeautifulSoup
This example demonstrates how to fetch a web page and extract structured data from an HTML table.
# Web Scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send an HTTP GET request to the target URL
url = "https://example.com/data-table"
response = requests.get(url)

# Step 2: Check that the request was successful
if response.status_code == 200:
    # Step 3: Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Step 4: Find the target table
    table = soup.find("table", {"class": "data-table"})

    # Step 5: Extract headers
    headers = []
    for th in table.find_all("th"):
        headers.append(th.text.strip())

    # Step 6: Extract rows
    rows = []
    for tr in table.find_all("tr")[1:]:  # Skip header row
        cells = [td.text.strip() for td in tr.find_all("td")]
        rows.append(cells)

    # Step 7: Create a DataFrame
    df = pd.DataFrame(rows, columns=headers)
    print(df.head())

    # Step 8: Save to CSV for later use
    df.to_csv("scraped_data.csv", index=False)
else:
    print(f"Request failed with status code: {response.status_code}")
Note
Always check the website's robots.txt file and Terms of Service before scraping. Respect rate limits by adding delays (time.sleep()) between requests. Use response.raise_for_status() for cleaner error handling.
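The sketch below shows the polite-scraping pattern the note describes, with raise_for_status() for error handling and a delay between requests; the URLs are placeholders.

import time
import requests
from bs4 import BeautifulSoup

urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

titles = []
for url in urls:
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise an exception on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    titles.append(soup.title.text.strip() if soup.title else "")
    time.sleep(2)  # Pause between requests to respect rate limits

print(titles)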
Example 2: Reading from CSV and JSON Files
Reading data from common file formats into pandas DataFrames.
import pandas as pd
import json
# ---- Reading CSV ----
# Basic CSV reading
df_csv = pd.read_csv("sales_data.csv")

# CSV with specific options
df_csv = pd.read_csv(
    "sales_data.csv",
    sep=",",                       # Delimiter (default is comma)
    encoding="utf-8",              # File encoding
    parse_dates=["order_date"],    # Auto-parse date columns
    dtype={"zip_code": str},       # Force column types
    na_values=["N/A", "missing"],  # Custom NA markers
    nrows=1000                     # Read only first 1000 rows
)
print("CSV Shape:", df_csv.shape)
print(df_csv.dtypes)
print(df_csv.head())

# ---- Reading JSON ----
# Flat JSON (array of records)
df_json = pd.read_json("products.json")

# Nested JSON requires the json module + pd.json_normalize
with open("nested_data.json", "r") as f:
    raw_data = json.load(f)

# Flatten nested structure
df_nested = pd.json_normalize(
    raw_data["results"],
    record_path=["items"],          # Path to nested records
    meta=["order_id", "customer"],  # Include parent fields
    sep="_"                         # Separator for nested keys
)
print("JSON Shape:", df_nested.shape)
print(df_nested.head())

# ---- Reading Excel ----
df_excel = pd.read_excel(
    "report.xlsx",
    sheet_name="Q1_Sales",  # Specific sheet
    skiprows=2,             # Skip header rows
    usecols="A:F"           # Read only columns A through F
)
print("Excel Shape:", df_excel.shape)
Example 3: Combining DataFrames from Multiple Sources
Merging, concatenating, and joining DataFrames to create unified datasets.
import pandas as pd
# ---- Create sample DataFrames ----
# Customer data from the CRM system
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "region": ["East", "West", "East", "South", "West"]
})

# Order data from the transaction database
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "customer_id": [101, 102, 101, 103, 106],
    "amount": [250.00, 150.75, 89.99, 320.50, 175.00]
})

# ---- MERGE: SQL-style joins ----
# Inner join: only matching customer_ids
inner = pd.merge(customers, orders, on="customer_id", how="inner")
print("Inner join shape:", inner.shape)  # 4 rows (customer 106 excluded)

# Left join: all customers, with orders where available
left = pd.merge(customers, orders, on="customer_id", how="left")
print("Left join shape:", left.shape)  # 6 rows (Alice appears twice; Diana and Eve have NaN order fields)

# Outer join: all records from both tables
outer = pd.merge(customers, orders, on="customer_id", how="outer")
print("Outer join shape:", outer.shape)  # 7 rows

# ---- CONCAT: Stacking DataFrames ----
# Data from different time periods with the same schema
q1_sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "revenue": [45000, 52000, 48000]
})
q2_sales = pd.DataFrame({
    "month": ["Apr", "May", "Jun"],
    "revenue": [51000, 49000, 55000]
})

# Vertical concatenation (stacking rows)
all_sales = pd.concat([q1_sales, q2_sales], ignore_index=True)
print(all_sales)

# ---- Verify integration quality ----
print("\n--- Integration Quality Checks ---")
print("Null values:\n", outer.isnull().sum())
print("Duplicate rows:", outer.duplicated().sum())
print("Shape:", outer.shape)
Example 4: API Data Retrieval with Requests
Fetching data from a REST API, handling pagination, and converting the response to a DataFrame.
import requests
import pandas as pd
# ---- Basic API Request ----
url = "https://api.example.com/v1/employees"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
params = {
    "department": "engineering",
    "limit": 100
}

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()  # Raise exception for HTTP errors

# Parse JSON response
data = response.json()
df = pd.DataFrame(data["employees"])
print(df.head())

# ---- Handling Paginated API Responses ----
def fetch_all_pages(base_url, headers):
    """Fetch all pages from a paginated API endpoint."""
    all_records = []
    page = 1
    while True:
        params = {"page": page, "per_page": 50}
        response = requests.get(base_url, headers=headers, params=params)
        response.raise_for_status()
        data = response.json()
        records = data.get("results", [])
        if not records:
            break  # No more data
        all_records.extend(records)
        print(f"Fetched page {page} - {len(records)} records")
        # Check if there are more pages
        if page >= data.get("total_pages", page):
            break
        page += 1
    return pd.DataFrame(all_records)

# Usage
api_url = "https://api.example.com/v1/transactions"
df_api = fetch_all_pages(api_url, headers)
print(f"Total records fetched: {len(df_api)}")

# ---- Error Handling Pattern ----
try:
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.Timeout:
    print("Request timed out. Try again later.")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error occurred: {e}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
Study Tip
Know the difference between requests.get() and requests.post(). GET is used to retrieve data (read-only), while POST is used to send data to a server (create/update). For the exam, focus on GET requests for data retrieval. Always use response.raise_for_status() or check response.status_code for error handling.
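A compact illustration of the GET vs. POST distinction mentioned above; the endpoint and payload are hypothetical.

import requests

base_url = "https://api.example.com/v1/records"

# GET: retrieve data (read-only); query parameters travel in the URL
read_resp = requests.get(base_url, params={"limit": 10}, timeout=30)
read_resp.raise_for_status()

# POST: send data to the server to create or update a resource; the payload travels in the request body
create_resp = requests.post(base_url, json={"name": "New record"}, timeout=30)
create_resp.raise_for_status()

print(read_resp.status_code, create_resp.status_code)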
Practice Quiz: Data Collection, Integration, and Storage
Test your understanding of the concepts covered in this topic. Click on the option you believe is correct.
Q1: Which data collection method is most appropriate for gathering in-depth, nuanced understanding of customer experiences?
A) Online survey with closed-ended questions
B) Semi-structured interviews
C) Web scraping of review websites
D) Automated log file analysis
Correct: B) Semi-structured interviews allow the interviewer to explore topics deeply with follow-up questions while maintaining a consistent framework. They produce rich qualitative data that captures nuances, emotions, and context that closed-ended surveys and automated methods cannot provide.
Q2: Under GDPR, which of the following is NOT a right granted to data subjects?
A) Right to access their personal data
B) Right to erasure (right to be forgotten)
C) Right to sell their personal data to third parties
D) Right to data portability
Correct: C) GDPR grants individuals the right to access, rectify, erase, restrict processing, data portability, and object to processing. It does NOT grant individuals the right to sell their own data. GDPR is about protecting personal data, not commercializing it.
Q3: A data analyst needs to combine customer records from a CRM database with order data from a separate transaction database. The datasets share a customer_id field. Which pandas method is most appropriate?
A) pd.merge()
B) pd.concat()
C) df.append()
D) df.update()
Correct: A) pd.merge() performs SQL-style joins on one or more common columns (keys). Since both datasets share customer_id, a merge (e.g., inner, left, or outer join) is the correct approach. pd.concat() is used for stacking DataFrames vertically or horizontally, not for key-based joins.
Q4: What is the key difference between a data warehouse and a data lake?
A) Data warehouses are cloud-based; data lakes are on-premises only
B) Data lakes can only store structured data
C) Data warehouses use schema-on-write; data lakes use schema-on-read
D) Data lakes require SQL for all queries
Correct: C) Data warehouses enforce structure at write time (schema-on-write) — data is cleaned and transformed before loading. Data lakes store raw data in any format and apply structure when reading (schema-on-read). Both can be cloud-based or on-premises. Data lakes store structured, semi-structured, and unstructured data.
Q5: Which Python library is used to parse HTML content when web scraping?
A) requests
B) BeautifulSoup
C) pandas
D) json
Correct: B) BeautifulSoup (from the bs4 package) is used to parse and navigate HTML/XML documents. The requests library is used to fetch the web page (HTTP request), but it does not parse the HTML structure. After fetching with requests, you pass the content to BeautifulSoup for parsing.
Q6: A researcher collects data only from people who voluntarily visit a shopping mall on a Saturday. This is an example of:
A) Simple random sampling
B) Stratified sampling
C) Convenience sampling
D) Cluster sampling
Correct: C) Convenience sampling selects participants based on ease of access rather than a systematic or random method. Surveying only mall visitors on a Saturday excludes people who do not visit malls, work on Saturdays, or live far away — making the sample non-representative of the general population.
Q7: Which of the following is an example of pseudonymization rather than true anonymization?
A) Removing all names and replacing them with nothing
B) Replacing patient names with coded IDs while keeping a separate mapping file
C) Reporting only aggregate statistics (averages, totals) without individual records
D) Generalizing exact ages into 10-year age brackets
Correct: B) Pseudonymization replaces identifiers with pseudonyms (coded IDs) while maintaining a separate key that can re-link the data to individuals. This makes re-identification possible with the key. True anonymization is irreversible — there is no way to trace data back to an individual. Under GDPR, pseudonymized data is still considered personal data.
Q8: When using pd.read_csv(), which parameter would you use to handle a file that uses semicolons instead of commas as delimiters?
A) delimiter=","
B) sep=";"
C) encoding="semicolon"
D) header=";"
Correct: B) The sep parameter (or its alias delimiter) specifies the character used to separate fields in the CSV file. By default it is a comma, but many European CSV files use semicolons. Use pd.read_csv("file.csv", sep=";") to handle this.
Q9: An ETL pipeline has three stages. Which of the following correctly describes the "Transform" stage?
A) Pulling raw data from source databases and APIs
B) Cleaning, reformatting, validating, and standardizing the extracted data
C) Writing the final data into the target data warehouse
D) Backing up the source systems before extraction
Correct: B) In an ETL pipeline: Extract pulls raw data from sources, Transform cleans, reformats, validates, and standardizes the data (handling missing values, type conversions, deduplication, schema alignment), and Load writes the processed data into the target system.
Q10: After performing a left merge of a customers DataFrame (500 rows) with an orders DataFrame on customer_id, the resulting DataFrame has 650 rows. What is the most likely explanation?
A) There are 150 customers with no orders
B) Some customers have multiple orders, creating duplicate customer rows
C) The merge operation added 150 empty rows
D) The orders DataFrame has 150 columns more than the customers DataFrame
Correct: B) In a left merge, every row from the left DataFrame is preserved. If a customer has 3 orders, that customer's row will appear 3 times in the result (once for each matching order). The increase from 500 to 650 rows means 150 additional rows were created by customers who had more than one order. This is a one-to-many relationship.