
Data Collection, Integration, and Storage

Block 1: Data Acquisition and Pre-Processing

Topic 1.1 · 3 Objectives

1.1.1 Data Collection Methods

Data collection is the systematic process of gathering information from relevant sources to answer research questions, test hypotheses, or inform business decisions. The quality of any data analysis is fundamentally limited by the quality of the data collected. As the saying goes: "Garbage in, garbage out."

Data analysts must understand the full spectrum of collection methods, their trade-offs, and the ethical boundaries that govern them. In the exam, you are expected to know the differences between primary and secondary data collection, the advantages and limitations of each method, and how the choice of method affects downstream analysis.

Surveys, Interviews, and Web Scraping

Surveys

Surveys are one of the most widely used data collection instruments. They allow researchers to gather data from a large number of respondents efficiently. Surveys can be administered online (Google Forms, SurveyMonkey), by phone, by mail, or in person.

  • Closed-ended questions: Provide predefined answer options (e.g., Likert scales, multiple choice). Easy to quantify and analyze statistically.
  • Open-ended questions: Allow respondents to answer in their own words. Produce qualitative data that requires text analysis or manual coding.
  • Advantages: Cost-effective for large samples, standardized responses, easy to analyze quantitatively.
  • Limitations: Response bias, low response rates, limited depth of understanding, cannot capture context or body language.
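
The point about closed-ended questions being easy to quantify can be sketched with pandas: map each Likert label to a numeric score, then summarize. The column name, labels, and responses below are invented for illustration:

```python
import pandas as pd

# Hypothetical responses to one 5-point Likert question
responses = pd.DataFrame({
    "q1_satisfaction": ["Agree", "Strongly Agree", "Neutral",
                        "Disagree", "Agree", "Strongly Disagree"]
})

# Map labels to numeric scores so standard statistics apply
likert_map = {
    "Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
    "Agree": 4, "Strongly Agree": 5,
}
responses["q1_score"] = responses["q1_satisfaction"].map(likert_map)

print("Mean score:", responses["q1_score"].mean())
print(responses["q1_score"].value_counts().sort_index())
```

Once scored, the column supports means, cross-tabulations, and hypothesis tests; open-ended answers would instead require coding or text analysis.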

Interviews

Interviews involve direct conversation between the researcher and the respondent. They provide richer, more nuanced data than surveys but are more time-consuming and expensive.

  • Structured interviews: Follow a rigid script of predetermined questions asked in a fixed order. Produces consistent, comparable data across respondents.
  • Semi-structured interviews: Have a guide of key topics but allow the interviewer to probe deeper based on responses. Balance between consistency and flexibility.
  • Unstructured interviews: Open-ended conversations with minimal predefined questions. Best for exploratory research where the topic is not well understood.
  • Advantages: Rich qualitative data, ability to clarify ambiguous responses, captures non-verbal cues (in person).
  • Limitations: Time-intensive, expensive, potential interviewer bias, difficult to scale, harder to quantify.

Web Scraping

Web scraping is the automated extraction of data from websites. It is particularly useful when data is publicly available on the web but not provided through a formal API or downloadable dataset.

  • Common tools: Python's BeautifulSoup (for parsing HTML), requests (for making HTTP requests), Scrapy (a full scraping framework), and Selenium (for JavaScript-rendered pages).
  • Typical use cases: Extracting product prices from e-commerce sites, gathering news articles, collecting job listings, monitoring social media mentions.
  • Advantages: Access to large amounts of publicly available data, automatable, can be scheduled to run periodically.
  • Limitations: Website structure changes can break scrapers, legal and ethical concerns (Terms of Service), rate limiting, potential for IP blocking.

Study Tip

This course expects you to know that BeautifulSoup parses HTML/XML content and that the requests library is used to fetch web pages. Know the difference between scraping and using an API.

Representative Sampling and Challenges in Data Collection

A sample is a subset of a larger population that is selected for study. The goal of sampling is to obtain a group that accurately represents the population so that findings can be generalized.

Types of Sampling

Sampling Method | Description | Pros | Cons
Simple Random | Every member of the population has an equal chance of being selected | Eliminates selection bias | Requires a complete list of the population
Stratified | Population is divided into subgroups (strata), and random samples are drawn from each stratum | Ensures representation of all subgroups | Requires knowledge of population characteristics
Cluster | Population is divided into clusters (often geographic), and entire clusters are randomly selected | Practical for geographically dispersed populations | Higher sampling error than simple random
Convenience | Selecting individuals who are easiest to reach | Fast and inexpensive | High risk of bias; results not generalizable
Systematic | Every k-th individual is selected from an ordered list | Simple to implement | Can introduce bias if there is a pattern in the list
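
Simple random, stratified, and systematic sampling can each be sketched in a few lines of pandas. The population, column names, and sample sizes below are toy values for illustration:

```python
import pandas as pd

# Toy population of 12 people across three regions (strata)
population = pd.DataFrame({
    "person_id": range(1, 13),
    "region": ["East"] * 6 + ["West"] * 4 + ["South"] * 2,
})

# Simple random sample: every member has an equal chance of selection
simple = population.sample(n=4, random_state=42)

# Stratified sample: draw the same fraction from each region
stratified = population.groupby("region").sample(frac=0.5, random_state=42)

# Systematic sample: every k-th individual from the ordered list
k = 3
systematic = population.iloc[::k]

print(len(simple), len(stratified), len(systematic))
```

Note how the stratified draw guarantees each region is represented in proportion to its size, which a simple random sample of 4 cannot promise.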

Common Challenges in Data Collection

  • Non-response bias: When the people who do not respond differ systematically from those who do. For example, a health survey might under-represent very busy or very ill individuals.
  • Selection bias: When the sample does not represent the population. For instance, conducting an online survey excludes people without internet access.
  • Measurement error: Inaccurate data due to faulty instruments, poorly worded questions, or respondent misunderstanding.
  • Social desirability bias: Respondents give answers they think are socially acceptable rather than truthful.
  • Missing data: Incomplete responses that can skew results if not handled properly (deletion, imputation, etc.).
  • Data decay: Information becomes outdated over time, reducing its relevance and accuracy.
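
The two standard remedies for missing data mentioned above, deletion and imputation, can be sketched with pandas; the DataFrame and values are invented:

```python
import pandas as pd

# Toy survey data with gaps
df = pd.DataFrame({
    "age": [34, None, 29, 41, None],
    "income": [52000, 48000, None, 61000, 45000],
})

# Deletion: drop any row with a missing value (loses information)
complete_cases = df.dropna()

# Imputation: fill gaps with a column statistic (here, the mean)
imputed = df.fillna({"age": df["age"].mean(), "income": df["income"].mean()})

print("Rows after deletion:", len(complete_cases))
print("Missing values after imputation:", imputed.isnull().sum().sum())
```

Deletion can skew results if missingness is not random (e.g., busy respondents skipping questions), while imputation preserves sample size at the cost of understating variability.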

Common Pitfall

Convenience sampling is the most common form of biased sampling. While it is fast, the results cannot be generalized to the broader population. For exam questions, always consider whether the sampling method introduces bias.

Qualitative vs. Quantitative Research

Understanding the distinction between qualitative and quantitative research is essential for choosing the right data collection approach.

Aspect | Quantitative Research | Qualitative Research
Data Type | Numerical data (counts, measurements, ratings) | Text, images, audio, video, observations
Goal | Test hypotheses, measure relationships, generalize findings | Explore phenomena, understand experiences, generate theories
Sample Size | Large (statistical significance requires numbers) | Small (depth over breadth)
Analysis | Statistical methods (mean, median, regression, t-tests) | Thematic analysis, coding, content analysis
Collection Methods | Surveys with closed-ended questions, sensors, logs | Interviews, focus groups, observations, open-ended surveys
Output | Numbers, charts, statistical models | Themes, narratives, categories
Objectivity | Strives for objectivity and reproducibility | Acknowledges subjectivity; researcher is part of the process

Note

Many real-world data analysis projects use a mixed-methods approach that combines quantitative data (for statistical rigor) with qualitative data (for contextual understanding). For example, a customer satisfaction study might use survey scores (quantitative) supplemented by interview transcripts (qualitative).

Data Anonymization and Personally Identifiable Information (PII)

Personally Identifiable Information (PII) is any data that can be used to identify a specific individual, either on its own or when combined with other data.

Examples of PII

  • Direct identifiers: Full name, Social Security number, email address, phone number, passport number, driver's license number.
  • Indirect identifiers (quasi-identifiers): Date of birth, ZIP code, gender, ethnicity, job title. Individually these may not identify a person, but in combination they often can.

Anonymization Techniques

Technique | Description | Example
Data Masking | Replace sensitive data with fictional but realistic data | "John Smith" becomes "User_4821"
Generalization | Replace specific values with broader categories | Exact age "34" becomes age range "30-39"
Pseudonymization | Replace identifiers with pseudonyms; a mapping key exists but is stored separately | Patient "A. Jones" becomes "P-00742"
Data Swapping | Shuffle values between records so individual links are broken | Swap ZIP codes between records in a dataset
Aggregation | Report data at group level rather than individual level | Report average salary by department instead of individual salaries
K-anonymity | Ensure each record is indistinguishable from at least k-1 other records | Each combination of quasi-identifiers appears at least k times
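
Three of these techniques, pseudonymization, generalization, and aggregation, can be sketched in pandas. The patient records, ID scheme, and age brackets below are invented:

```python
import pandas as pd

patients = pd.DataFrame({
    "name": ["A. Jones", "B. Patel", "C. Liu"],
    "age": [34, 58, 47],
    "zip": ["10001", "94105", "60601"],
})

# Pseudonymization: replace names with coded IDs; the mapping key
# must be stored separately from the released data
key = {name: f"P-{i:05d}" for i, name in enumerate(patients["name"], start=1)}
patients["patient_id"] = patients["name"].map(key)

# Generalization: replace exact ages with 10-year brackets
patients["age_band"] = pd.cut(
    patients["age"],
    bins=[20, 30, 40, 50, 60],
    labels=["20-29", "30-39", "40-49", "50-59"],
)

# Aggregation: report group-level counts instead of individual rows
by_band = patients.groupby("age_band", observed=True).size()

# Drop the direct identifier before sharing
released = patients.drop(columns=["name"])
print(released)
print(by_band)
```

Because the `key` dictionary can reverse the pseudonyms, this output is pseudonymized, not anonymized, and under GDPR it remains personal data.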

Key Distinction

Anonymization is irreversible — you cannot re-identify individuals. Pseudonymization is reversible if you have the mapping key. GDPR considers pseudonymized data as still personal data, while truly anonymized data falls outside GDPR's scope.

Impact on Business Strategy and Decision-Making

Data collection is not an academic exercise — it directly drives business outcomes. The methods and quality of data collection shape the insights that analysts can extract, which in turn inform strategic decisions.

How Data Collection Impacts Business

  • Product development: Customer surveys and usage data reveal which features users want, guiding product roadmaps.
  • Market research: Collecting competitive intelligence and market trends helps organizations identify opportunities and threats.
  • Customer segmentation: Behavioral data enables businesses to group customers by purchasing patterns, demographics, or preferences for targeted marketing.
  • Risk assessment: Financial institutions collect transaction data to detect fraud and assess credit risk.
  • Operational efficiency: IoT sensors and log data help optimize supply chains, manufacturing processes, and resource allocation.
  • Compliance and reporting: Regulatory requirements often mandate specific data collection practices, especially in healthcare, finance, and education.

Data-Driven vs. Data-Informed Decision-Making

  • Data-driven: Decisions are made primarily based on data analysis, with minimal reliance on intuition. Suitable for repeatable, well-defined problems.
  • Data-informed: Data is one input among many (including domain expertise, stakeholder input, and strategic context). More appropriate for complex, ambiguous situations.

Key Point

Poor data collection methods lead to poor business decisions. If a survey has selection bias, the product features built based on its results may not serve the actual target audience. Always assess data quality before drawing conclusions.

Survey Design, Audience Selection, and Structured Interviews

Designing an effective data collection instrument requires careful planning. A poorly designed survey or interview can produce misleading results regardless of sample size.

Principles of Good Survey Design

  1. Define objectives clearly: What specific questions does the survey need to answer? Every question should map to a research objective.
  2. Keep it concise: Long surveys lead to respondent fatigue and lower completion rates. Aim for under 15 minutes.
  3. Use clear, unambiguous language: Avoid jargon, double-barreled questions ("Do you like the price and quality?"), and leading questions ("Don't you agree that...?").
  4. Order questions logically: Start with easy, non-sensitive questions. Group related questions together. Place demographic questions at the end.
  5. Provide balanced response options: Likert scales should have an equal number of positive and negative options (e.g., "Strongly Agree" to "Strongly Disagree").
  6. Pilot test: Run the survey with a small group first to identify confusing questions, technical issues, or unexpected response patterns.
  7. Include a mix of question types: Use closed-ended questions for quantitative analysis and a few open-ended questions for deeper insights.

Audience Selection

  • Target population: Define who you want to study (e.g., "customers who purchased in the last 90 days").
  • Sampling frame: The list from which you will draw your sample (e.g., a customer database, email list).
  • Inclusion/exclusion criteria: Set clear rules for who qualifies (e.g., age 18+, English-speaking, active users).
  • Sample size considerations: Larger samples reduce sampling error but cost more. Statistical power analysis can determine the minimum needed sample size.
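
The sample-size trade-off in the last bullet can be made concrete with the standard formula for estimating a proportion, n = z² · p(1−p) / e², where z is the confidence-level z-score, p the expected proportion, and e the margin of error. A minimal sketch using only the standard library:

```python
import math

def sample_size_for_proportion(margin_of_error, confidence_z=1.96, p=0.5):
    """Minimum sample size to estimate a proportion within a given
    margin of error; p=0.5 gives the most conservative (largest) n."""
    n = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)

# 95% confidence (z = 1.96), +/-5% margin of error
print(sample_size_for_proportion(0.05))   # -> 385

# Halving the margin of error roughly quadruples the required sample
print(sample_size_for_proportion(0.03))   # -> 1068
```

This is why many published polls report roughly 1,000 respondents with a ±3% margin of error: the cost of precision grows quadratically.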

Structured Interview Design

  • Prepare a fixed set of questions in advance, asked in the same order to every participant.
  • Use standardized probes (follow-up questions) to ensure consistency.
  • Train interviewers to minimize interviewer bias (tone, body language, interpretation).
  • Record interviews (with consent) for accurate transcription and analysis.
  • Define a coding scheme before analysis to categorize responses systematically.
Q1: Which data collection method is most appropriate for gathering in-depth, nuanced understanding of customer experiences?
A) Online survey with closed-ended questions
B) Semi-structured interviews
C) Web scraping of review websites
D) Automated log file analysis
Correct: B) Semi-structured interviews allow the interviewer to explore topics deeply with follow-up questions while maintaining a consistent framework. They produce rich qualitative data that captures nuances, emotions, and context that closed-ended surveys and automated methods cannot provide.
Q2: Under GDPR, which of the following is NOT a right granted to data subjects?
A) Right to access their personal data
B) Right to erasure (right to be forgotten)
C) Right to sell their personal data to third parties
D) Right to data portability
Correct: C) GDPR grants individuals the right to access, rectify, erase, restrict processing, data portability, and object to processing. It does NOT grant individuals the right to sell their own data. GDPR is about protecting personal data, not commercializing it.
Q3: Which Python library is used to parse HTML content when web scraping?
A) requests
B) BeautifulSoup
C) pandas
D) json
Correct: B) BeautifulSoup (from the bs4 package) is used to parse and navigate HTML/XML documents. The requests library is used to fetch the web page (HTTP request), but it does not parse the HTML structure. After fetching with requests, you pass the content to BeautifulSoup for parsing.
Q4: A researcher collects data only from people who voluntarily visit a shopping mall on a Saturday. This is an example of:
A) Simple random sampling
B) Stratified sampling
C) Convenience sampling
D) Cluster sampling
Correct: C) Convenience sampling selects participants based on ease of access rather than a systematic or random method. Surveying only mall visitors on a Saturday excludes people who do not visit malls, work on Saturdays, or live far away — making the sample non-representative of the general population.
Q5: Which of the following is an example of pseudonymization rather than true anonymization?
A) Removing all names and replacing them with nothing
B) Replacing patient names with coded IDs while keeping a separate mapping file
C) Reporting only aggregate statistics (averages, totals) without individual records
D) Generalizing exact ages into 10-year age brackets
Correct: B) Pseudonymization replaces identifiers with pseudonyms (coded IDs) while maintaining a separate key that can re-link the data to individuals. This makes re-identification possible with the key. True anonymization is irreversible — there is no way to trace data back to an individual. Under GDPR, pseudonymized data is still considered personal data.

1.1.2 Data Aggregation and Integration

In practice, data rarely lives in a single source. Data analysts must combine information from databases, APIs, spreadsheets, web pages, and other sources into a unified dataset for analysis. This process is called data integration, and the act of summarizing or combining records is called data aggregation.

Combining Data from Multiple Sources

Organizations typically store data across many systems. A retail company might have:

  • Relational databases (e.g., PostgreSQL, MySQL) for transactional records (orders, inventory).
  • APIs for real-time data from third-party services (payment processors, shipping providers, social media platforms).
  • File-based storage like CSV exports from legacy systems, Excel spreadsheets maintained by departments, or JSON outputs from web services.
  • Data warehouses that consolidate data from multiple operational databases for analytics.
  • Log files from web servers and applications.

ETL (Extract, Transform, Load)

The standard approach to data integration is the ETL pipeline:

  1. Extract: Pull raw data from source systems (databases, APIs, files).
  2. Transform: Clean, reformat, validate, and standardize the data. This includes type conversions, handling missing values, deduplication, and aligning schemas.
  3. Load: Write the transformed data into a target system (data warehouse, data lake, or analytical database).
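
The three stages can be sketched end to end in a few lines. This is a toy pipeline under invented data: the "source" is a simulated CSV export, and the "target" is an in-memory SQLite database standing in for a warehouse:

```python
import io
import sqlite3
import pandas as pd

# Extract: pull raw data from a source system (a CSV export, simulated here)
raw_csv = io.StringIO(
    "order_id,price,order_date\n"
    "1,$12.99,2024-01-05\n"
    "2,$7.50,2024-01-06\n"
    "2,$7.50,2024-01-06\n"   # duplicate record from the source system
)
df = pd.read_csv(raw_csv)

# Transform: standardize types, parse dates, deduplicate
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)
df["order_date"] = pd.to_datetime(df["order_date"])
df = df.drop_duplicates()

# Load: write the cleaned data into the target store
conn = sqlite3.connect(":memory:")
df.to_sql("orders", conn, index=False, if_exists="replace")

loaded = pd.read_sql_query("SELECT COUNT(*) AS n FROM orders", conn)
print("Rows loaded:", loaded["n"][0])
conn.close()
```

In an ELT variant, the raw CSV would be loaded first and the type conversions and deduplication would run as SQL inside the warehouse.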

ELT vs. ETL

An increasingly popular alternative is ELT (Extract, Load, Transform), where raw data is loaded first into a data lake or cloud warehouse (e.g., BigQuery, Snowflake) and transformed later using SQL or processing frameworks. ELT is preferred when storage is cheap and transformation logic is complex or evolving.

Common Integration Methods in Python

  • pd.merge() — SQL-style joins on one or more keys (inner, left, right, outer joins).
  • pd.concat() — Stacking DataFrames vertically (row-wise) or horizontally (column-wise).
  • DataFrame.join() — Join on index by default; convenient for combining on row labels.
  • DataFrame.append() — Shortcut for concatenating rows (deprecated in newer pandas; use pd.concat()).

Challenges in Data Integration

Combining data from multiple sources introduces several challenges that must be addressed to ensure reliable analysis:

Format Disparities

  • Date formats: One system stores dates as "MM/DD/YYYY", another as "YYYY-MM-DD", and a third as Unix timestamps. Standardization is required before merging.
  • Encoding differences: CSV files might use UTF-8, while legacy systems use Latin-1. Python's pandas.read_csv(encoding='latin-1') handles this.
  • Data types: A "price" column might be a float in one source, a string with currency symbols in another ("$12.99"), and an integer (cents) in a third.
  • Naming conventions: The same field might be called customer_id, CustomerID, cust_no, or client_id across different systems.
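
All four disparities above can be resolved with a few pandas operations before combining sources. The two source DataFrames and their field names are invented for illustration:

```python
import pandas as pd

# Two sources storing the same fields in different formats
source_a = pd.DataFrame({
    "CustomerID": [1, 2],
    "signup": ["03/15/2024", "07/01/2024"],   # MM/DD/YYYY
    "price": [12.99, 8.50],                   # float dollars
})
source_b = pd.DataFrame({
    "cust_no": [3, 4],
    "signup": ["2024-02-10", "2024-05-20"],   # YYYY-MM-DD
    "price": ["$5.25", "$19.00"],             # string with currency symbol
})

# Align naming conventions on a single standard
source_a = source_a.rename(columns={"CustomerID": "customer_id"})
source_b = source_b.rename(columns={"cust_no": "customer_id"})

# Standardize date formats before merging
source_a["signup"] = pd.to_datetime(source_a["signup"], format="%m/%d/%Y")
source_b["signup"] = pd.to_datetime(source_b["signup"], format="%Y-%m-%d")

# Standardize types: strip currency symbols, convert to float
source_b["price"] = source_b["price"].str.replace("$", "", regex=False).astype(float)

combined = pd.concat([source_a, source_b], ignore_index=True)
print(combined.dtypes)
```

Encoding differences are handled at read time instead, e.g. `pd.read_csv("legacy.csv", encoding="latin-1")` as noted above.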

Alignment Issues

  • Schema mismatches: Two datasets about the same entity may have different columns. One customer table might include phone while another does not.
  • Granularity differences: One dataset has daily sales records while another has monthly aggregates. Alignment requires either aggregating the daily data or distributing the monthly data.
  • Duplicate records: The same entity may appear in multiple sources with slight variations (e.g., "John Smith" vs. "J. Smith" vs. "JOHN SMITH").
  • Temporal alignment: Data collected at different times may reflect different states of reality. A product catalog from January may not match orders from March if products were added or removed.
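
Two of these issues, granularity differences and duplicate records with formatting variations, can be sketched in pandas; the sales figures and names below are invented:

```python
import pandas as pd

# Granularity: align daily records with monthly data by aggregating up
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "sales": [100.0] * 60,
})
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)

# Duplicates: normalize case and whitespace before deduplicating
names = pd.DataFrame({"name": ["John Smith", "JOHN SMITH", " john smith "]})
names["normalized"] = names["name"].str.strip().str.title()
deduped = names.drop_duplicates(subset="normalized")
print("Distinct entities:", len(deduped))
```

Aggregating the fine-grained data is usually safer than distributing the coarse data, since the latter requires assumptions about how the monthly total was spread across days.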

Key Resolution (Entity Matching)

When combining datasets, you need a reliable join key. Common strategies include:

  • Using unique identifiers (customer ID, product SKU, employee number).
  • Fuzzy matching on names or addresses when exact keys are unavailable.
  • Creating composite keys from multiple fields (e.g., first name + last name + date of birth).
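
When no shared identifier exists, the standard library's `difflib` offers a simple form of fuzzy matching. A sketch with invented names (production systems typically use dedicated libraries and more robust similarity measures):

```python
import difflib

# Canonical customer names from the master system
master = ["John Smith", "Jane Doe", "Carlos Ruiz"]

# Incoming records with no shared ID and messy name variants
incoming = ["J. Smith", "jane doe", "Carlos Ruíz"]

# Map each incoming name to its closest canonical match above a
# similarity cutoff; names below the cutoff stay unmatched for review
for name in incoming:
    matches = difflib.get_close_matches(name.title(), master, n=1, cutoff=0.6)
    print(name, "->", matches[0] if matches else "NO MATCH")
```

The `cutoff` controls the precision/recall trade-off: raising it reduces false merges but leaves more records for manual review, which is often acceptable for entity matching.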

Data Consistency and Accuracy in Aggregated Datasets

After integrating data, you must verify that the combined dataset is consistent and accurate.

Consistency Checks

  • Row count verification: After a merge, check that the result has the expected number of rows. An unexpected increase suggests duplicate keys; an unexpected decrease suggests unmatched records.
  • Null analysis: Check for NaN values introduced by outer joins — these indicate records that existed in one source but not the other.
  • Cross-field validation: Ensure that related fields agree. For example, order_total should equal unit_price * quantity.
  • Referential integrity: Every foreign key in one dataset should have a corresponding primary key in the related dataset.
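
These checks translate directly into a few lines of pandas run after every merge. The merged table and master list below are invented; the orphaned `customer_id` 999 is planted deliberately:

```python
import pandas as pd

merged = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 999],   # 999 has no master record
    "unit_price": [10.0, 4.0, 2.5],
    "quantity": [2, 3, 4],
    "order_total": [20.0, 12.0, 10.0],
})
customers = pd.DataFrame({"customer_id": [101, 102, 103]})

# Row count verification: compare against the expected shape
assert len(merged) == 3, "unexpected row count after merge"

# Null analysis: outer joins introduce NaN for unmatched records
print("Nulls per column:\n", merged.isnull().sum())

# Cross-field validation: order_total should equal unit_price * quantity
mismatch = merged[merged["order_total"] != merged["unit_price"] * merged["quantity"]]
print("Cross-field mismatches:", len(mismatch))

# Referential integrity: every foreign key must exist in the master table
orphans = merged[~merged["customer_id"].isin(customers["customer_id"])]
print("Orphaned customer_ids:", orphans["customer_id"].tolist())
```

Running such assertions inside the pipeline turns silent data-quality drift into loud, immediate failures.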

Accuracy Assurance

  • Source verification: Compare aggregated values against known totals from source systems.
  • Spot checks: Manually verify a random sample of records against the original sources.
  • Statistical profiling: Run summary statistics (min, max, mean, standard deviation) on key columns and compare with expectations.
  • Data lineage: Document where each piece of data came from and what transformations were applied, enabling traceability and debugging.

Best Practice

Always validate your merged dataset immediately after integration. Use df.info(), df.describe(), and df.isnull().sum() to quickly spot issues. Check the shape of the DataFrame before and after each merge operation.

Q1: A data analyst needs to combine customer records from a CRM database with order data from a separate transaction database. The datasets share a customer_id field. Which pandas method is most appropriate?
A) pd.merge()
B) pd.concat()
C) df.append()
D) df.update()
Correct: A) pd.merge() performs SQL-style joins on one or more common columns (keys). Since both datasets share customer_id, a merge (e.g., inner, left, or outer join) is the correct approach. pd.concat() is used for stacking DataFrames vertically or horizontally, not for key-based joins.
Q2: An ETL pipeline has three stages. Which of the following correctly describes the "Transform" stage?
A) Pulling raw data from source databases and APIs
B) Cleaning, reformatting, validating, and standardizing the extracted data
C) Writing the final data into the target data warehouse
D) Backing up the source systems before extraction
Correct: B) In an ETL pipeline: Extract pulls raw data from sources, Transform cleans, reformats, validates, and standardizes the data (handling missing values, type conversions, deduplication, schema alignment), and Load writes the processed data into the target system.
Q3: After performing a left merge of a customers DataFrame (500 rows) with an orders DataFrame on customer_id, the resulting DataFrame has 650 rows. What is the most likely explanation?
A) There are 150 customers with no orders
B) Some customers have multiple orders, creating duplicate customer rows
C) The merge operation added 150 empty rows
D) The orders DataFrame has 150 columns more than the customers DataFrame
Correct: B) In a left merge, every row from the left DataFrame is preserved. If a customer has 3 orders, that customer's row will appear 3 times in the result (once for each matching order). The increase from 500 to 650 rows means 150 additional rows were created by customers who had more than one order. This is a one-to-many relationship.

1.1.3 Data Storage Solutions

Choosing the right storage solution depends on data volume, access patterns, query complexity, budget, and team expertise. Data analysts must understand the options available and their trade-offs.

Data Warehouses vs. Data Lakes

Data Warehouses

A data warehouse is a centralized repository designed for query and analysis rather than transaction processing. Data is typically cleaned, transformed, and structured (schema-on-write) before being loaded.

  • Structure: Highly structured with predefined schemas (star schema, snowflake schema).
  • Data type: Primarily structured data (relational tables with defined columns and types).
  • Optimization: Optimized for read-heavy analytical queries (OLAP — Online Analytical Processing).
  • Users: Business analysts, data analysts, BI teams who write SQL queries and build dashboards.
  • Examples: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse Analytics.
  • Strengths: Fast query performance, data quality enforcement, strong governance.
  • Weaknesses: Expensive to scale, inflexible schema changes, does not handle unstructured data well.

Data Lakes

A data lake is a centralized repository that stores data in its raw, native format. Data structure is applied at read time (schema-on-read) rather than write time.

  • Structure: Schema-on-read; data is stored as-is without predefined structure.
  • Data type: Structured, semi-structured (JSON, XML), and unstructured (images, videos, logs, PDFs).
  • Optimization: Optimized for storage volume and data variety rather than query speed.
  • Users: Data scientists, data engineers, ML engineers who need access to raw data.
  • Examples: Amazon S3 + Athena, Azure Data Lake Storage, Google Cloud Storage + BigQuery, Hadoop HDFS.
  • Strengths: Handles any data type, very cost-effective for large volumes, flexible.
  • Weaknesses: Can become a "data swamp" without governance, slower queries on unprocessed data, requires more technical skill.

Feature | Data Warehouse | Data Lake
Schema | Schema-on-write (predefined) | Schema-on-read (flexible)
Data Types | Structured only | Structured, semi-structured, unstructured
Cost | Higher (compute + storage tightly coupled) | Lower (cheap object storage)
Query Speed | Fast (optimized indexes, partitions) | Slower unless data is processed
Best For | BI dashboards, reporting, SQL analytics | ML, data exploration, raw data archival
Governance | Strong built-in governance | Requires additional tooling

Data Lakehouse

A modern hybrid approach called a data lakehouse combines the flexibility of a data lake with the performance and governance of a data warehouse. Technologies like Delta Lake, Apache Iceberg, and Apache Hudi enable ACID transactions and schema enforcement on top of data lake storage.

File-Based Storage (CSV, Excel, JSON)

For smaller datasets and quick data exchange, file-based formats remain ubiquitous. Every data analyst works with these formats daily.

CSV (Comma-Separated Values)

  • Plain text format where each line is a row and values are separated by commas (or other delimiters like tabs or semicolons).
  • Advantages: Universal compatibility, human-readable, small file size, supported by every programming language and data tool.
  • Limitations: No data type information (everything is text), no support for hierarchical data, no standard for encoding or escaping, large files can be slow to parse.
  • Python: pd.read_csv() and df.to_csv()

Excel (XLSX)

  • Microsoft's spreadsheet format supporting multiple sheets, formulas, formatting, and charts.
  • Advantages: Widely used in business, supports multiple sheets, preserves some data types and formatting.
  • Limitations: Binary format (not human-readable raw), row limit of ~1 million rows, larger file size than CSV, formulas can obscure actual data values.
  • Python: pd.read_excel() and df.to_excel() (requires openpyxl or xlrd)

JSON (JavaScript Object Notation)

  • A lightweight, human-readable format for representing structured and semi-structured data as key-value pairs and arrays.
  • Advantages: Supports nested/hierarchical data, widely used in APIs and web services, language-agnostic.
  • Limitations: Larger file size than CSV for tabular data, parsing nested JSON into flat DataFrames requires additional work, no native support for dates or binary data.
  • Python: pd.read_json(), json.load(), json.loads()

Other Formats

  • Parquet: Columnar storage format optimized for analytics. Very efficient compression and fast query performance. Standard in big data ecosystems. pd.read_parquet()
  • XML (eXtensible Markup Language): Hierarchical, tag-based format common in legacy systems and configuration files.
  • SQLite: A file-based relational database. Useful for local analytics without setting up a server. sqlite3 module in Python.
  • HDF5: Hierarchical Data Format for large numerical datasets, common in scientific computing.
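
SQLite's appeal for local analytics is that the whole database is one file and no server is needed. A minimal sketch using an in-memory database and invented sales data:

```python
import sqlite3
import pandas as pd

# ":memory:" keeps the database in RAM for this demo;
# a file path like "sales.db" would persist it to disk
conn = sqlite3.connect(":memory:")

sales = pd.DataFrame({
    "region": ["East", "West", "East"],
    "revenue": [1200.0, 800.0, 950.0],
})
sales.to_sql("sales", conn, index=False, if_exists="replace")

# Query with plain SQL and get the result back as a DataFrame
summary = pd.read_sql_query(
    "SELECT region, SUM(revenue) AS total FROM sales "
    "GROUP BY region ORDER BY region",
    conn,
)
print(summary)
conn.close()
```

This pattern gives you SQL joins, aggregation, and indexing on datasets too large for comfortable in-memory work, without any infrastructure.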

Cloud Storage Solutions

Cloud storage has transformed how organizations store and access data. Instead of maintaining physical servers, data is stored on remote infrastructure managed by cloud providers.

Major Cloud Storage Services

Provider | Object Storage | Data Warehouse | Data Lake
AWS | Amazon S3 | Amazon Redshift | S3 + Athena / Lake Formation
Google Cloud | Cloud Storage | BigQuery | Cloud Storage + BigQuery
Microsoft Azure | Blob Storage | Azure Synapse Analytics | Azure Data Lake Storage

Benefits of Cloud Storage

  • Scalability: Storage grows and shrinks on demand. No need to provision hardware in advance.
  • Cost efficiency: Pay only for what you use. Tiered pricing (hot, warm, cold, archive) lets you optimize costs based on access frequency.
  • Durability and availability: Major providers offer 99.999999999% (11 nines) durability through data replication across regions.
  • Accessibility: Data can be accessed from anywhere with an internet connection, enabling remote teams and global collaboration.
  • Integration: Cloud storage services integrate natively with analytics tools, ML platforms, and data pipelines.
  • Security: Encryption at rest and in transit, IAM (Identity and Access Management), audit logging.

Study Tip

For the exam, understand the conceptual differences between data warehouses and data lakes, and know that CSV and JSON are the most common file-based formats you will work with in Python. You are not expected to be a cloud architect, but you should understand the role of cloud storage in modern data workflows.

Q1: What is the key difference between a data warehouse and a data lake?
A) Data warehouses are cloud-based; data lakes are on-premises only
B) Data lakes can only store structured data
C) Data warehouses use schema-on-write; data lakes use schema-on-read
D) Data lakes require SQL for all queries
Correct: C) Data warehouses enforce structure at write time (schema-on-write) — data is cleaned and transformed before loading. Data lakes store raw data in any format and apply structure when reading (schema-on-read). Both can be cloud-based or on-premises. Data lakes store structured, semi-structured, and unstructured data.
Q2: When using pd.read_csv(), which parameter would you use to handle a file that uses semicolons instead of commas as delimiters?
A) delimiter=","
B) sep=";"
C) encoding="semicolon"
D) header=";"
Correct: B) The sep parameter (or its alias delimiter) specifies the character used to separate fields in the CSV file. By default it is a comma, but many European CSV files use semicolons. Use pd.read_csv("file.csv", sep=";") to handle this.

Python Code Examples

The following examples demonstrate practical implementations of the data collection, integration, and storage concepts discussed above.

Example 1: Web Scraping with BeautifulSoup

This example demonstrates how to fetch a web page and extract structured data from an HTML table.

# Web Scraping with BeautifulSoup
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Send an HTTP GET request to the target URL
url = "https://example.com/data-table"
response = requests.get(url)

# Step 2: Check that the request was successful
if response.status_code == 200:
    # Step 3: Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Step 4: Find the target table
    table = soup.find("table", {"class": "data-table"})

    # Step 5: Extract headers
    headers = []
    for th in table.find_all("th"):
        headers.append(th.text.strip())

    # Step 6: Extract rows
    rows = []
    for tr in table.find_all("tr")[1:]:  # Skip header row
        cells = [td.text.strip() for td in tr.find_all("td")]
        rows.append(cells)

    # Step 7: Create a DataFrame
    df = pd.DataFrame(rows, columns=headers)
    print(df.head())

    # Step 8: Save to CSV for later use
    df.to_csv("scraped_data.csv", index=False)
else:
    print(f"Request failed with status code: {response.status_code}")

Note

Always check the website's robots.txt file and Terms of Service before scraping. Respect rate limits by adding delays (time.sleep()) between requests. Use response.raise_for_status() for cleaner error handling.

Example 2: Reading from CSV and JSON Files

Reading data from common file formats into pandas DataFrames.

import pandas as pd
import json

# ---- Reading CSV ----
# Basic CSV reading
df_csv = pd.read_csv("sales_data.csv")

# CSV with specific options
df_csv = pd.read_csv(
    "sales_data.csv",
    sep=",",                       # Delimiter (default is comma)
    encoding="utf-8",              # File encoding
    parse_dates=["order_date"],    # Auto-parse date columns
    dtype={"zip_code": str},       # Force column types
    na_values=["N/A", "missing"],  # Custom NA markers
    nrows=1000                     # Read only first 1000 rows
)
print("CSV Shape:", df_csv.shape)
print(df_csv.dtypes)
print(df_csv.head())

# ---- Reading JSON ----
# Flat JSON (array of records)
df_json = pd.read_json("products.json")

# Nested JSON requires json module + pd.json_normalize
with open("nested_data.json", "r") as f:
    raw_data = json.load(f)

# Flatten nested structure
df_nested = pd.json_normalize(
    raw_data["results"],
    record_path=["items"],          # Path to nested records
    meta=["order_id", "customer"],  # Include parent fields
    sep="_"                         # Separator for nested keys
)
print("JSON Shape:", df_nested.shape)
print(df_nested.head())

# ---- Reading Excel ----
df_excel = pd.read_excel(
    "report.xlsx",
    sheet_name="Q1_Sales",  # Specific sheet
    skiprows=2,             # Skip header rows
    usecols="A:F"           # Read only columns A through F
)
print("Excel Shape:", df_excel.shape)

Example 3: Combining DataFrames from Multiple Sources

Merging, concatenating, and joining DataFrames to create unified datasets.

import pandas as pd

# ---- Create sample DataFrames ----
# Customer data from the CRM system
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "region": ["East", "West", "East", "South", "West"]
})

# Order data from the transaction database
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "customer_id": [101, 102, 101, 103, 106],
    "amount": [250.00, 150.75, 89.99, 320.50, 175.00]
})

# ---- MERGE: SQL-style joins ----
# Inner join: only matching customer_ids
inner = pd.merge(customers, orders, on="customer_id", how="inner")
print("Inner join shape:", inner.shape)  # 4 rows (customer 106 and orderless customers excluded)

# Left join: all customers, with orders where available
left = pd.merge(customers, orders, on="customer_id", how="left")
print("Left join shape:", left.shape)  # 6 rows (Alice appears twice; Diana, Eve have NaN orders)

# Outer join: all records from both tables
outer = pd.merge(customers, orders, on="customer_id", how="outer")
print("Outer join shape:", outer.shape)  # 7 rows (also keeps order 1005 for unknown customer 106)

# ---- CONCAT: Stacking DataFrames ----
# Data from different time periods with the same schema
q1_sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "revenue": [45000, 52000, 48000]
})
q2_sales = pd.DataFrame({
    "month": ["Apr", "May", "Jun"],
    "revenue": [51000, 49000, 55000]
})

# Vertical concatenation (stacking rows)
all_sales = pd.concat([q1_sales, q2_sales], ignore_index=True)
print(all_sales)

# ---- Verify integration quality ----
print("\n--- Integration Quality Checks ---")
print("Null values:\n", outer.isnull().sum())
print("Duplicate rows:", outer.duplicated().sum())
print("Shape:", outer.shape)

Example 4: API Data Retrieval with Requests

Fetching data from a REST API, handling pagination, and converting the response to a DataFrame.

import requests
import pandas as pd

# ---- Basic API Request ----
url = "https://api.example.com/v1/employees"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
params = {
    "department": "engineering",
    "limit": 100
}

response = requests.get(url, headers=headers, params=params)
response.raise_for_status()  # Raise exception for HTTP errors

# Parse JSON response
data = response.json()
df = pd.DataFrame(data["employees"])
print(df.head())

# ---- Handling Paginated API Responses ----
def fetch_all_pages(base_url, headers):
    """Fetch all pages from a paginated API endpoint."""
    all_records = []
    page = 1
    while True:
        params = {"page": page, "per_page": 50}
        response = requests.get(base_url, headers=headers, params=params)
        response.raise_for_status()
        data = response.json()
        records = data.get("results", [])
        if not records:
            break  # No more data
        all_records.extend(records)
        print(f"Fetched page {page} - {len(records)} records")
        # Check if there are more pages
        if page >= data.get("total_pages", page):
            break
        page += 1
    return pd.DataFrame(all_records)

# Usage
api_url = "https://api.example.com/v1/transactions"
df_api = fetch_all_pages(api_url, headers)
print(f"Total records fetched: {len(df_api)}")

# ---- Error Handling Pattern ----
try:
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.Timeout:
    print("Request timed out. Try again later.")
except requests.exceptions.HTTPError as e:
    print(f"HTTP error occurred: {e}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

Study Tip

Know the difference between requests.get() and requests.post(). GET retrieves data from a server (read-only), while POST sends data to a server (typically to create a resource; PUT and PATCH are used for updates). For the exam, focus on GET requests for data retrieval. Always use response.raise_for_status() or check response.status_code for error handling.
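The structural difference between the two methods can be seen without sending anything over the network, by building prepared requests and inspecting them. This is a sketch using the hypothetical httpbin.org endpoints as placeholders:

```python
import requests

# Prepared (unsent) requests make the GET/POST difference visible:
# GET carries its data in the URL query string; POST carries it in the body.
get_req = requests.Request(
    "GET", "https://httpbin.org/get", params={"q": "sales"}
).prepare()
post_req = requests.Request(
    "POST", "https://httpbin.org/post", json={"name": "Alice"}
).prepare()

print(get_req.method, get_req.url)     # GET with ?q=sales appended to the URL
print(post_req.method, post_req.body)  # POST with a JSON payload in the body
```

To actually send them you would call requests.get()/requests.post() directly, which build and dispatch the same prepared requests in one step.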

Practice Quiz: Data Collection, Integration, and Storage

Test your understanding of the concepts covered in this topic. Click on the option you believe is correct.

Q1: Which data collection method is most appropriate for gathering in-depth, nuanced understanding of customer experiences?
A) Online survey with closed-ended questions
B) Semi-structured interviews
C) Web scraping of review websites
D) Automated log file analysis
Correct: B) Semi-structured interviews allow the interviewer to explore topics deeply with follow-up questions while maintaining a consistent framework. They produce rich qualitative data that captures nuances, emotions, and context that closed-ended surveys and automated methods cannot provide.
Q2: Under GDPR, which of the following is NOT a right granted to data subjects?
A) Right to access their personal data
B) Right to erasure (right to be forgotten)
C) Right to sell their personal data to third parties
D) Right to data portability
Correct: C) GDPR grants individuals the right to access, rectify, erase, restrict processing, data portability, and object to processing. It does NOT grant individuals the right to sell their own data. GDPR is about protecting personal data, not commercializing it.
Q3: A data analyst needs to combine customer records from a CRM database with order data from a separate transaction database. The datasets share a customer_id field. Which pandas method is most appropriate?
A) pd.merge()
B) pd.concat()
C) df.append()
D) df.update()
Correct: A) pd.merge() performs SQL-style joins on one or more common columns (keys). Since both datasets share customer_id, a merge (e.g., inner, left, or outer join) is the correct approach. pd.concat() is used for stacking DataFrames vertically or horizontally, not for key-based joins.
Q4: What is the key difference between a data warehouse and a data lake?
A) Data warehouses are cloud-based; data lakes are on-premises only
B) Data lakes can only store structured data
C) Data warehouses use schema-on-write; data lakes use schema-on-read
D) Data lakes require SQL for all queries
Correct: C) Data warehouses enforce structure at write time (schema-on-write) — data is cleaned and transformed before loading. Data lakes store raw data in any format and apply structure when reading (schema-on-read). Both can be cloud-based or on-premises. Data lakes store structured, semi-structured, and unstructured data.
Q5: Which Python library is used to parse HTML content when web scraping?
A) requests
B) BeautifulSoup
C) pandas
D) json
Correct: B) BeautifulSoup (from the bs4 package) is used to parse and navigate HTML/XML documents. The requests library is used to fetch the web page (HTTP request), but it does not parse the HTML structure. After fetching with requests, you pass the content to BeautifulSoup for parsing.
Q6: A researcher collects data only from people who voluntarily visit a shopping mall on a Saturday. This is an example of:
A) Simple random sampling
B) Stratified sampling
C) Convenience sampling
D) Cluster sampling
Correct: C) Convenience sampling selects participants based on ease of access rather than a systematic or random method. Surveying only mall visitors on a Saturday excludes people who do not visit malls, work on Saturdays, or live far away — making the sample non-representative of the general population.
Q7: Which of the following is an example of pseudonymization rather than true anonymization?
A) Removing all names and replacing them with nothing
B) Replacing patient names with coded IDs while keeping a separate mapping file
C) Reporting only aggregate statistics (averages, totals) without individual records
D) Generalizing exact ages into 10-year age brackets
Correct: B) Pseudonymization replaces identifiers with pseudonyms (coded IDs) while maintaining a separate key that can re-link the data to individuals. This makes re-identification possible with the key. True anonymization is irreversible — there is no way to trace data back to an individual. Under GDPR, pseudonymized data is still considered personal data.
Q8: When using pd.read_csv(), which parameter would you use to handle a file that uses semicolons instead of commas as delimiters?
A) delimiter=","
B) sep=";"
C) encoding="semicolon"
D) header=";"
Correct: B) The sep parameter (or its alias delimiter) specifies the character used to separate fields in the CSV file. By default it is a comma, but many European CSV files use semicolons. Use pd.read_csv("file.csv", sep=";") to handle this.
Q9: An ETL pipeline has three stages. Which of the following correctly describes the "Transform" stage?
A) Pulling raw data from source databases and APIs
B) Cleaning, reformatting, validating, and standardizing the extracted data
C) Writing the final data into the target data warehouse
D) Backing up the source systems before extraction
Correct: B) In an ETL pipeline: Extract pulls raw data from sources, Transform cleans, reformats, validates, and standardizes the data (handling missing values, type conversions, deduplication, schema alignment), and Load writes the processed data into the target system.
Q10: After performing a left merge of a customers DataFrame (500 rows) with an orders DataFrame on customer_id, the resulting DataFrame has 650 rows. What is the most likely explanation?
A) There are 150 customers with no orders
B) Some customers have multiple orders, creating duplicate customer rows
C) The merge operation added 150 empty rows
D) The orders DataFrame has 150 columns more than the customers DataFrame
Correct: B) In a left merge, every row from the left DataFrame is preserved. If a customer has 3 orders, that customer's row will appear 3 times in the result (once for each matching order). The increase from 500 to 650 rows means 150 additional rows were created by customers who had more than one order. This is a one-to-many relationship.
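The row-multiplication behavior described in the Q10 answer is easy to reproduce with a small hypothetical dataset; pandas' validate parameter can also assert the relationship you expect and fail loudly when it is violated:

```python
import pandas as pd

# Hypothetical data: 3 customers, one of whom has 3 orders
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"]
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "amount": [10.0, 20.0, 30.0, 15.0]
})

# Left merge: Alice's row is repeated once per matching order
left = pd.merge(customers, orders, on="customer_id", how="left")
print(len(left))  # 5 rows: Alice x3, Bob x1, Charlie x1 (NaN amount)

# validate="one_to_many" checks that the left key is unique and
# raises MergeError if customer_id were duplicated in customers
checked = pd.merge(customers, orders, on="customer_id",
                   how="left", validate="one_to_many")
```

Comparing len(left) to len(customers) after a merge is a quick sanity check for unintended one-to-many (or many-to-many) fan-out.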
