
Data Validation and Integrity

Block 1: Data Acquisition and Pre-Processing

Topic 1.3 · 2 Objectives

1.3.1 Data Validation Methods

Learn type checks, range checks, cross-reference validation, early validation benefits, and assert/try-except patterns for data validation.

What Is Data Validation?

Data validation is the process of checking whether data conforms to a set of predefined rules, constraints, or expectations before it is used in analysis, stored in a database, or passed to downstream systems. Validation acts as a gatekeeper: it prevents corrupt, incomplete, or nonsensical data from entering your pipeline.

At its core, validation answers one question: "Does this data look the way it should?" The three most common categories of validation are:

Validation Type        | What It Checks                                             | Example
-----------------------|------------------------------------------------------------|---------------------------------------
Type validation        | Data is of the correct type (int, float, str, date, etc.)  | Age should be an integer, not a string
Range validation       | Values fall within an acceptable range                     | Temperature between −50 and 60 °C
Cross-field validation | Logical consistency between two or more fields             | end_date must be after start_date
Exam Note: This course expects you to know the difference between type, range, and cross-field validation and to identify the appropriate method for a given scenario.

Validation Tools and Approaches

In Python-based data workflows, validation is typically implemented through:

- Type checks with isinstance() or pandas dtype inspection
- Range checks on numeric and date values
- Cross-reference (cross-field) checks between related columns
- assert statements and try/except blocks that enforce rules at runtime

Type Checks: Ensuring Correct Data Types

Type validation ensures that every value in your dataset is of the expected Python or pandas type. When data is loaded from CSVs, APIs, or user input, types are often inferred — and frequently inferred incorrectly. A column that should contain integers may be read as strings. A date column may come through as plain text.

Using isinstance() in Python

The built-in isinstance() function checks whether an object belongs to a specified type (or tuple of types). It returns True or False.

# Basic isinstance() checks
age = 25
name = "Alice"
salary = 55000.50

print(isinstance(age, int))       # True
print(isinstance(name, str))      # True
print(isinstance(salary, float))  # True

# Check against multiple types
print(isinstance(age, (int, float)))  # True - accepts either

# Validation function using isinstance
def validate_record(record):
    """Validate a single data record."""
    errors = []
    if not isinstance(record['id'], int):
        errors.append("id must be an integer")
    if not isinstance(record['name'], str):
        errors.append("name must be a string")
    if not isinstance(record['score'], (int, float)):
        errors.append("score must be numeric")
    return errors

record = {'id': 1, 'name': "Alice", 'score': 92.5}
print(validate_record(record))
# [] (no errors)

bad_record = {'id': "one", 'name': 42, 'score': "high"}
print(validate_record(bad_record))
# ['id must be an integer', 'name must be a string', 'score must be numeric']

Using pandas dtype Checks

When working with DataFrames, you validate types through the .dtypes attribute or the .select_dtypes() method. You can also enforce types with .astype().

import pandas as pd

df = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [55000, 62000, 48000],
    'hire_date': ['2020-01-15', '2019-06-01', '2021-03-20']
})

# Inspect current dtypes
print(df.dtypes)
# employee_id     int64
# name           object
# salary          int64
# hire_date      object   <-- should be datetime!

# Define expected dtypes
expected_dtypes = {
    'employee_id': 'int64',
    'name': 'object',
    'salary': 'int64',
    'hire_date': 'datetime64[ns]'
}

# Validate and fix dtypes
def validate_dtypes(df, expected):
    """Check if DataFrame columns match expected dtypes."""
    mismatches = {}
    for col, dtype in expected.items():
        if col not in df.columns:
            mismatches[col] = "column missing"
        elif str(df[col].dtype) != dtype:
            mismatches[col] = f"expected {dtype}, got {df[col].dtype}"
    return mismatches

# Convert hire_date to datetime first
df['hire_date'] = pd.to_datetime(df['hire_date'])
print(validate_dtypes(df, expected_dtypes))
# {} (all match)
Best Practice: Always check and explicitly convert dtypes immediately after loading data. Use pd.to_datetime() for dates, pd.to_numeric() for numbers, and .astype() for categorical or string conversions.
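As a quick illustration of that practice, here is a minimal sketch (the column names are made up) that converts freshly loaded string columns to their intended types right after loading:

```python
import pandas as pd

# Hypothetical raw data as it might arrive from read_csv: every column a string.
df = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'amount': ['19.99', '5.50', '12.00'],
    'order_date': ['2024-01-05', '2024-01-06', '2024-01-07'],
})

# Convert explicitly, immediately after loading
df['order_id'] = pd.to_numeric(df['order_id']).astype('int64')
df['amount'] = pd.to_numeric(df['amount'])           # becomes float64
df['order_date'] = pd.to_datetime(df['order_date'])  # becomes datetime64[ns]

print(df.dtypes)
```

Converting up front means every later step can rely on the types instead of re-checking them.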

Range Checks: Ensuring Values Fall Within Expected Ranges

Range validation ensures that numeric or date values fall within acceptable boundaries. Values outside these bounds are likely errors — for example, a human age of 250 or a percentage greater than 100.

import pandas as pd

# Simple range check function
def validate_range(value, min_val, max_val, field_name="value"):
    """Check if a value is within the expected range."""
    if value is None:
        return f"{field_name} is missing (None)"
    if not (min_val <= value <= max_val):
        return f"{field_name}={value} is out of range [{min_val}, {max_val}]"
    return None  # No error

# Test the function
print(validate_range(25, 0, 120, "age"))     # None (valid)
print(validate_range(-5, 0, 120, "age"))     # "age=-5 is out of range [0, 120]"
print(validate_range(105, 0, 100, "score"))  # "score=105 is out of range [0, 100]"

# Range validation on a pandas DataFrame
df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [20, -3, 22, 200],
    'grade': [85, 92, 110, 78]
})

# Flag rows with out-of-range ages (valid: 0-120)
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
print("Invalid ages:")
print(invalid_age)
#    student  age  grade
# 1      Bob   -3     92
# 3    Diana  200     78

# Flag rows with out-of-range grades (valid: 0-100)
invalid_grade = df[(df['grade'] < 0) | (df['grade'] > 100)]
print("Invalid grades:")
print(invalid_grade)
#    student  age  grade
# 2  Charlie   22    110

# Using .between() for cleaner range checks
df['age_valid'] = df['age'].between(0, 120)
df['grade_valid'] = df['grade'].between(0, 100)
print(df)
Common Pitfall: Be careful with floating-point comparisons. A value like 99.999999999 might technically be within range but could indicate a precision issue. Always consider how your data is produced when setting boundaries.
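One way to soften strict boundary comparisons is a tolerance-aware check with math.isclose() — a sketch, not a universal rule, since the right tolerance depends on how your data is produced:

```python
import math

value = 100.00000000001  # e.g. produced by summing many floats
upper_bound = 100.0

# Strict comparison flags the value as out of range
print(value <= upper_bound)  # False

# Tolerance-aware check treats values at (or negligibly past) the boundary as valid
within = value <= upper_bound or math.isclose(value, upper_bound, rel_tol=1e-9)
print(within)  # True
```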

Cross-Reference Checks: Validating Data Against Other Fields

Cross-field (or cross-reference) validation checks the logical relationship between two or more columns in a dataset. Unlike type or range checks that examine a single field in isolation, cross-field checks ensure that the combination of values makes sense.

Common examples include:

- end_date must be after start_date
- A project with status 'active' must have a positive budget
- A discount must not exceed the order total it applies to

import pandas as pd

# Cross-field validation: end_date > start_date
df = pd.DataFrame({
    'project': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'start_date': pd.to_datetime(
        ['2024-01-01', '2024-03-15', '2024-06-01', '2024-09-01']
    ),
    'end_date': pd.to_datetime(
        ['2024-06-30', '2024-02-28', '2024-12-31', '2024-08-15']
    )
})

# Find rows where end_date is before start_date
invalid_dates = df[df['end_date'] < df['start_date']]
print("Projects with invalid date ranges:")
print(invalid_dates)
#   project start_date   end_date
# 1    Beta 2024-03-15 2024-02-28
# 3   Delta 2024-09-01 2024-08-15

# Cross-field validation function for multiple rules
def validate_cross_fields(df):
    """Apply multiple cross-field validation rules."""
    errors = []

    # Rule 1: end_date must be >= start_date
    mask = df['end_date'] < df['start_date']
    if mask.any():
        bad_rows = df[mask].index.tolist()
        errors.append(f"end_date < start_date at rows: {bad_rows}")

    # Rule 2: budget must be positive if status is 'active'
    if 'budget' in df.columns and 'status' in df.columns:
        mask2 = (df['status'] == 'active') & (df['budget'] <= 0)
        if mask2.any():
            bad_rows = df[mask2].index.tolist()
            errors.append(f"Active projects with non-positive budget at rows: {bad_rows}")

    return errors

errors = validate_cross_fields(df)
for e in errors:
    print(f"VALIDATION ERROR: {e}")
Real-World Pattern: In production pipelines, cross-field validation rules are often stored in a configuration file or database table so they can be updated without changing code. Each rule is a function or expression that returns True/False for each row.
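As a minimal sketch of that pattern (the rule names and the RULES list are hypothetical — in production they might be loaded from YAML or a database table), rules can be kept as data and applied generically:

```python
import pandas as pd

# Each rule has a name and a vectorized check that returns True for rows that PASS
RULES = [
    {'name': 'dates_ordered', 'check': lambda df: df['end_date'] >= df['start_date']},
    {'name': 'amount_positive', 'check': lambda df: df['amount'] > 0},
]

def apply_rules(df, rules):
    """Return {rule_name: list of failing row indices}."""
    failures = {}
    for rule in rules:
        passed = rule['check'](df)
        bad = df.index[~passed].tolist()
        if bad:
            failures[rule['name']] = bad
    return failures

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2024-01-01', '2024-05-01']),
    'end_date': pd.to_datetime(['2024-02-01', '2024-04-01']),
    'amount': [100, -5],
})
print(apply_rules(df, RULES))
# {'dates_ordered': [1], 'amount_positive': [1]}
```

Adding or retiring a rule then means editing the rule table, not the pipeline code.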

Benefits of Early Type Checks in Ingestion Scripts

Performing validation as early as possible in a data pipeline — ideally during ingestion (the point where data first enters your system) — offers several critical advantages:

Benefit              | Explanation
---------------------|------------------------------------------------------------------------------------------------
Fail fast            | Bad data is caught immediately, before it contaminates downstream processes or storage
Lower debugging cost | Errors detected at the source are much easier to trace and fix than errors found in final reports
Data trust           | Analysts and stakeholders can trust that data passing validation is clean and consistent
Reduced rework       | Catching a type mismatch during ingestion prevents hours of wasted analysis on corrupt data
Cleaner pipelines    | Downstream code can assume correct types, reducing defensive programming throughout the codebase
# Example: validation during data ingestion
import pandas as pd

def ingest_csv(filepath, schema):
    """Load a CSV with upfront validation.

    Parameters:
        filepath (str): Path to the CSV file
        schema (dict): Expected column names mapped to dtypes

    Returns:
        pd.DataFrame: Validated DataFrame

    Raises:
        ValueError: If validation fails
    """
    df = pd.read_csv(filepath)

    # Step 1: Check all required columns exist
    missing_cols = set(schema.keys()) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    # Step 2: Convert and validate dtypes
    for col, dtype in schema.items():
        try:
            if dtype == 'datetime':
                df[col] = pd.to_datetime(df[col])
            elif dtype == 'numeric':
                df[col] = pd.to_numeric(df[col])
            else:
                df[col] = df[col].astype(dtype)
        except (ValueError, TypeError) as e:
            raise ValueError(f"Column '{col}' failed {dtype} conversion: {e}")

    # Step 3: Check for unexpected nulls
    null_counts = df[list(schema.keys())].isnull().sum()
    cols_with_nulls = null_counts[null_counts > 0]
    if not cols_with_nulls.empty:
        print(f"WARNING: Null values found:\n{cols_with_nulls}")

    return df

# Usage
schema = {
    'employee_id': 'int64',
    'name': 'object',
    'salary': 'numeric',
    'hire_date': 'datetime'
}
# df = ingest_csv('employees.csv', schema)

Using Assert Statements and Try/Except for Validation

Python provides two built-in mechanisms that are commonly used for data validation:

Assert Statements

An assert statement tests a condition and raises an AssertionError if the condition is False. Assertions are ideal for catching programming errors and enforcing assumptions during development.

# Assert statements for quick validation checks

# Basic assertions
data = {'age': 25, 'name': 'Alice', 'score': 88}
assert isinstance(data['age'], int), "Age must be an integer"
assert 0 <= data['age'] <= 120, "Age must be between 0 and 120"
assert len(data['name']) > 0, "Name cannot be empty"
assert 0 <= data['score'] <= 100, "Score must be between 0 and 100"

# Schema validation with assertions
def validate_schema(df, required_columns):
    """Assert that a DataFrame contains all required columns."""
    for col in required_columns:
        assert col in df.columns, f"Missing required column: {col}"
    print("Schema validation passed!")

# DataFrame shape assertions
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
assert df.shape[0] > 0, "DataFrame must not be empty"
assert df.shape[1] == 2, "DataFrame must have exactly 2 columns"
Important Caveat: assert statements are removed when Python runs with the -O (optimize) flag. Never use assert for validation that must always run in production. Use explicit if/raise instead.
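A production-safe equivalent of the assert-based checks above uses explicit if/raise, which always runs even under python -O (the function name here is illustrative):

```python
def validate_age(age):
    """Raise if age is not a valid integer in [0, 120]; otherwise return it."""
    if not isinstance(age, int):
        raise TypeError(f"age must be an integer, got {type(age).__name__}")
    if not 0 <= age <= 120:
        raise ValueError(f"age={age} is out of range [0, 120]")
    return age

validate_age(25)  # passes silently

try:
    validate_age(200)
except ValueError as e:
    print(f"Caught: {e}")  # Caught: age=200 is out of range [0, 120]
```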

Try/Except for Robust Validation

The try/except pattern handles validation errors gracefully without crashing the program. This is essential in production data pipelines where you want to log errors and continue processing valid records.

import pandas as pd

# Try/except for type conversion validation
def safe_convert_to_numeric(series, column_name):
    """Attempt to convert a Series to numeric, returning a report."""
    try:
        converted = pd.to_numeric(series, errors='raise')
        print(f"Column '{column_name}': successfully converted to numeric")
        return converted
    except ValueError as e:
        print(f"Column '{column_name}': conversion failed - {e}")
        # Coerce bad values to NaN instead of failing
        converted = pd.to_numeric(series, errors='coerce')
        bad_count = converted.isna().sum() - series.isna().sum()
        print(f"  {bad_count} values coerced to NaN")
        return converted

# Example with mixed data
s = pd.Series(['10', '20', 'thirty', '40', 'N/A'])
result = safe_convert_to_numeric(s, 'quantity')
print(result)
# 0    10.0
# 1    20.0
# 2     NaN
# 3    40.0
# 4     NaN

# Comprehensive validation with try/except
def validate_and_clean(records):
    """Validate a list of records, returning valid and invalid sets."""
    valid = []
    invalid = []
    for i, rec in enumerate(records):
        try:
            # Type checks
            assert isinstance(rec['id'], int), "id must be int"
            assert isinstance(rec['value'], (int, float)), "value must be numeric"
            # Range checks
            assert rec['value'] >= 0, "value must be non-negative"
            valid.append(rec)
        except (AssertionError, KeyError) as e:
            invalid.append({'index': i, 'record': rec, 'error': str(e)})
    print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")
    return valid, invalid

records = [
    {'id': 1, 'value': 100},
    {'id': 'two', 'value': 200},  # bad id type
    {'id': 3, 'value': -50},      # negative value
    {'id': 4, 'value': 400},
]
valid, invalid = validate_and_clean(records)
# Valid: 2, Invalid: 2

1.3.2 Data Integrity

Understand entity, referential, and domain integrity; schema validation; and data quality frameworks.

Concept of Data Integrity and Its Importance

Data integrity refers to the accuracy, consistency, and reliability of data throughout its lifecycle — from creation and storage through processing and retrieval. While validation checks individual values, integrity focuses on the broader question: "Can we trust this data as a whole?"

Data integrity encompasses four key dimensions:

Integrity Type         | Definition                                          | Enforced By
-----------------------|-----------------------------------------------------|--------------------------------
Entity integrity       | Every row is uniquely identifiable                  | Primary keys
Referential integrity  | Relationships between tables are consistent         | Foreign keys
Domain integrity       | Values belong to a defined set of acceptable values | Constraints, enums, check rules
User-defined integrity | Business-specific rules are satisfied               | Custom validation logic
Why Data Integrity Matters: Without integrity, analysis results are unreliable. A broken foreign key means your JOIN produces incorrect counts. A duplicated primary key means aggregations are inflated. Violations cascade through every query and report that touches the data.

Referential Integrity in Databases

Referential integrity ensures that relationships between tables remain valid. If table B references table A through a foreign key, then every foreign key value in table B must correspond to an existing primary key value in table A.

Common violations include:

- Orphan records: a child row references a parent key that does not exist
- Deleting a parent row while child rows still reference it
- Updating a primary key without updating the foreign keys that point to it

import pandas as pd

# Simulating referential integrity checks with pandas

# Parent table: customers
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

# Child table: orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 5, 3, 99],  # 5 and 99 don't exist!
    'amount': [250, 180, 340, 120, 500]
})

# Check referential integrity: all order customer_ids must exist in customers
valid_ids = set(customers['customer_id'])
order_ids = set(orders['customer_id'])
orphan_ids = order_ids - valid_ids

if orphan_ids:
    print("REFERENTIAL INTEGRITY VIOLATION!")
    print(f"Orphan customer_ids in orders: {orphan_ids}")
    orphan_orders = orders[orders['customer_id'].isin(orphan_ids)]
    print(orphan_orders)
else:
    print("Referential integrity: OK")

# Output:
# REFERENTIAL INTEGRITY VIOLATION!
# Orphan customer_ids in orders: {99, 5}
#    order_id  customer_id  amount
# 2       103            5     340
# 4       105           99     500

# Reusable referential integrity checker
def check_referential_integrity(child_df, child_col, parent_df, parent_col):
    """Check if all values in child_col exist in parent_col."""
    child_values = set(child_df[child_col].dropna())
    parent_values = set(parent_df[parent_col])
    orphans = child_values - parent_values
    if orphans:
        return {
            'valid': False,
            'orphan_values': orphans,
            'orphan_count': len(orphans)
        }
    return {'valid': True}

Entity Integrity (Primary Keys)

Entity integrity requires that every table has a primary key and that the primary key is unique and not null. This ensures that every record can be uniquely identified. Violations include duplicate IDs and null primary keys.

import pandas as pd

# Entity integrity checks
df = pd.DataFrame({
    'employee_id': [1, 2, 3, 2, None, 5],  # duplicate 2, null value
    'name': ['Alice', 'Bob', 'Charlie', 'Bob2', 'Eve', 'Frank'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'IT']
})

def check_entity_integrity(df, pk_column):
    """Verify entity integrity for a given primary key column."""
    issues = []

    # Check 1: Primary key must not contain nulls
    null_count = df[pk_column].isna().sum()
    if null_count > 0:
        issues.append(f"NULL values in primary key: {null_count} row(s)")

    # Check 2: Primary key must be unique
    duplicate_count = df[pk_column].duplicated().sum()
    if duplicate_count > 0:
        dupes = df[df[pk_column].duplicated(keep=False)]
        issues.append(f"Duplicate primary keys: {duplicate_count} duplicate row(s)")
        issues.append(f"Duplicate values: {df[pk_column][df[pk_column].duplicated()].tolist()}")

    if issues:
        print("ENTITY INTEGRITY VIOLATIONS:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Entity integrity: OK")
    return len(issues) == 0

check_entity_integrity(df, 'employee_id')
# ENTITY INTEGRITY VIOLATIONS:
#   - NULL values in primary key: 1 row(s)
#   - Duplicate primary keys: 1 duplicate row(s)
#   - Duplicate values: [2.0]

Domain Integrity (Valid Values)

Domain integrity ensures that all values in a column come from a defined set of acceptable values (the "domain"). This includes data type restrictions, allowable value ranges, format patterns, and enumerated sets.

import pandas as pd

# Domain integrity: Allowed values for categorical columns
VALID_DEPARTMENTS = {'HR', 'IT', 'Sales', 'Marketing', 'Finance'}
VALID_STATUSES = {'active', 'inactive', 'on_leave'}

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'department': ['HR', 'IT', 'Accounting', 'Sales'],  # 'Accounting' not valid!
    'status': ['active', 'fired', 'active', 'on_leave'],  # 'fired' not valid!
    'salary': [55000, 62000, -1000, 48000]  # negative salary!
})

def check_domain_integrity(df, domain_rules):
    """Check domain integrity against a set of rules.

    domain_rules: dict with column names as keys and either:
        - a set of valid values (for categorical columns)
        - a tuple (min, max) for numeric ranges
    """
    violations = []
    for col, rule in domain_rules.items():
        if col not in df.columns:
            violations.append(f"Column '{col}' not found")
            continue
        if isinstance(rule, set):
            # Categorical check: values must be in the allowed set
            invalid = df[~df[col].isin(rule)]
            if not invalid.empty:
                bad_vals = invalid[col].unique().tolist()
                violations.append(
                    f"'{col}' contains invalid values: {bad_vals}"
                )
        elif isinstance(rule, tuple) and len(rule) == 2:
            # Numeric range check
            min_val, max_val = rule
            invalid = df[~df[col].between(min_val, max_val)]
            if not invalid.empty:
                violations.append(
                    f"'{col}' has {len(invalid)} values outside [{min_val}, {max_val}]"
                )
    return violations

# Define domain rules
rules = {
    'department': VALID_DEPARTMENTS,
    'status': VALID_STATUSES,
    'salary': (0, 500000)
}

violations = check_domain_integrity(df, rules)
for v in violations:
    print(f"DOMAIN VIOLATION: {v}")
# DOMAIN VIOLATION: 'department' contains invalid values: ['Accounting']
# DOMAIN VIOLATION: 'status' contains invalid values: ['fired']
# DOMAIN VIOLATION: 'salary' has 1 values outside [0, 500000]

Schema Validation Approaches

A data schema defines the expected structure of a dataset: column names, data types, constraints, and relationships. Schema validation checks incoming data against this specification to ensure conformity.

Schema validation typically covers:

- Presence of all required columns
- Correct data types for each column
- Nullability (which columns may contain missing values)
- Uniqueness constraints
- Allowed values and numeric ranges

import pandas as pd

# Comprehensive schema validation for DataFrames
class DataFrameSchema:
    """A simple schema validator for pandas DataFrames."""

    def __init__(self):
        self.columns = {}

    def add_column(self, name, dtype=None, nullable=True, unique=False,
                   allowed_values=None, min_val=None, max_val=None):
        """Define expectations for a column."""
        self.columns[name] = {
            'dtype': dtype,
            'nullable': nullable,
            'unique': unique,
            'allowed_values': allowed_values,
            'min_val': min_val,
            'max_val': max_val
        }
        return self  # enable method chaining

    def validate(self, df):
        """Validate a DataFrame against this schema."""
        errors = []
        for col_name, rules in self.columns.items():
            # Check column exists
            if col_name not in df.columns:
                errors.append(f"Missing column: '{col_name}'")
                continue
            col = df[col_name]

            # Check dtype (only if one was specified)
            if rules['dtype'] is not None and str(col.dtype) != rules['dtype']:
                errors.append(f"'{col_name}' has dtype {col.dtype}, expected {rules['dtype']}")

            # Check nullable
            if not rules['nullable'] and col.isna().any():
                errors.append(f"'{col_name}' contains nulls but is non-nullable")

            # Check unique
            if rules['unique'] and col.duplicated().any():
                errors.append(f"'{col_name}' contains duplicate values but must be unique")

            # Check allowed values
            if rules['allowed_values'] is not None:
                invalid = col[~col.isin(rules['allowed_values']) & col.notna()]
                if not invalid.empty:
                    errors.append(
                        f"'{col_name}' has invalid values: {invalid.unique().tolist()}"
                    )

            # Check range
            if rules['min_val'] is not None:
                below = col[col < rules['min_val']]
                if not below.empty:
                    errors.append(f"'{col_name}' has values below {rules['min_val']}")
            if rules['max_val'] is not None:
                above = col[col > rules['max_val']]
                if not above.empty:
                    errors.append(f"'{col_name}' has values above {rules['max_val']}")
        return errors

# Define and use the schema
schema = DataFrameSchema()
schema.add_column('id', nullable=False, unique=True)
schema.add_column('name', nullable=False)
schema.add_column('age', min_val=0, max_val=120)
schema.add_column('status', allowed_values=['active', 'inactive'])

df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'name': ['Alice', None, 'Charlie', 'Diana'],
    'age': [25, 130, 35, -5],
    'status': ['active', 'inactive', 'pending', 'active']
})

errors = schema.validate(df)
for e in errors:
    print(f"SCHEMA ERROR: {e}")
# SCHEMA ERROR: 'id' contains duplicate values but must be unique
# SCHEMA ERROR: 'name' contains nulls but is non-nullable
# SCHEMA ERROR: 'age' has values below 0
# SCHEMA ERROR: 'age' has values above 120
# SCHEMA ERROR: 'status' has invalid values: ['pending']

Data Quality Frameworks

A data quality framework provides a structured approach to measuring, monitoring, and improving data quality across an organization. Rather than ad-hoc checks, a framework defines consistent dimensions of quality that all data should meet.

The most widely recognized data quality dimensions are:

Dimension    | Definition                                      | Example Check
-------------|-------------------------------------------------|----------------------------------------------
Accuracy     | Data correctly represents the real-world entity | Customer address matches postal records
Completeness | All required data is present                    | No null values in mandatory fields
Consistency  | Data does not contradict itself across sources  | Total in summary matches sum of line items
Timeliness   | Data is available when needed and is current    | Records are updated within 24 hours of change
Uniqueness   | No unintended duplicates exist                  | Each customer appears exactly once
Validity     | Data conforms to the expected format and rules  | Email addresses match a valid pattern
import pandas as pd

# Data quality assessment function
def assess_data_quality(df, pk_column=None):
    """Generate a data quality report for a DataFrame."""
    report = {}

    # 1. Completeness: percentage of non-null values per column
    completeness = ((1 - df.isnull().mean()) * 100).round(2)
    report['completeness'] = completeness.to_dict()

    # 2. Uniqueness: percentage of unique values per column
    uniqueness = ((df.nunique() / len(df)) * 100).round(2)
    report['uniqueness'] = uniqueness.to_dict()

    # 3. Entity integrity: duplicate check on primary key
    if pk_column and pk_column in df.columns:
        dup_count = df[pk_column].duplicated().sum()
        null_pk = df[pk_column].isna().sum()
        report['entity_integrity'] = {
            'duplicates': int(dup_count),
            'null_keys': int(null_pk),
            'valid': dup_count == 0 and null_pk == 0
        }

    # 4. Row count and shape
    report['total_rows'] = len(df)
    report['total_columns'] = len(df.columns)
    return report

# Example usage
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'score': [85, 92, 78, None, 95],
    'dept': ['HR', 'IT', 'IT', 'HR', 'IT']
})

quality = assess_data_quality(df, pk_column='id')
print("=== Data Quality Report ===")
print(f"Rows: {quality['total_rows']}, Columns: {quality['total_columns']}")
print("\nCompleteness (% non-null):")
for col, pct in quality['completeness'].items():
    print(f"  {col}: {pct}%")
print(f"\nEntity Integrity: {quality['entity_integrity']}")
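The report above covers completeness and uniqueness; the Validity dimension can be spot-checked the same way. Here is a sketch that flags malformed email addresses with a deliberately loose pattern (a rough filter, not a full RFC 5322 validator):

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'bob@', 'charlie@example.org', 'not-an-email'])

# Loose pattern: something@something.something, with no spaces or extra '@'
pattern = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'
valid = emails.str.match(pattern)
print(valid.tolist())
# [True, False, True, False]

validity_pct = round(valid.mean() * 100, 1)
print(f"Validity: {validity_pct}%")  # Validity: 50.0%
```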

Putting It All Together: Data Integrity Checks Before Database Insertion

In a real pipeline, you validate data comprehensively before inserting it into a database. Here is a complete example that combines type, range, cross-field, entity, domain, and referential checks:

import pandas as pd

def validate_for_insertion(df):
    """Run all validation checks before inserting data into a database.

    Returns a tuple: (is_valid: bool, errors: list, clean_df: DataFrame)
    """
    errors = []

    # 1. Entity integrity - check primary key
    if df['id'].isna().any():
        errors.append("Primary key 'id' contains NULL values")
    if df['id'].duplicated().any():
        dupes = df[df['id'].duplicated()]['id'].tolist()
        errors.append(f"Duplicate IDs found: {dupes}")

    # 2. Type validation
    try:
        df['salary'] = pd.to_numeric(df['salary'], errors='raise')
    except ValueError:
        errors.append("'salary' contains non-numeric values")

    # 3. Range validation
    if (df['salary'] < 0).any():
        errors.append("Negative salary values detected")

    # 4. Domain validation
    valid_depts = {'HR', 'IT', 'Sales', 'Marketing'}
    invalid_depts = set(df['department'].dropna()) - valid_depts
    if invalid_depts:
        errors.append(f"Invalid departments: {invalid_depts}")

    # 5. Cross-field validation
    if 'start_date' in df.columns and 'end_date' in df.columns:
        bad_dates = df[df['end_date'] < df['start_date']]
        if not bad_dates.empty:
            errors.append(f"end_date < start_date in {len(bad_dates)} rows")

    # 6. Completeness check
    required_cols = ['id', 'name', 'department', 'salary']
    for col in required_cols:
        null_count = df[col].isna().sum()
        if null_count > 0:
            errors.append(f"Required column '{col}' has {null_count} null(s)")

    is_valid = len(errors) == 0
    return is_valid, errors, df

# Test it
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['HR', 'IT', 'Sales'],
    'salary': [55000, 62000, 48000]
})

is_valid, errors, clean_df = validate_for_insertion(df)
if is_valid:
    print("All checks passed. Safe to insert into database.")
else:
    print("Validation failed:")
    for e in errors:
        print(f"  - {e}")
Exam Tip: You may encounter scenarios where you need to identify which type of integrity is being violated (entity, referential, or domain). Focus on the definitions: entity = unique identifiers, referential = valid relationships, domain = valid values within a column.

Practice Quiz — Topic 1.3

Test your understanding with 10 multiple-choice questions. The correct answer and an explanation follow each question.

Q1. Which validation method checks whether a value is of the expected data type (e.g., integer, string, float)?
A) Range validation
B) Type validation
C) Cross-field validation
D) Referential validation
Correct: B) Type validation. Type validation uses checks like isinstance() or pandas .dtypes to ensure values are of the correct data type. Range validation checks boundaries, cross-field checks relationships between columns, and referential validation checks foreign key relationships.
Q2. What will isinstance(3.14, (int, float)) return?
A) True
B) False
C) TypeError
D) None
Correct: A) True. When a tuple of types is passed as the second argument, isinstance() returns True if the object is an instance of any type in the tuple. Since 3.14 is a float, and float is in the tuple, the result is True.
Q3. You have an orders table where each order references a customer_id. Some orders reference customer IDs that do not exist in the customers table. Which type of integrity is violated?
A) Entity integrity
B) Referential integrity
C) Domain integrity
D) Type integrity
Correct: B) Referential integrity. Referential integrity ensures that foreign key values in a child table correspond to valid primary key values in the parent table. Orders referencing non-existent customers are "orphan records" that violate referential integrity.
Q4. What is the primary risk of using assert for validation in production code?
A) Assert statements are slower than if/else
B) Assert statements cannot include custom error messages
C) Assert statements are disabled when Python runs with the -O flag
D) Assert statements can only check boolean values
Correct: C) When Python is invoked with the -O (optimize) flag, all assert statements are stripped from the bytecode and never executed. This means validation logic inside assert silently disappears in optimized production builds. Use explicit if/raise for critical validation.
Q5. A dataset has a start_date and end_date column. You need to ensure that the end date is always after the start date. Which validation method applies?
A) Type validation
B) Range validation
C) Cross-field validation
D) Domain validation
Correct: C) Cross-field validation. Cross-field (cross-reference) validation checks the logical relationship between two or more fields. Verifying that end_date > start_date requires comparing values across two columns, which is the definition of cross-field validation.
Q6. Which of the following is an example of entity integrity violation?
A) A salary column contains the value "N/A"
B) Two rows in a table have the same primary key value
C) An order references a customer_id that does not exist
D) A temperature reading is 999 degrees Celsius
Correct: B) Two rows with the same primary key. Entity integrity requires that every row has a unique, non-null primary key. Duplicate primary keys mean two records cannot be distinguished, violating entity integrity. Option A is a domain issue, C is referential, and D is a range issue.
Q7. What does the pandas method df['age'].between(0, 120) return?
A) A filtered DataFrame with only rows where age is between 0 and 120
B) A boolean Series indicating whether each value is between 0 and 120 (inclusive)
C) The count of values between 0 and 120
D) A new Series with values clipped to the range [0, 120]
Correct: B) A boolean Series. The .between() method returns a boolean Series of the same length as the original, with True where the value falls within the specified range (inclusive by default) and False otherwise. It does not filter or modify the data.
Q8. A column called status should only contain values from the set {"active", "inactive", "on_leave"}. A row contains the value "terminated". Which integrity type is violated?
A) Entity integrity
B) Referential integrity
C) Domain integrity
D) Cross-field integrity
Correct: C) Domain integrity. Domain integrity ensures that all values in a column come from a defined set of acceptable values. "terminated" is not in the allowed set, so it violates the domain constraint for the status column.
Q9. What is the main benefit of performing data validation early in the data pipeline (at ingestion)?
A) It makes the code run faster
B) It eliminates the need for database constraints
C) It prevents bad data from propagating through downstream processes
D) It automatically corrects invalid data
Correct: C) It prevents bad data from propagating downstream. Early validation ("fail fast") catches errors at the source before they contaminate storage, analysis, and reports. It does not automatically fix data or replace database constraints, but it dramatically reduces debugging costs and rework.
Q10. In the following code, what happens when the conversion fails?
try:
    df['price'] = pd.to_numeric(df['price'], errors='raise')
except ValueError:
    df['price'] = pd.to_numeric(df['price'], errors='coerce')
A) The program crashes with a ValueError
B) Non-numeric values are dropped from the DataFrame
C) Non-numeric values are replaced with NaN
D) Non-numeric values are replaced with 0
Correct: C) Non-numeric values are replaced with NaN. The try block attempts strict conversion (errors='raise'), which raises a ValueError if any value cannot be converted. The except block catches that error and re-runs the conversion with errors='coerce', which converts invalid values to NaN instead of failing.
