
Data Validation and Integrity

Block 1: Data Acquisition and Pre-Processing

Topic 1.3 · 2 Objectives

1.3.1 Data Validation Methods

Learn type checks, range checks, cross-reference validation, early validation benefits, and assert/try-except patterns for data validation.

What Is Data Validation?

Data validation is the process of checking whether data conforms to a set of predefined rules, constraints, or expectations before it is used in analysis, stored in a database, or passed to downstream systems. Validation acts as a gatekeeper: it prevents corrupt, incomplete, or nonsensical data from entering your pipeline.

At its core, validation answers one question: "Does this data look the way it should?" The three most common categories of validation are:

Validation Type        | What It Checks                                             | Example
-----------------------|------------------------------------------------------------|---------------------------------------
Type validation        | Data is of the correct type (int, float, str, date, etc.)  | Age should be an integer, not a string
Range validation       | Values fall within an acceptable range                     | Temperature between −50 and 60 °C
Cross-field validation | Logical consistency between two or more fields             | end_date must be after start_date
Exam Note: This course expects you to know the difference between type, range, and cross-field validation and to identify the appropriate method for a given scenario.

Validation Tools and Approaches

In Python-based data workflows, validation is typically implemented through:

- Type checks with isinstance() or pandas dtype inspection
- Range checks on numeric and date values
- Cross-reference (cross-field) checks between related columns
- assert statements and try/except blocks that enforce rules at runtime

Type Checks: Ensuring Correct Data Types

Type validation ensures that every value in your dataset is of the expected Python or pandas type. When data is loaded from CSVs, APIs, or user input, types are often inferred — and frequently inferred incorrectly. A column that should contain integers may be read as strings. A date column may come through as plain text.

Using isinstance() in Python

The built-in isinstance() function checks whether an object belongs to a specified type (or tuple of types). It returns True or False.

# Basic isinstance() checks
age = 25
name = "Alice"
salary = 55000.50

print(isinstance(age, int))       # True
print(isinstance(name, str))      # True
print(isinstance(salary, float))  # True

# Check against multiple types
print(isinstance(age, (int, float)))  # True - accepts either

# Validation function using isinstance
def validate_record(record):
    """Validate a single data record."""
    errors = []
    if not isinstance(record['id'], int):
        errors.append("id must be an integer")
    if not isinstance(record['name'], str):
        errors.append("name must be a string")
    if not isinstance(record['score'], (int, float)):
        errors.append("score must be numeric")
    return errors

record = {'id': 1, 'name': "Alice", 'score': 92.5}
print(validate_record(record))
# [] (no errors)

bad_record = {'id': "one", 'name': 42, 'score': "high"}
print(validate_record(bad_record))
# ['id must be an integer', 'name must be a string', 'score must be numeric']

Using pandas dtype Checks

When working with DataFrames, you validate types through the .dtypes attribute or the .select_dtypes() method. You can also enforce types with .astype().

import pandas as pd

df = pd.DataFrame({
    'employee_id': [101, 102, 103],
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [55000, 62000, 48000],
    'hire_date': ['2020-01-15', '2019-06-01', '2021-03-20']
})

# Inspect current dtypes
print(df.dtypes)
# employee_id     int64
# name           object
# salary          int64
# hire_date      object   <-- should be datetime!

# Define expected dtypes
expected_dtypes = {
    'employee_id': 'int64',
    'name': 'object',
    'salary': 'int64',
    'hire_date': 'datetime64[ns]'
}

# Validate and fix dtypes
def validate_dtypes(df, expected):
    """Check if DataFrame columns match expected dtypes."""
    mismatches = {}
    for col, dtype in expected.items():
        if col not in df.columns:
            mismatches[col] = "column missing"
        elif str(df[col].dtype) != dtype:
            mismatches[col] = f"expected {dtype}, got {df[col].dtype}"
    return mismatches

# Convert hire_date to datetime first
df['hire_date'] = pd.to_datetime(df['hire_date'])
print(validate_dtypes(df, expected_dtypes))
# {} (all match)
Best Practice: Always check and explicitly convert dtypes immediately after loading data. Use pd.to_datetime() for dates, pd.to_numeric() for numbers, and .astype() for categorical or string conversions.
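As a quick illustration of that practice, here is a minimal sketch (the column names are made up) that converts freshly loaded string columns to their intended types right after loading:

```python
import pandas as pd

# Hypothetical raw data as it might arrive from read_csv: every column a string.
df = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'amount': ['19.99', '5.50', '12.00'],
    'order_date': ['2024-01-05', '2024-01-06', '2024-01-07'],
})

# Convert explicitly, immediately after loading
df['order_id'] = pd.to_numeric(df['order_id']).astype('int64')
df['amount'] = pd.to_numeric(df['amount'])           # becomes float64
df['order_date'] = pd.to_datetime(df['order_date'])  # becomes datetime64[ns]

print(df.dtypes)
```

Converting up front means every later step can rely on the types instead of re-checking them.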

Range Checks: Ensuring Values Fall Within Expected Ranges

Range validation ensures that numeric or date values fall within acceptable boundaries. Values outside these bounds are likely errors — for example, a human age of 250 or a percentage greater than 100.

import pandas as pd

# Simple range check function
def validate_range(value, min_val, max_val, field_name="value"):
    """Check if a value is within the expected range."""
    if value is None:
        return f"{field_name} is missing (None)"
    if not (min_val <= value <= max_val):
        return f"{field_name}={value} is out of range [{min_val}, {max_val}]"
    return None  # No error

# Test the function
print(validate_range(25, 0, 120, "age"))     # None (valid)
print(validate_range(-5, 0, 120, "age"))     # "age=-5 is out of range [0, 120]"
print(validate_range(105, 0, 100, "score"))  # "score=105 is out of range [0, 100]"

# Range validation on a pandas DataFrame
df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [20, -3, 22, 200],
    'grade': [85, 92, 110, 78]
})

# Flag rows with out-of-range ages (valid: 0-120)
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
print("Invalid ages:")
print(invalid_age)
#    student  age  grade
# 1      Bob   -3     92
# 3    Diana  200     78

# Flag rows with out-of-range grades (valid: 0-100)
invalid_grade = df[(df['grade'] < 0) | (df['grade'] > 100)]
print("Invalid grades:")
print(invalid_grade)
#    student  age  grade
# 2  Charlie   22    110

# Using .between() for cleaner range checks
df['age_valid'] = df['age'].between(0, 120)
df['grade_valid'] = df['grade'].between(0, 100)
print(df)
Common Pitfall: Be careful with floating-point comparisons. A value like 99.999999999 might technically be within range but could indicate a precision issue. Always consider how your data is produced when setting boundaries.
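One way to soften strict boundary comparisons is a tolerance-aware check with math.isclose() — a sketch, not a universal rule, since the right tolerance depends on how your data is produced:

```python
import math

value = 100.00000000001  # e.g. produced by summing many floats
upper_bound = 100.0

# Strict comparison flags the value as out of range
print(value <= upper_bound)  # False

# Tolerance-aware check treats values at (or negligibly past) the boundary as valid
within = value <= upper_bound or math.isclose(value, upper_bound, rel_tol=1e-9)
print(within)  # True
```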

Cross-Reference Checks: Validating Data Against Other Fields

Cross-field (or cross-reference) validation checks the logical relationship between two or more columns in a dataset. Unlike type or range checks that examine a single field in isolation, cross-field checks ensure that the combination of values makes sense.

Common examples include:

- end_date must be after start_date
- A project with status 'active' must have a positive budget
- A discount must not exceed the order total it applies to

import pandas as pd

# Cross-field validation: end_date > start_date
df = pd.DataFrame({
    'project': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'start_date': pd.to_datetime(
        ['2024-01-01', '2024-03-15', '2024-06-01', '2024-09-01']
    ),
    'end_date': pd.to_datetime(
        ['2024-06-30', '2024-02-28', '2024-12-31', '2024-08-15']
    )
})

# Find rows where end_date is before start_date
invalid_dates = df[df['end_date'] < df['start_date']]
print("Projects with invalid date ranges:")
print(invalid_dates)
#   project start_date   end_date
# 1    Beta 2024-03-15 2024-02-28
# 3   Delta 2024-09-01 2024-08-15

# Cross-field validation function for multiple rules
def validate_cross_fields(df):
    """Apply multiple cross-field validation rules."""
    errors = []

    # Rule 1: end_date must be >= start_date
    mask = df['end_date'] < df['start_date']
    if mask.any():
        bad_rows = df[mask].index.tolist()
        errors.append(f"end_date < start_date at rows: {bad_rows}")

    # Rule 2: budget must be positive if status is 'active'
    if 'budget' in df.columns and 'status' in df.columns:
        mask2 = (df['status'] == 'active') & (df['budget'] <= 0)
        if mask2.any():
            bad_rows = df[mask2].index.tolist()
            errors.append(f"Active projects with non-positive budget at rows: {bad_rows}")

    return errors

errors = validate_cross_fields(df)
for e in errors:
    print(f"VALIDATION ERROR: {e}")
Real-World Pattern: In production pipelines, cross-field validation rules are often stored in a configuration file or database table so they can be updated without changing code. Each rule is a function or expression that returns True/False for each row.
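As a minimal sketch of that pattern (the rule names and the RULES list are hypothetical — in production they might be loaded from YAML or a database table), rules can be kept as data and applied generically:

```python
import pandas as pd

# Each rule has a name and a vectorized check that returns True for rows that PASS
RULES = [
    {'name': 'dates_ordered', 'check': lambda df: df['end_date'] >= df['start_date']},
    {'name': 'amount_positive', 'check': lambda df: df['amount'] > 0},
]

def apply_rules(df, rules):
    """Return {rule_name: list of failing row indices}."""
    failures = {}
    for rule in rules:
        passed = rule['check'](df)
        bad = df.index[~passed].tolist()
        if bad:
            failures[rule['name']] = bad
    return failures

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2024-01-01', '2024-05-01']),
    'end_date': pd.to_datetime(['2024-02-01', '2024-04-01']),
    'amount': [100, -5],
})
print(apply_rules(df, RULES))
# {'dates_ordered': [1], 'amount_positive': [1]}
```

Adding or retiring a rule then means editing the rule table, not the pipeline code.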

Benefits of Early Type Checks in Ingestion Scripts

Performing validation as early as possible in a data pipeline — ideally during ingestion (the point where data first enters your system) — offers several critical advantages:

Benefit              | Explanation
---------------------|------------------------------------------------------------------------------------------------
Fail fast            | Bad data is caught immediately, before it contaminates downstream processes or storage
Lower debugging cost | Errors detected at the source are much easier to trace and fix than errors found in final reports
Data trust           | Analysts and stakeholders can trust that data passing validation is clean and consistent
Reduced rework       | Catching a type mismatch during ingestion prevents hours of wasted analysis on corrupt data
Cleaner pipelines    | Downstream code can assume correct types, reducing defensive programming throughout the codebase
# Example: validation during data ingestion
import pandas as pd

def ingest_csv(filepath, schema):
    """Load a CSV with upfront validation.

    Parameters:
        filepath (str): Path to the CSV file
        schema (dict): Expected column names mapped to dtypes

    Returns:
        pd.DataFrame: Validated DataFrame

    Raises:
        ValueError: If validation fails
    """
    df = pd.read_csv(filepath)

    # Step 1: Check all required columns exist
    missing_cols = set(schema.keys()) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing columns: {missing_cols}")

    # Step 2: Convert and validate dtypes
    for col, dtype in schema.items():
        try:
            if dtype == 'datetime':
                df[col] = pd.to_datetime(df[col])
            elif dtype == 'numeric':
                df[col] = pd.to_numeric(df[col])
            else:
                df[col] = df[col].astype(dtype)
        except (ValueError, TypeError) as e:
            raise ValueError(f"Column '{col}' failed {dtype} conversion: {e}")

    # Step 3: Check for unexpected nulls
    null_counts = df[list(schema.keys())].isnull().sum()
    cols_with_nulls = null_counts[null_counts > 0]
    if not cols_with_nulls.empty:
        print(f"WARNING: Null values found:\n{cols_with_nulls}")

    return df

# Usage
schema = {
    'employee_id': 'int64',
    'name': 'object',
    'salary': 'numeric',
    'hire_date': 'datetime'
}
# df = ingest_csv('employees.csv', schema)

Using Assert Statements and Try/Except for Validation

Python provides two built-in mechanisms that are commonly used for data validation:

Assert Statements

An assert statement tests a condition and raises an AssertionError if the condition is False. Assertions are ideal for catching programming errors and enforcing assumptions during development.

# Assert statements for quick validation checks

# Basic assertions
data = {'age': 25, 'name': 'Alice', 'score': 88}
assert isinstance(data['age'], int), "Age must be an integer"
assert 0 <= data['age'] <= 120, "Age must be between 0 and 120"
assert len(data['name']) > 0, "Name cannot be empty"
assert 0 <= data['score'] <= 100, "Score must be between 0 and 100"

# Schema validation with assertions
def validate_schema(df, required_columns):
    """Assert that a DataFrame contains all required columns."""
    for col in required_columns:
        assert col in df.columns, f"Missing required column: {col}"
    print("Schema validation passed!")

# DataFrame shape assertions
import pandas as pd
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
assert df.shape[0] > 0, "DataFrame must not be empty"
assert df.shape[1] == 2, "DataFrame must have exactly 2 columns"
Important Caveat: assert statements are removed when Python runs with the -O (optimize) flag. Never use assert for validation that must always run in production. Use explicit if/raise instead.
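A production-safe equivalent of the assert-based checks above uses explicit if/raise, which always runs even under python -O (the function name here is illustrative):

```python
def validate_age(age):
    """Raise if age is not a valid integer in [0, 120]; otherwise return it."""
    if not isinstance(age, int):
        raise TypeError(f"age must be an integer, got {type(age).__name__}")
    if not 0 <= age <= 120:
        raise ValueError(f"age={age} is out of range [0, 120]")
    return age

validate_age(25)  # passes silently

try:
    validate_age(200)
except ValueError as e:
    print(f"Caught: {e}")  # Caught: age=200 is out of range [0, 120]
```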

Try/Except for Robust Validation

The try/except pattern handles validation errors gracefully without crashing the program. This is essential in production data pipelines where you want to log errors and continue processing valid records.

import pandas as pd

# Try/except for type conversion validation
def safe_convert_to_numeric(series, column_name):
    """Attempt to convert a Series to numeric, returning a report."""
    try:
        converted = pd.to_numeric(series, errors='raise')
        print(f"Column '{column_name}': successfully converted to numeric")
        return converted
    except ValueError as e:
        print(f"Column '{column_name}': conversion failed - {e}")
        # Coerce bad values to NaN instead of failing
        converted = pd.to_numeric(series, errors='coerce')
        bad_count = converted.isna().sum() - series.isna().sum()
        print(f"  {bad_count} values coerced to NaN")
        return converted

# Example with mixed data
s = pd.Series(['10', '20', 'thirty', '40', 'N/A'])
result = safe_convert_to_numeric(s, 'quantity')
print(result)
# 0    10.0
# 1    20.0
# 2     NaN
# 3    40.0
# 4     NaN

# Comprehensive validation with try/except
def validate_and_clean(records):
    """Validate a list of records, returning valid and invalid sets."""
    valid = []
    invalid = []
    for i, rec in enumerate(records):
        try:
            # Type checks
            assert isinstance(rec['id'], int), "id must be int"
            assert isinstance(rec['value'], (int, float)), "value must be numeric"
            # Range checks
            assert rec['value'] >= 0, "value must be non-negative"
            valid.append(rec)
        except (AssertionError, KeyError) as e:
            invalid.append({'index': i, 'record': rec, 'error': str(e)})
    print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")
    return valid, invalid

records = [
    {'id': 1, 'value': 100},
    {'id': 'two', 'value': 200},  # bad id type
    {'id': 3, 'value': -50},      # negative value
    {'id': 4, 'value': 400},
]
valid, invalid = validate_and_clean(records)
# Valid: 2, Invalid: 2

1.3.2 Data Integrity

Understand entity, referential, and domain integrity; schema validation; and data quality frameworks.

Concept of Data Integrity and Its Importance

Data integrity refers to the accuracy, consistency, and reliability of data throughout its lifecycle — from creation and storage through processing and retrieval. While validation checks individual values, integrity focuses on the broader question: "Can we trust this data as a whole?"

Data integrity encompasses four key dimensions:

Integrity Type         | Definition                                          | Enforced By
-----------------------|-----------------------------------------------------|--------------------------------
Entity integrity       | Every row is uniquely identifiable                  | Primary keys
Referential integrity  | Relationships between tables are consistent         | Foreign keys
Domain integrity       | Values belong to a defined set of acceptable values | Constraints, enums, check rules
User-defined integrity | Business-specific rules are satisfied               | Custom validation logic
Why Data Integrity Matters: Without integrity, analysis results are unreliable. A broken foreign key means your JOIN produces incorrect counts. A duplicated primary key means aggregations are inflated. Violations cascade through every query and report that touches the data.

Referential Integrity in Databases

Referential integrity ensures that relationships between tables remain valid. If table B references table A through a foreign key, then every foreign key value in table B must correspond to an existing primary key value in table A.

Common violations include:

- Orphan records: a child row references a parent key that does not exist
- Deleting a parent row while child rows still reference it
- Updating a primary key without updating the foreign keys that point to it

import pandas as pd

# Simulating referential integrity checks with pandas

# Parent table: customers
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

# Child table: orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 5, 3, 99],  # 5 and 99 don't exist!
    'amount': [250, 180, 340, 120, 500]
})

# Check referential integrity: all order customer_ids must exist in customers
valid_ids = set(customers['customer_id'])
order_ids = set(orders['customer_id'])
orphan_ids = order_ids - valid_ids

if orphan_ids:
    print("REFERENTIAL INTEGRITY VIOLATION!")
    print(f"Orphan customer_ids in orders: {orphan_ids}")
    orphan_orders = orders[orders['customer_id'].isin(orphan_ids)]
    print(orphan_orders)
else:
    print("Referential integrity: OK")

# Output:
# REFERENTIAL INTEGRITY VIOLATION!
# Orphan customer_ids in orders: {99, 5}
#    order_id  customer_id  amount
# 2       103            5     340
# 4       105           99     500

# Reusable referential integrity checker
def check_referential_integrity(child_df, child_col, parent_df, parent_col):
    """Check if all values in child_col exist in parent_col."""
    child_values = set(child_df[child_col].dropna())
    parent_values = set(parent_df[parent_col])
    orphans = child_values - parent_values
    if orphans:
        return {
            'valid': False,
            'orphan_values': orphans,
            'orphan_count': len(orphans)
        }
    return {'valid': True}

Entity Integrity (Primary Keys)

Entity integrity requires that every table has a primary key and that the primary key is unique and not null. This ensures that every record can be uniquely identified. Violations include duplicate IDs and null primary keys.

import pandas as pd

# Entity integrity checks
df = pd.DataFrame({
    'employee_id': [1, 2, 3, 2, None, 5],  # duplicate 2, null value
    'name': ['Alice', 'Bob', 'Charlie', 'Bob2', 'Eve', 'Frank'],
    'department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'IT']
})

def check_entity_integrity(df, pk_column):
    """Verify entity integrity for a given primary key column."""
    issues = []

    # Check 1: Primary key must not contain nulls
    null_count = df[pk_column].isna().sum()
    if null_count > 0:
        issues.append(f"NULL values in primary key: {null_count} row(s)")

    # Check 2: Primary key must be unique
    duplicate_count = df[pk_column].duplicated().sum()
    if duplicate_count > 0:
        dupes = df[df[pk_column].duplicated(keep=False)]
        issues.append(f"Duplicate primary keys: {duplicate_count} duplicate row(s)")
        issues.append(f"Duplicate values: {df[pk_column][df[pk_column].duplicated()].tolist()}")

    if issues:
        print("ENTITY INTEGRITY VIOLATIONS:")
        for issue in issues:
            print(f"  - {issue}")
    else:
        print("Entity integrity: OK")
    return len(issues) == 0

check_entity_integrity(df, 'employee_id')
# ENTITY INTEGRITY VIOLATIONS:
#   - NULL values in primary key: 1 row(s)
#   - Duplicate primary keys: 1 duplicate row(s)
#   - Duplicate values: [2.0]

Domain Integrity (Valid Values)

Domain integrity ensures that all values in a column come from a defined set of acceptable values (the "domain"). This includes data type restrictions, allowable value ranges, format patterns, and enumerated sets.

import pandas as pd

# Domain integrity: Allowed values for categorical columns
VALID_DEPARTMENTS = {'HR', 'IT', 'Sales', 'Marketing', 'Finance'}
VALID_STATUSES = {'active', 'inactive', 'on_leave'}

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'department': ['HR', 'IT', 'Accounting', 'Sales'],  # 'Accounting' not valid!
    'status': ['active', 'fired', 'active', 'on_leave'],  # 'fired' not valid!
    'salary': [55000, 62000, -1000, 48000]  # negative salary!
})

def check_domain_integrity(df, domain_rules):
    """Check domain integrity against a set of rules.

    domain_rules: dict with column names as keys and either:
        - a set of valid values (for categorical columns)
        - a tuple (min, max) for numeric ranges
    """
    violations = []
    for col, rule in domain_rules.items():
        if col not in df.columns:
            violations.append(f"Column '{col}' not found")
            continue
        if isinstance(rule, set):
            # Categorical check: values must be in the allowed set
            invalid = df[~df[col].isin(rule)]
            if not invalid.empty:
                bad_vals = invalid[col].unique().tolist()
                violations.append(
                    f"'{col}' contains invalid values: {bad_vals}"
                )
        elif isinstance(rule, tuple) and len(rule) == 2:
            # Numeric range check
            min_val, max_val = rule
            invalid = df[~df[col].between(min_val, max_val)]
            if not invalid.empty:
                violations.append(
                    f"'{col}' has {len(invalid)} values outside [{min_val}, {max_val}]"
                )
    return violations

# Define domain rules
rules = {
    'department': VALID_DEPARTMENTS,
    'status': VALID_STATUSES,
    'salary': (0, 500000)
}

violations = check_domain_integrity(df, rules)
for v in violations:
    print(f"DOMAIN VIOLATION: {v}")
# DOMAIN VIOLATION: 'department' contains invalid values: ['Accounting']
# DOMAIN VIOLATION: 'status' contains invalid values: ['fired']
# DOMAIN VIOLATION: 'salary' has 1 values outside [0, 500000]

Schema Validation Approaches

A data schema defines the expected structure of a dataset: column names, data types, constraints, and relationships. Schema validation checks incoming data against this specification to ensure conformity.

Schema validation typically covers:

- Presence of all required columns
- Correct data types for each column
- Nullability (which columns may contain missing values)
- Uniqueness constraints
- Allowed values and numeric ranges

import pandas as pd

# Comprehensive schema validation for DataFrames
class DataFrameSchema:
    """A simple schema validator for pandas DataFrames."""

    def __init__(self):
        self.columns = {}

    def add_column(self, name, dtype=None, nullable=True, unique=False,
                   allowed_values=None, min_val=None, max_val=None):
        """Define expectations for a column."""
        self.columns[name] = {
            'dtype': dtype,
            'nullable': nullable,
            'unique': unique,
            'allowed_values': allowed_values,
            'min_val': min_val,
            'max_val': max_val
        }
        return self  # enable method chaining

    def validate(self, df):
        """Validate a DataFrame against this schema."""
        errors = []
        for col_name, rules in self.columns.items():
            # Check column exists
            if col_name not in df.columns:
                errors.append(f"Missing column: '{col_name}'")
                continue
            col = df[col_name]

            # Check dtype (only if one was specified)
            if rules['dtype'] is not None and str(col.dtype) != rules['dtype']:
                errors.append(f"'{col_name}' has dtype {col.dtype}, expected {rules['dtype']}")

            # Check nullable
            if not rules['nullable'] and col.isna().any():
                errors.append(f"'{col_name}' contains nulls but is non-nullable")

            # Check unique
            if rules['unique'] and col.duplicated().any():
                errors.append(f"'{col_name}' contains duplicate values but must be unique")

            # Check allowed values
            if rules['allowed_values'] is not None:
                invalid = col[~col.isin(rules['allowed_values']) & col.notna()]
                if not invalid.empty:
                    errors.append(
                        f"'{col_name}' has invalid values: {invalid.unique().tolist()}"
                    )

            # Check range
            if rules['min_val'] is not None:
                below = col[col < rules['min_val']]
                if not below.empty:
                    errors.append(f"'{col_name}' has values below {rules['min_val']}")
            if rules['max_val'] is not None:
                above = col[col > rules['max_val']]
                if not above.empty:
                    errors.append(f"'{col_name}' has values above {rules['max_val']}")
        return errors

# Define and use the schema
schema = DataFrameSchema()
schema.add_column('id', nullable=False, unique=True)
schema.add_column('name', nullable=False)
schema.add_column('age', min_val=0, max_val=120)
schema.add_column('status', allowed_values=['active', 'inactive'])

df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'name': ['Alice', None, 'Charlie', 'Diana'],
    'age': [25, 130, 35, -5],
    'status': ['active', 'inactive', 'pending', 'active']
})

errors = schema.validate(df)
for e in errors:
    print(f"SCHEMA ERROR: {e}")
# SCHEMA ERROR: 'id' contains duplicate values but must be unique
# SCHEMA ERROR: 'name' contains nulls but is non-nullable
# SCHEMA ERROR: 'age' has values below 0
# SCHEMA ERROR: 'age' has values above 120
# SCHEMA ERROR: 'status' has invalid values: ['pending']

Data Quality Frameworks

A data quality framework provides a structured approach to measuring, monitoring, and improving data quality across an organization. Rather than ad-hoc checks, a framework defines consistent dimensions of quality that all data should meet.

The most widely recognized data quality dimensions are:

Dimension    | Definition                                      | Example Check
-------------|-------------------------------------------------|----------------------------------------------
Accuracy     | Data correctly represents the real-world entity | Customer address matches postal records
Completeness | All required data is present                    | No null values in mandatory fields
Consistency  | Data does not contradict itself across sources  | Total in summary matches sum of line items
Timeliness   | Data is available when needed and is current    | Records are updated within 24 hours of change
Uniqueness   | No unintended duplicates exist                  | Each customer appears exactly once
Validity     | Data conforms to the expected format and rules  | Email addresses match a valid pattern
import pandas as pd

# Data quality assessment function
def assess_data_quality(df, pk_column=None):
    """Generate a data quality report for a DataFrame."""
    report = {}

    # 1. Completeness: percentage of non-null values per column
    completeness = ((1 - df.isnull().mean()) * 100).round(2)
    report['completeness'] = completeness.to_dict()

    # 2. Uniqueness: percentage of unique values per column
    uniqueness = ((df.nunique() / len(df)) * 100).round(2)
    report['uniqueness'] = uniqueness.to_dict()

    # 3. Entity integrity: duplicate check on primary key
    if pk_column and pk_column in df.columns:
        dup_count = df[pk_column].duplicated().sum()
        null_pk = df[pk_column].isna().sum()
        report['entity_integrity'] = {
            'duplicates': int(dup_count),
            'null_keys': int(null_pk),
            'valid': dup_count == 0 and null_pk == 0
        }

    # 4. Row count and shape
    report['total_rows'] = len(df)
    report['total_columns'] = len(df.columns)
    return report

# Example usage
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'score': [85, 92, 78, None, 95],
    'dept': ['HR', 'IT', 'IT', 'HR', 'IT']
})

quality = assess_data_quality(df, pk_column='id')
print("=== Data Quality Report ===")
print(f"Rows: {quality['total_rows']}, Columns: {quality['total_columns']}")
print("\nCompleteness (% non-null):")
for col, pct in quality['completeness'].items():
    print(f"  {col}: {pct}%")
print(f"\nEntity Integrity: {quality['entity_integrity']}")
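The report above covers completeness and uniqueness; the Validity dimension can be spot-checked the same way. Here is a sketch that flags malformed email addresses with a deliberately loose pattern (a rough filter, not a full RFC 5322 validator):

```python
import pandas as pd

emails = pd.Series(['alice@example.com', 'bob@', 'charlie@example.org', 'not-an-email'])

# Loose pattern: something@something.something, with no spaces or extra '@'
pattern = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'
valid = emails.str.match(pattern)
print(valid.tolist())
# [True, False, True, False]

validity_pct = round(valid.mean() * 100, 1)
print(f"Validity: {validity_pct}%")  # Validity: 50.0%
```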

Putting It All Together: Data Integrity Checks Before Database Insertion

In a real pipeline, you validate data comprehensively before inserting it into a database. Here is a complete example that combines type, range, cross-field, entity, domain, and referential checks:

import pandas as pd

def validate_for_insertion(df):
    """Run all validation checks before inserting data into a database.

    Returns a tuple: (is_valid: bool, errors: list, clean_df: DataFrame)
    """
    errors = []

    # 1. Entity integrity - check primary key
    if df['id'].isna().any():
        errors.append("Primary key 'id' contains NULL values")
    if df['id'].duplicated().any():
        dupes = df[df['id'].duplicated()]['id'].tolist()
        errors.append(f"Duplicate IDs found: {dupes}")

    # 2. Type validation
    try:
        df['salary'] = pd.to_numeric(df['salary'], errors='raise')
    except ValueError:
        errors.append("'salary' contains non-numeric values")

    # 3. Range validation
    if (df['salary'] < 0).any():
        errors.append("Negative salary values detected")

    # 4. Domain validation
    valid_depts = {'HR', 'IT', 'Sales', 'Marketing'}
    invalid_depts = set(df['department'].dropna()) - valid_depts
    if invalid_depts:
        errors.append(f"Invalid departments: {invalid_depts}")

    # 5. Cross-field validation
    if 'start_date' in df.columns and 'end_date' in df.columns:
        bad_dates = df[df['end_date'] < df['start_date']]
        if not bad_dates.empty:
            errors.append(f"end_date < start_date in {len(bad_dates)} rows")

    # 6. Completeness check
    required_cols = ['id', 'name', 'department', 'salary']
    for col in required_cols:
        null_count = df[col].isna().sum()
        if null_count > 0:
            errors.append(f"Required column '{col}' has {null_count} null(s)")

    is_valid = len(errors) == 0
    return is_valid, errors, df

# Test it
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['HR', 'IT', 'Sales'],
    'salary': [55000, 62000, 48000]
})

is_valid, errors, clean_df = validate_for_insertion(df)
if is_valid:
    print("All checks passed. Safe to insert into database.")
else:
    print("Validation failed:")
    for e in errors:
        print(f"  - {e}")
Exam Tip: You may encounter scenarios where you need to identify which type of integrity is being violated (entity, referential, or domain). Focus on the definitions: entity = unique identifiers, referential = valid relationships, domain = valid values within a column.

Practice Quiz — Topic 1.3

Test your understanding with 10 multiple-choice questions. The correct answer and an explanation follow each question.

Q1. Which validation method checks whether a value is of the expected data type (e.g., integer, string, float)?
A) Range validation
B) Type validation
C) Cross-field validation
D) Referential validation
Correct: B) Type validation. Type validation uses checks like isinstance() or pandas .dtypes to ensure values are of the correct data type. Range validation checks boundaries, cross-field checks relationships between columns, and referential validation checks foreign key relationships.
Q2. What will isinstance(3.14, (int, float)) return?
A) True
B) False
C) TypeError
D) None
Correct: A) True. When a tuple of types is passed as the second argument, isinstance() returns True if the object is an instance of any type in the tuple. Since 3.14 is a float, and float is in the tuple, the result is True.
Q3. You have an orders table where each order references a customer_id. Some orders reference customer IDs that do not exist in the customers table. Which type of integrity is violated?
A) Entity integrity
B) Referential integrity
C) Domain integrity
D) Type integrity
Correct: B) Referential integrity. Referential integrity ensures that foreign key values in a child table correspond to valid primary key values in the parent table. Orders referencing non-existent customers are "orphan records" that violate referential integrity.
Q4. What is the primary risk of using assert for validation in production code?
A) Assert statements are slower than if/else
B) Assert statements cannot include custom error messages
C) Assert statements are disabled when Python runs with the -O flag
D) Assert statements can only check boolean values
Correct: C) When Python is invoked with the -O (optimize) flag, all assert statements are stripped from the bytecode and never executed. This means validation logic inside assert silently disappears in optimized production builds. Use explicit if/raise for critical validation.
Q5. A dataset has a start_date and end_date column. You need to ensure that the end date is always after the start date. Which validation method applies?
A) Type validation
B) Range validation
C) Cross-field validation
D) Domain validation
Correct: C) Cross-field validation. Cross-field (cross-reference) validation checks the logical relationship between two or more fields. Verifying that end_date > start_date requires comparing values across two columns, which is the definition of cross-field validation.
Q6. Which of the following is an example of entity integrity violation?
A) A salary column contains the value "N/A"
B) Two rows in a table have the same primary key value
C) An order references a customer_id that does not exist
D) A temperature reading is 999 degrees Celsius
Correct: B) Two rows with the same primary key. Entity integrity requires that every row has a unique, non-null primary key. Duplicate primary keys mean two records cannot be distinguished, violating entity integrity. Option A is a domain issue, C is referential, and D is a range issue.
Q7. What does the pandas method df['age'].between(0, 120) return?
A) A filtered DataFrame with only rows where age is between 0 and 120
B) A boolean Series indicating whether each value is between 0 and 120 (inclusive)
C) The count of values between 0 and 120
D) A new Series with values clipped to the range [0, 120]
Correct: B) A boolean Series. The .between() method returns a boolean Series of the same length as the original, with True where the value falls within the specified range (inclusive by default) and False otherwise. It does not filter or modify the data.
Q8. A column called status should only contain values from the set {"active", "inactive", "on_leave"}. A row contains the value "terminated". Which integrity type is violated?
A) Entity integrity
B) Referential integrity
C) Domain integrity
D) Cross-field integrity
Correct: C) Domain integrity. Domain integrity ensures that all values in a column come from a defined set of acceptable values. "terminated" is not in the allowed set, so it violates the domain constraint for the status column.
Q9. What is the main benefit of performing data validation early in the data pipeline (at ingestion)?
A) It makes the code run faster
B) It eliminates the need for database constraints
C) It prevents bad data from propagating through downstream processes
D) It automatically corrects invalid data
Correct: C) It prevents bad data from propagating downstream. Early validation ("fail fast") catches errors at the source before they contaminate storage, analysis, and reports. It does not automatically fix data or replace database constraints, but it dramatically reduces debugging costs and rework.
Q10. In the following code, what happens when the conversion fails?
try:
    df['price'] = pd.to_numeric(df['price'], errors='raise')
except ValueError:
    df['price'] = pd.to_numeric(df['price'], errors='coerce')
A) The program crashes with a ValueError
B) Non-numeric values are dropped from the DataFrame
C) Non-numeric values are replaced with NaN
D) Non-numeric values are replaced with 0
Correct: C) Non-numeric values are replaced with NaN. The try block attempts strict conversion (errors='raise'), which raises a ValueError if any value cannot be converted. The except block catches that error and re-runs the conversion with errors='coerce', which converts invalid values to NaN instead of failing.
