Topic 2.3: Object-Oriented Programming for Data Modeling

2.3.1 Basic OOP for Data Modeling

Object-Oriented Programming (OOP) lets you bundle related data and behavior into classes. In data analysis workflows, classes model real-world entities such as records, responses, configurations, and exporters. Understanding how to define and use classes is essential for the exam.

Defining Classes with the `class` Keyword

A class is a blueprint for creating objects. You define a class using the class keyword, followed by a name (conventionally in PascalCase) and a colon:

class DataRecord:
    """Represents a single data record from a survey."""

    def __init__(self, record_id, values):
        # Constructor: called automatically when an object is created
        self.record_id = record_id    # instance variable
        self.values = values          # instance variable (a dict)
        self.is_valid = True          # default attribute
        self.invalid_reason = None    # filled in by invalidate()

    def field_count(self):
        """Return the number of fields in this record."""
        return len(self.values)

    def get_field(self, name, default=None):
        """Safely retrieve a field value."""
        return self.values.get(name, default)

    def invalidate(self, reason):
        """Mark the record as invalid."""
        self.is_valid = False
        self.invalid_reason = reason

# Creating instances
rec = DataRecord(101, {"name": "Alice", "score": 87})
print(rec.record_id)          # 101
print(rec.field_count())      # 2
print(rec.get_field("score")) # 87

Key Points for the Exam: __init__ is the constructor: it runs automatically when you create an object. Its first parameter, named self by convention, refers to the instance being created. Every attribute assigned to self becomes an instance variable that belongs to that object alone.
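
To make the last point concrete, the short sketch below builds two objects from a trimmed-down DataRecord (the variable names first and second are only illustrative) and shows that changing an attribute on one leaves the other untouched:

```python
class DataRecord:
    """Trimmed-down version of the class above, for illustration only."""

    def __init__(self, record_id, values):
        self.record_id = record_id
        self.values = values


# Each call to DataRecord(...) runs __init__ with a fresh `self`
first = DataRecord(1, {"score": 87})
second = DataRecord(2, {"score": 92})

first.record_id = 10     # rebinding an attribute on one instance only

print(first.record_id)   # 10
print(second.record_id)  # 2
print(first is second)   # False
```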

Instance Methods

Instance methods are functions defined inside a class that operate on a specific object. They always take self as their first parameter, giving them access to the object's attributes and other methods:

class DataRecord:
    def __init__(self, record_id, values):
        self.record_id = record_id
        self.values = values

    def summary(self):
        """Return a summary string for this record."""
        fields = ", ".join(self.values.keys())
        return f"Record {self.record_id}: [{fields}]"

    def merge(self, other_values):
        """Merge additional fields into the record."""
        self.values.update(other_values)

rec = DataRecord(1, {"age": 30})
rec.merge({"city": "Amman"})
print(rec.summary())  # Record 1: [age, city]

Encapsulation: `_protected` and `__private`

Python uses naming conventions rather than strict access modifiers to signal how attributes should be used:

Convention	Example	Meaning
No prefix	`self.name`	Public — part of the class's external API
Single underscore	`self._cache`	Protected — "internal, don't touch" (not enforced)
Double underscore	`self.__secret`	Private — triggers name mangling (`_ClassName__secret`)

class SensitiveRecord:
    def __init__(self, patient_id, diagnosis):
        self.patient_id = patient_id      # public
        self._diagnosis = diagnosis        # protected (convention only)
        self.__ssn = None                  # private (name-mangled)

    def set_ssn(self, ssn):
        if len(ssn) == 9 and ssn.isdigit():
            self.__ssn = ssn
        else:
            raise ValueError("Invalid SSN format")

    def get_masked_ssn(self):
        if self.__ssn:
            return "***-**-" + self.__ssn[-4:]
        return None

rec = SensitiveRecord("P100", "Flu")
rec.set_ssn("123456789")
print(rec.get_masked_ssn())  # ***-**-6789

# Direct access to __ssn fails:
# print(rec.__ssn)  -> AttributeError

# But name-mangled version still accessible (not truly private):
print(rec._SensitiveRecord__ssn)  # 123456789

Exam Watch: Name mangling with __ double underscores does NOT make an attribute truly private. Python rewrites self.__attr to self._ClassName__attr. This is designed to avoid accidental name collisions in inheritance, not for security.
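
The collision-avoidance purpose is easiest to see with a parent and a child that both use the same double-underscore name. The classes BaseLoader and CsvLoader below are hypothetical, made up for this sketch:

```python
class BaseLoader:
    def __init__(self):
        self.__token = "base"      # stored as _BaseLoader__token

    def base_token(self):
        return self.__token        # always reads _BaseLoader__token


class CsvLoader(BaseLoader):
    def __init__(self):
        super().__init__()
        self.__token = "csv"       # stored as _CsvLoader__token

    def csv_token(self):
        return self.__token        # reads _CsvLoader__token


loader = CsvLoader()
print(loader.base_token())   # base
print(loader.csv_token())    # csv
print(sorted(vars(loader)))  # ['_BaseLoader__token', '_CsvLoader__token']
```

Because each class gets its own mangled name, the subclass cannot overwrite the parent's attribute by accident. With a single underscore (self._token) the two assignments would target the same attribute.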

Getters and Setters with `@property`

The @property decorator lets you expose computed or validated attributes that look like simple attribute access from the outside:

class TemperatureReading:
    def __init__(self, celsius):
        self.celsius = celsius  # assign via the setter so the constructor validates too

    @property
    def celsius(self):
        """Getter: return the temperature in Celsius."""
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        """Setter: validate before storing."""
        if value < -273.15:
            raise ValueError("Temperature below absolute zero is impossible")
        self._celsius = value

    @property
    def fahrenheit(self):
        """Read-only computed property."""
        return self._celsius * 9 / 5 + 32

t = TemperatureReading(25)
print(t.celsius)      # 25   (uses the getter)
print(t.fahrenheit)   # 77.0 (computed property)

t.celsius = 30        # uses the setter with validation
print(t.fahrenheit)   # 86.0

# t.celsius = -300  -> ValueError
# t.fahrenheit = 50 -> AttributeError (read-only)

Why Use @property? Properties let you add validation, computation, or logging to attribute access without changing the caller's code. Code that used obj.celsius before the property was added continues to work unchanged.
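
As a rough illustration of that claim, the sketch below defines two hypothetical versions of a reading class (ReadingV1 and ReadingV2 are names invented here) and one caller function, warm_up, that works with both without modification:

```python
class ReadingV1:
    """Version 1: plain public attribute."""

    def __init__(self, celsius):
        self.celsius = celsius


class ReadingV2:
    """Version 2: same public name, now backed by a validating property."""

    def __init__(self, celsius):
        self.celsius = celsius      # goes through the setter below

    @property
    def celsius(self):
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        if value < -273.15:
            raise ValueError("Temperature below absolute zero is impossible")
        self._celsius = value


def warm_up(reading, delta):
    """Caller code: written once, works with either version."""
    reading.celsius = reading.celsius + delta
    return reading.celsius


print(warm_up(ReadingV1(20), 5))  # 25
print(warm_up(ReadingV2(20), 5))  # 25
```

In ReadingV2 the constructor assigns through self.celsius rather than self._celsius, so an invalid value is rejected at construction time as well as on later assignments.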

Managing Internal Object State

Objects maintain their own internal state, which methods can modify over time. This pattern is common in data pipelines where records pass through processing stages:

class PipelineRecord:
    def __init__(self, raw_data):
        self.raw_data = raw_data
        self._cleaned = False
        self._validated = False
        self._errors = []

    def clean(self):
        # Strip whitespace from string values
        self.raw_data = {
            k: v.strip() if isinstance(v, str) else v
            for k, v in self.raw_data.items()
        }
        self._cleaned = True

    def validate(self, required_fields):
        for field in required_fields:
            if field not in self.raw_data:
                self._errors.append(f"Missing: {field}")
        self._validated = True

    @property
    def is_ready(self):
        return self._cleaned and self._validated and not self._errors

rec = PipelineRecord({"name": "  Alice  ", "score": 92})
rec.clean()
rec.validate(["name", "score"])
print(rec.is_ready)   # True
print(rec.raw_data)   # {'name': 'Alice', 'score': 92}

2.3.2 OOP Patterns for Analysis Workflows

Real data projects often combine multiple classes. Three key OOP patterns appear in this course: composition, inheritance, and polymorphism.

Composition: Nesting Objects Inside Objects

Composition means one class contains instances of another class as attributes. This is a "has-a" relationship. For example, a Survey has multiple Question objects:

class Question:
    def __init__(self, text, question_type="text"):
        self.text = text
        self.question_type = question_type
        self.responses = []

    def add_response(self, response):
        self.responses.append(response)

    def response_count(self):
        return len(self.responses)


class Survey:
    def __init__(self, title):
        self.title = title
        self.questions = []   # will hold Question objects

    def add_question(self, text, question_type="text"):
        q = Question(text, question_type)
        self.questions.append(q)
        return q

    def total_responses(self):
        return sum(q.response_count() for q in self.questions)

    def summary(self):
        return {
            "title": self.title,
            "num_questions": len(self.questions),
            "total_responses": self.total_responses()
        }


# Usage
survey = Survey("Customer Satisfaction")
q1 = survey.add_question("How would you rate our service?", "rating")
q2 = survey.add_question("Any additional comments?", "text")

q1.add_response(5)
q1.add_response(4)
q2.add_response("Great experience!")

print(survey.summary())
# {'title': 'Customer Satisfaction', 'num_questions': 2, 'total_responses': 3}

Composition vs Inheritance: Use composition when one object contains another (Survey has Questions). Use inheritance when one object is a specialized version of another (CSVExporter is a BaseExporter). The course exam tests both patterns.

Inheritance: Base Classes and Subclasses

Inheritance creates an "is-a" relationship. A subclass inherits all methods and attributes from its parent and can override or extend them:

class BaseExporter:
    """Base class for all data exporters."""

    def __init__(self, data):
        self.data = data        # list of dicts
        self._exported = False

    def validate(self):
        """Check that data is non-empty."""
        if not self.data:
            raise ValueError("No data to export")
        return True

    def export(self):
        """Subclasses must override this method."""
        raise NotImplementedError("Subclasses must implement export()")

    def log_export(self, format_name):
        self._exported = True
        print(f"Exported {len(self.data)} records as {format_name}")


class CSVExporter(BaseExporter):
    """Exports data as CSV text."""

    def export(self):
        self.validate()
        headers = ",".join(self.data[0].keys())
        rows = [",".join(str(v) for v in row.values()) for row in self.data]
        result = headers + "\n" + "\n".join(rows)
        self.log_export("CSV")
        return result


import json


class JSONExporter(BaseExporter):
    """Exports data as a JSON string."""

    def __init__(self, data, indent=2):
        super().__init__(data)  # call parent constructor
        self.indent = indent

    def export(self):
        self.validate()
        result = json.dumps(self.data, indent=self.indent)
        self.log_export("JSON")
        return result


# Usage
data = [{"name": "Alice", "score": 87}, {"name": "Bob", "score": 92}]

csv_out = CSVExporter(data)
print(csv_out.export())
# Exported 2 records as CSV
# name,score
# Alice,87
# Bob,92

json_out = JSONExporter(data, indent=4)
print(json_out.export())
# Exported 2 records as JSON
# [
#     {
#         "name": "Alice",
#         "score": 87
#     },
#     ...
# ]
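
The "is-a" relationship can be checked directly with the built-ins isinstance() and issubclass(). The sketch below uses a minimal stand-in for the exporter classes; the record_count method is a hypothetical helper added only for this illustration:

```python
class BaseExporter:
    """Minimal stand-in for the base class above."""

    def __init__(self, data):
        self.data = data

    def record_count(self):          # hypothetical helper, not in the lesson
        return len(self.data)


class CSVExporter(BaseExporter):
    pass                             # defines nothing of its own


exp = CSVExporter([{"name": "Alice"}])

print(exp.record_count())                         # 1     (inherited method)
print(isinstance(exp, CSVExporter))               # True
print(isinstance(exp, BaseExporter))              # True  (a CSVExporter is a BaseExporter)
print(issubclass(CSVExporter, BaseExporter))      # True
print(isinstance(BaseExporter([]), CSVExporter))  # False (not the other way around)
```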

Method Overriding

When a subclass defines a method with the same name as one in the parent class, the subclass version overrides the parent. Use super() when you still need to call the parent's version:

class MarkdownExporter(BaseExporter):
    def validate(self):
        # Override: add extra validation, then call parent
        super().validate()
        for row in self.data:
            if not isinstance(row, dict):
                raise TypeError("Each row must be a dict")
        return True

    def export(self):
        self.validate()
        headers = list(self.data[0].keys())
        lines = ["| " + " | ".join(headers) + " |"]
        lines.append("| " + " | ".join(["---"] * len(headers)) + " |")
        for row in self.data:
            lines.append("| " + " | ".join(str(v) for v in row.values()) + " |")
        self.log_export("Markdown")
        return "\n".join(lines)

Polymorphism: Same Interface, Different Behavior

Polymorphism means calling the same method name on different objects and getting behavior specific to each object's class. This is powerful in data pipelines where you process items uniformly without knowing their exact type:

def run_export_pipeline(exporters):
    """Process any list of exporters - polymorphism in action."""
    results = {}
    for exporter in exporters:
        # Each exporter has .export() but implements it differently
        class_name = type(exporter).__name__
        results[class_name] = exporter.export()
    return results


data = [{"name": "Alice", "score": 87}, {"name": "Bob", "score": 92}]

exporters = [
    CSVExporter(data),
    JSONExporter(data),
    MarkdownExporter(data),
]

# Polymorphic call: same .export() method, different output formats
all_results = run_export_pipeline(exporters)
for fmt, output in all_results.items():
    print(f"\n--- {fmt} ---")
    print(output)

Polymorphism in Practice: In the example above, run_export_pipeline does not need to check the type of each exporter. It simply calls .export(), and each subclass provides its own implementation. This is the core idea behind polymorphism and a common exam topic.

Real-World Data Workflow Example

The following example combines composition, inheritance, and polymorphism in a single data pipeline:

class DataProcessor:
    """Base processor - defines the pipeline interface."""

    def process(self, value):
        raise NotImplementedError


class Trimmer(DataProcessor):
    def process(self, value):
        return value.strip() if isinstance(value, str) else value


class UpperCaser(DataProcessor):
    def process(self, value):
        return value.upper() if isinstance(value, str) else value


class Pipeline:
    """Composition: a Pipeline HAS processors."""

    def __init__(self, processors):
        self.processors = processors  # list of DataProcessor objects

    def run(self, value):
        for proc in self.processors:
            value = proc.process(value)  # polymorphic call
        return value


pipe = Pipeline([Trimmer(), UpperCaser()])
print(pipe.run("  hello world  "))  # HELLO WORLD

2.3.3 Object Identity and Comparisons

Understanding the difference between object identity and equality is crucial for avoiding subtle bugs and is a frequent exam topic.

Reference Variables: Shared vs Independent Objects

When you assign an object to a new variable, you create a new reference to the same object in memory, not a copy:

class Bucket:
    def __init__(self, items):
        self.items = items

# Both variables point to the SAME object
a = Bucket([1, 2, 3])
b = a                      # b is an alias for a

b.items.append(4)
print(a.items)             # [1, 2, 3, 4]  - a is also affected!
print(a is b)             # True  - same object in memory

# To make an independent copy:
import copy
c = copy.copy(a)           # shallow copy
c.items.append(5)
print(a.items)             # [1, 2, 3, 4, 5]  - still shared list! (shallow)

d = copy.deepcopy(a)       # deep copy - fully independent
d.items.append(6)
print(a.items)             # [1, 2, 3, 4, 5] - a is NOT affected
print(d.items)             # [1, 2, 3, 4, 5, 6]

Aliasing Pitfall: Assigning one variable to another never copies the object; it creates a second reference to it. When the object holds mutable attributes (lists, dicts), a change made through one reference is visible through every alias. This is one of the most common sources of bugs in data pipelines.

Mutation of Lists Inside Objects (Aliasing)

A particularly tricky case occurs when a mutable list is passed into an object's constructor:

shared_list = [1, 2, 3]

class Container:
    def __init__(self, data):
        self.data = data  # stores a REFERENCE, not a copy

c1 = Container(shared_list)
c2 = Container(shared_list)

c1.data.append(99)

print(c2.data)         # [1, 2, 3, 99]  - c2 is also affected!
print(shared_list)     # [1, 2, 3, 99]  - original list too!

# Fix: copy the list in the constructor
class SafeContainer:
    def __init__(self, data):
        self.data = list(data)  # make a copy!

s1 = SafeContainer([10, 20])
s2 = SafeContainer([10, 20])
s1.data.append(30)
print(s2.data)         # [10, 20]  - unaffected
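
One caveat worth keeping in mind: list(data) makes a shallow copy, so nested lists inside it are still shared. The sketch below contrasts it with a hypothetical DeepContainer variant (not part of the lesson) that uses copy.deepcopy:

```python
import copy


class SafeContainer:
    def __init__(self, data):
        self.data = list(data)            # shallow copy, as in the lesson


class DeepContainer:                      # hypothetical variant for comparison
    def __init__(self, data):
        self.data = copy.deepcopy(data)   # copies nested objects too


rows = [[1, 2], [3, 4]]
shallow = SafeContainer(rows)
deep = DeepContainer(rows)

rows.append([5, 6])      # outer list changes: neither container sees it
rows[0].append(99)       # inner list changes: only the shallow copy sees it

print(shallow.data)  # [[1, 2, 99], [3, 4]]
print(deep.data)     # [[1, 2], [3, 4]]
```

For flat lists of immutable values (numbers, strings), the shallow copy is sufficient and cheaper.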

Comparing with `==` vs `is`

Python provides two distinct comparison operators:

Operator	Checks	Question It Answers
`==`	Equality (value)	"Do these have the same content?"
`is`	Identity (memory address)	"Are these the exact same object?"

a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)    # True  - same values
print(a is b)    # False - different objects in memory
print(a is c)    # True  - same object (c is an alias for a)

# Special case: None should always be checked with 'is'
x = None
print(x is None)    # True  (preferred)
print(x == None)   # True  (works but not Pythonic)

# Small integers are cached by Python (implementation detail):
a = 256
b = 256
print(a is b)    # True  (cached)

a = 257
b = 257
print(a is b)    # May be False (not guaranteed to be cached)

Exam Rule of Thumb: Use is only for None checks (if x is None) and when you specifically need identity. Use == for all value comparisons. Never rely on is for integer or string comparison in production code.
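
A small sketch of the rule in a filtering task follows. The find_status function and the sample records are invented for illustration. One status string is deliberately built at runtime; whether such a string happens to be the same object as the literal "done" depends on the interpreter, which is why the filter compares with ==:

```python
def find_status(records, wanted):
    """Return ids of records whose status equals `wanted` (value comparison)."""
    return [r["id"] for r in records if r["status"] == wanted]


records = [
    {"id": 1, "status": "".join(["do", "ne"])},   # "done", built at runtime
    {"id": 2, "status": "pending"},
    {"id": 3, "status": None},
]

print(find_status(records, "done"))  # [1]

# Identity is the right tool for the None sentinel
missing = [r["id"] for r in records if r["status"] is None]
print(missing)  # [3]
```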

Custom Equality with `__eq__()`

By default, == on custom objects checks identity (same as is). You can override this by implementing the __eq__ dunder method:

class DataPoint:
    def __init__(self, x, y, label=""):
        self.x = x
        self.y = y
        self.label = label

    def __eq__(self, other):
        """Two DataPoints are equal if they have the same coordinates."""
        if not isinstance(other, DataPoint):
            return NotImplemented
        return self.x == other.x and self.y == other.y

    def __repr__(self):
        return f"DataPoint(x={self.x}, y={self.y}, label='{self.label}')"

    def __str__(self):
        return f"({self.x}, {self.y})"


p1 = DataPoint(3, 5, "origin")
p2 = DataPoint(3, 5, "copy")
p3 = DataPoint(1, 2)

# __eq__ compares x and y only (not label)
print(p1 == p2)    # True   (same coordinates)
print(p1 == p3)    # False  (different coordinates)
print(p1 is p2)    # False  (different objects in memory)

# __repr__ vs __str__
print(repr(p1))    # DataPoint(x=3, y=5, label='origin')
print(str(p1))     # (3, 5)
print(p1)          # (3, 5)  - print() calls __str__

# __repr__ is used in lists and debugging
print([p1, p3])    # [DataPoint(x=3, y=5, label='origin'), DataPoint(x=1, y=2, label='')]
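
Two side effects of defining __eq__ are easy to miss. Returning NotImplemented lets Python handle comparisons with unrelated types, and a class that defines __eq__ without __hash__ becomes unhashable. The sketch below uses a shortened DataPoint plus a hypothetical HashablePoint subclass to show both:

```python
class DataPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, DataPoint):
            return NotImplemented
        return self.x == other.x and self.y == other.y


class HashablePoint(DataPoint):          # hypothetical variant
    def __hash__(self):
        return hash((self.x, self.y))    # hash the same fields __eq__ compares


p = DataPoint(3, 5)

# Both operands decline the comparison, so Python falls back to identity
print(p == "3,5")                  # False
print(DataPoint.__hash__ is None)  # True: defining __eq__ alone removes hashing

try:
    {p}                            # sets and dict keys need a hash
except TypeError:
    print("DataPoint is unhashable")

unique = {HashablePoint(3, 5), HashablePoint(3, 5)}
print(len(unique))                 # 1
```

If objects need to go into sets or serve as dict keys, define __hash__ over the same fields that __eq__ compares, and avoid mutating those fields afterwards.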

`__repr__` and `__str__` Methods

Method	Purpose	Called By
`__repr__`	Unambiguous, developer-facing representation	`repr()`, interactive shell, lists
`__str__`	Readable, user-facing representation	`str()`, `print()`

__repr__ vs __str__: If you implement only one, implement __repr__. Python falls back to __repr__ when __str__ is not defined, but not the other way around. The __repr__ output should ideally be valid Python that could recreate the object.
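
The one-way fallback can be checked with two throwaway classes (OnlyRepr and OnlyStr are names made up for this sketch), each defining just one of the two methods:

```python
class OnlyRepr:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"OnlyRepr({self.value!r})"


class OnlyStr:
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return f"value={self.value}"


a = OnlyRepr("x")
b = OnlyStr("x")

print(str(a))    # OnlyRepr('x')   (str() falls back to __repr__)
print(f"{a}")    # OnlyRepr('x')   (so do f-strings)
print(str(b))    # value=x

# repr() ignores __str__ and uses the default <... object at 0x...> form
print(repr(b).startswith("<"))  # True
```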

A complete example putting identity and comparison together:

class SurveyResponse:
    def __init__(self, respondent_id, answers):
        self.respondent_id = respondent_id
        self.answers = dict(answers)  # defensive copy

    def __eq__(self, other):
        if not isinstance(other, SurveyResponse):
            return NotImplemented
        return self.respondent_id == other.respondent_id

    def __repr__(self):
        return f"SurveyResponse({self.respondent_id!r}, {self.answers!r})"

    def __str__(self):
        return f"Response from {self.respondent_id} ({len(self.answers)} answers)"


r1 = SurveyResponse("U100", {"q1": 5, "q2": 3})
r2 = SurveyResponse("U100", {"q1": 4, "q2": 2})

print(r1 == r2)    # True  (same respondent_id)
print(r1 is r2)    # False (different objects)
print(r1)          # Response from U100 (2 answers)
print(repr(r1))    # SurveyResponse('U100', {'q1': 5, 'q2': 3})

Practice Quiz: OOP for Data Modeling

Q1. What is the purpose of the `__init__` method in a Python class?

A) It deletes the object from memory when it is no longer needed

B) It is the constructor that initializes instance attributes when an object is created

C) It is a static method that creates the class itself

D) It defines class-level variables shared by all instances

Explanation: __init__ is the constructor method. It is automatically called when a new instance is created and is used to set up the object's initial state through self.attribute = value assignments.

Q2. An instance `obj` of class `MyClass` sets `self.__secret` in its constructor. What happens when you access `obj.__secret` from outside the class?

A) It returns None

B) It raises an AttributeError because of name mangling

C) It works perfectly since Python has no private attributes

D) It raises a PermissionError

Explanation: Double-underscore attributes trigger name mangling. Python renames __secret to _MyClass__secret internally, so accessing obj.__secret directly raises AttributeError. The attribute still exists as obj._MyClass__secret.

Q3. Which decorator turns a method into a property that can be accessed like an attribute?

A) @staticmethod

B) @classmethod

C) @property

D) @attribute

Explanation: The @property decorator defines a getter that allows a method to be accessed using attribute syntax (e.g., obj.name instead of obj.name()). You can also define a setter with @name.setter.

Q4. Consider the code below. What does it print?

class Box:
    def __init__(self, items):
        self.items = items

a = Box([1, 2])
b = a
b.items.append(3)
print(a.items)

A) [1, 2]

B) [1, 2, 3]

C) [3]

D) An error is raised

Explanation: b = a does not create a copy; it creates an alias. Both a and b reference the same object in memory. When b.items.append(3) modifies the list, the change is visible through a as well.

Q5. Which OOP concept is demonstrated when a `Survey` class contains a list of `Question` objects?

A) Inheritance

B) Composition

C) Polymorphism

D) Encapsulation

Explanation: Composition is a "has-a" relationship where one class contains instances of another class as attributes. A Survey has Question objects. This is different from inheritance, which is an "is-a" relationship.

Q6. What is the output of the following code?

class Animal:
    def speak(self):
        return "..."

class Dog(Animal):
    def speak(self):
        return "Woof"

class Cat(Animal):
    def speak(self):
        return "Meow"

animals = [Dog(), Cat(), Dog()]
print([a.speak() for a in animals])

A) ["...", "...", "..."]

B) ["Woof", "Meow", "Woof"]

C) An error because Animal.speak() is overridden

D) ["Woof", "Woof", "Meow"]

Explanation: This is polymorphism. Each object in the list calls its own class's version of speak(). The Dog objects return "Woof" and the Cat object returns "Meow". The base class method is overridden by each subclass.

Q7. What is the difference between `==` and `is` in Python?

A) They are identical and can always be used interchangeably

B) == checks identity while is checks equality

C) == checks value equality while is checks if two references point to the same object

D) is only works with numbers and strings

Explanation: The == operator compares values (calling __eq__). The is operator compares identity — whether two variables reference the exact same object in memory. Use is primarily for None checks.

Q8. What does `super().__init__(data)` do inside a subclass constructor?

A) It creates a new parent class object

B) It calls the parent class's __init__ method to initialize inherited attributes

C) It copies all methods from the parent class

D) It makes the subclass a static class

Explanation: super() returns a proxy object for the parent class. Calling super().__init__(data) invokes the parent's constructor so that any attributes or setup logic defined in the parent class are properly initialized in the subclass instance.

Q9. If a class defines `__repr__` but NOT `__str__`, what does `print(obj)` use?

A) It raises an AttributeError

B) It prints the memory address only

C) It falls back to __repr__

D) It prints an empty string

Explanation: When __str__ is not defined, Python falls back to __repr__ for print() and str() calls. The reverse is not true — repr() will never fall back to __str__. This is why __repr__ is the more important method to implement.

Q10. Consider the following code. What does `p1 == p2` evaluate to?

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p1 = Point(3, 4)
p2 = Point(3, 4)
print(p1 == p2)

A) True — because they have the same x and y values

B) False — because __eq__ is not defined, so it defaults to identity comparison

C) An error is raised because __eq__ is not defined

D) True — because Python compares all attributes automatically

Explanation: Without a custom __eq__ method, Python's default == falls back to identity comparison (same as is). Since p1 and p2 are two different objects in memory, p1 == p2 returns False. You must implement __eq__ for value-based equality.
