Chapter 6 Object-Oriented Programming for Data Scientists

6.1 Introduction to OOP

Object-Oriented Programming (OOP) is a programming paradigm centered around the concept of objects, which are instances of classes. OOP is particularly beneficial in data science because it allows for modular, reusable, and maintainable code. By encapsulating data and functions that operate on the data into objects, OOP helps in organizing complex data workflows and creating scalable data science solutions.

Key Concepts of OOP:

  1. Class: A blueprint for creating objects. A class defines a set of attributes and methods that the created objects will have.

  2. Object: An instance of a class. Each object can have unique attribute values, even if they share the same class.

  3. Encapsulation: Bundling of data (attributes) and methods (functions) that operate on the data into a single unit (class).

  4. Inheritance: A mechanism where one class can inherit attributes and methods from another class.

  5. Polymorphism: The ability to use a single interface to represent different underlying data types.

  6. Abstraction: Hiding the complex implementation details and showing only the necessary features of an object.
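Inheritance and polymorphism, in particular, are easiest to see in a few lines of code. The sketch below is illustrative only — the Model, LinearModel, and TreeModel class names are made up, not from any library:

```python
# Inheritance: LinearModel and TreeModel inherit from Model.
# Polymorphism: the same describe() call works on any of them.
class Model:
    def describe(self):
        return "generic model"

class LinearModel(Model):
    def describe(self):          # overrides the parent method
        return "linear model"

class TreeModel(Model):
    def describe(self):
        return "tree model"

for m in [Model(), LinearModel(), TreeModel()]:
    print(m.describe())
```

Each subclass supplies its own describe, yet the loop treats all three objects through the single Model interface.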

6.1.1 What is Object-Oriented Programming?

Object-Oriented Programming is a way of designing software by defining data structures as objects that can contain both data and functions. These objects interact with one another to perform tasks and solve problems. OOP concepts are particularly useful in data science for creating custom data structures, implementing machine learning models, and managing data pipelines.

6.1.2 Advantages of OOP

  • Modularity: The source code for an object can be written and maintained independently of the source code for other objects.

  • Reusability: Once an object is created, it can be reused in different programs.

  • Scalability: OOP makes it easier to manage and scale large codebases.

  • Maintainability: Code is easier to maintain and modify over time.

6.1.3 OOP vs. Procedural Programming

Procedural programming focuses on functions, or procedures, that perform operations on data. In contrast, OOP focuses on objects that encapsulate data and the functions that operate on the data.

Procedural Programming Example:

# Procedural approach to calculating area of a rectangle
def calculate_area(length, width):
    return length * width

length = 5
width = 3
area = calculate_area(length, width)
print(f"The area of the rectangle is {area}")

Object-Oriented Programming Example:

# OOP approach to calculating area of a rectangle
class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width
    
    def calculate_area(self):
        return self.length * self.width

rect = Rectangle(5, 3)
area = rect.calculate_area()
print(f"The area of the rectangle is {area}")

In the OOP example, the Rectangle class encapsulates both the data (length and width) and the behavior (calculate_area) in one place. This encapsulation makes the code more modular and easier to manage.

6.1.4 Practical Example in Data Science

Let’s consider a practical example in data science where we create a class to represent a dataset and perform basic operations on it.

import pandas as pd

class DataSet:
    def __init__(self, data):
        self.data = pd.DataFrame(data)
    
    def get_summary(self):
        return self.data.describe()
    
    def add_column(self, column_name, data):
        self.data[column_name] = data
    
    def get_column(self, column_name):
        return self.data[column_name]

# Creating a dataset instance
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1]
}

dataset = DataSet(data)
print(dataset.get_summary())

# Adding a new column
dataset.add_column('C', [10, 20, 30, 40, 50])
print(dataset.get_column('C'))

In this example, the DataSet class encapsulates the data in a Pandas DataFrame and provides methods to get a summary of the data, add a new column, and retrieve a specific column. This approach makes the code more organized and reusable, highlighting the advantages of OOP in data science.

6.2 Classes and Objects

In Object-Oriented Programming (OOP), a class is a blueprint for creating objects (instances of the class). A class defines a set of attributes and methods that the created objects will have. Understanding classes and objects is fundamental to leveraging OOP in your data science projects.

6.2.1 Understanding Classes

A class is a template for creating objects. It defines a set of attributes that will characterize any object created from the class and the methods that can be performed on these objects.

Basic Structure of a Class:

class MyClass:
    # Class attribute
    class_variable = "I am a class variable"

    # Constructor
    def __init__(self, instance_variable):
        self.instance_variable = instance_variable

    # Instance method
    def display(self):
        print(f"Class Variable: {self.class_variable}")
        print(f"Instance Variable: {self.instance_variable}")

# Creating an instance of the class
obj = MyClass("I am an instance variable")
obj.display()

In this example, MyClass has a class attribute class_variable, an instance attribute instance_variable, and an instance method display.

6.2.2 Creating Objects

An object is an instance of a class. When a class is defined, no memory is allocated until an object of that class is created. Each object can have different attribute values, even if they share the same class.

Creating and Using Objects:

class DataPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def display_point(self):
        print(f"Point({self.x}, {self.y})")

# Creating objects
point1 = DataPoint(1, 2)
point2 = DataPoint(3, 4)

point1.display_point()  # Output: Point(1, 2)
point2.display_point()  # Output: Point(3, 4)

In this example, DataPoint is a class with attributes x and y and a method display_point. We create two instances of DataPoint, each with different values for x and y.

6.2.3 The self Parameter

In Python, the self parameter is a reference to the current instance of the class. It is used to access attributes and methods that belong to the instance. It must be the first parameter of any instance method in the class.

Using self:

class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def display_employee(self):
        print(f"Name: {self.name}, Salary: {self.salary}")

# Creating an instance
emp1 = Employee("John", 50000)
emp1.display_employee()  # Output: Name: John, Salary: 50000

Here, self.name and self.salary refer to the name and salary attributes of the instance emp1.

6.2.4 Real-world Examples in Data Science

Let’s create a class to represent a simple linear regression model.

Simple Linear Regression Class:

import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, X, y):
        X_mean = np.mean(X)
        y_mean = np.mean(y)
        self.coefficient = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
        self.intercept = y_mean - self.coefficient * X_mean

    def predict(self, X):
        return self.coefficient * X + self.intercept

# Sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 6, 5])

# Creating an instance and fitting the model
model = SimpleLinearRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(X)
print("Predictions:", predictions)

In this example:

  • The SimpleLinearRegression class encapsulates the linear regression logic.

  • The fit method calculates the coefficient and intercept based on the input data.

  • The predict method uses the fitted model to make predictions on new data.

This approach makes the linear regression model reusable and easy to integrate into larger data science workflows.

6.3 Attributes and Methods

In Object-Oriented Programming (OOP), attributes and methods are key components of classes. Attributes are variables that hold data, while methods are functions that operate on this data. Understanding how to define and use attributes and methods is crucial for building effective data science applications using OOP.

6.3.1 Instance Attributes

Instance attributes are variables that hold data specific to an instance of a class. They are defined within the __init__ method and are prefixed with self to indicate that they belong to the instance.

Defining and Using Instance Attributes:

class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

    def display_student(self):
        print(f"Name: {self.name}, Grade: {self.grade}")

# Creating instances
student1 = Student("Alice", "A")
student2 = Student("Bob", "B")

student1.display_student()  # Output: Name: Alice, Grade: A
student2.display_student()  # Output: Name: Bob, Grade: B

In this example, name and grade are instance attributes that store data specific to each Student instance.

6.3.1.1 Class Attributes

Class attributes are variables that are shared across all instances of a class. They are defined directly within the class body, outside any methods.

Defining and Using Class Attributes:

class School:
    school_name = "Greenwood High"  # Class attribute

    def __init__(self, student_name):
        self.student_name = student_name  # Instance attribute

    def display_student(self):
        print(f"Student: {self.student_name}, School: {School.school_name}")

# Creating instances
student1 = School("Alice")
student2 = School("Bob")

student1.display_student()  # Output: Student: Alice, School: Greenwood High
student2.display_student()  # Output: Student: Bob, School: Greenwood High

In this example, school_name is a class attribute shared by all instances of the School class, while student_name is an instance attribute unique to each instance.

6.3.1.2 Instance Methods

Instance methods are functions defined within a class that operate on instance attributes. They must include self as their first parameter to access instance attributes and other methods.

Defining and Using Instance Methods:

class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def calculate_area(self):
        return self.length * self.width

    def display_area(self):
        print(f"Area: {self.calculate_area()}")

# Creating an instance
rect = Rectangle(5, 3)
rect.display_area()  # Output: Area: 15

In this example, calculate_area and display_area are instance methods that operate on the length and width attributes of the Rectangle instance.

6.3.1.3 Class Methods and Static Methods

Class methods and static methods are two special types of methods in Python classes. Class methods receive the class itself and typically work with class attributes or serve as alternative constructors, while static methods are utility functions that do not access instance or class state.

Class Methods: Class methods are defined using the @classmethod decorator and take cls (the class itself) as the first parameter.

class Circle:
    pi = 3.14159  # Class attribute

    def __init__(self, radius):
        self.radius = radius

    @classmethod
    def from_diameter(cls, diameter):
        radius = diameter / 2
        return cls(radius)

    def calculate_area(self):
        return Circle.pi * (self.radius ** 2)

# Creating an instance using the class method
circle = Circle.from_diameter(10)
print(f"Radius: {circle.radius}, Area: {circle.calculate_area()}")  # Output: Radius: 5.0, Area: 78.53975

Static Methods: Static methods are defined using the @staticmethod decorator and do not take self or cls as a parameter.

class MathUtils:
    @staticmethod
    def add(a, b):
        return a + b

# Using the static method
result = MathUtils.add(5, 3)
print(f"Result: {result}")  # Output: Result: 8

In this example, add is a static method that performs addition without accessing any class or instance attributes.

6.3.2 Practical Examples with Data Science Models

Let’s create a class to represent a simple linear regression model with attributes and methods tailored to data science.

Linear Regression Class with Attributes and Methods:

import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, X, y):
        X_mean = np.mean(X)
        y_mean = np.mean(y)
        self.coefficient = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
        self.intercept = y_mean - self.coefficient * X_mean

    def predict(self, X):
        return self.coefficient * X + self.intercept

    def score(self, X, y):
        y_pred = self.predict(X)
        ss_total = np.sum((y - np.mean(y)) ** 2)
        ss_residual = np.sum((y - y_pred) ** 2)
        r2_score = 1 - (ss_residual / ss_total)
        return r2_score

# Sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 6, 5])

# Creating an instance and fitting the model
model = SimpleLinearRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(X)
print("Predictions:", predictions)

# Calculating R-squared score
r2 = model.score(X, y)
print("R-squared score:", r2)

In this example:

  • The SimpleLinearRegression class encapsulates the logic for fitting a linear regression model, making predictions, and calculating the R-squared score.

  • The fit method calculates the coefficient and intercept.

  • The predict method uses the fitted model to make predictions on new data.

  • The score method calculates the R-squared score to evaluate the model’s performance.

This approach demonstrates the power of using attributes and methods in a class to organize and encapsulate the functionality of a data science model.


6.3.3 Dynamic Variables

Dynamic variables are typically instance variables in the context of classes. They are created and managed at runtime, usually within the methods of a class. Each instance (or object) of a class can have different values for these variables.

6.3.3.1 Characteristics of Dynamic Variables:

  • Instance-specific: Each instance of a class can have its own unique values for these variables.
  • Defined within methods: Typically created and accessed using the self keyword within instance methods.
  • Dynamic in nature: Can be added, modified, or deleted at runtime.

Example:

class DynamicExample:
    def __init__(self, value):
        self.dynamic_variable = value

    def update_value(self, new_value):
        self.dynamic_variable = new_value

# Creating instances
obj1 = DynamicExample(10)
obj2 = DynamicExample(20)

print(obj1.dynamic_variable)  # Output: 10
print(obj2.dynamic_variable)  # Output: 20

# Updating dynamic variable
obj1.update_value(30)
print(obj1.dynamic_variable)  # Output: 30
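Because instance attributes are stored per object, they can also be added or removed after the object is created — the "dynamic in nature" point above. A small sketch (the extra attribute name is made up for illustration):

```python
class DynamicExample:
    def __init__(self, value):
        self.dynamic_variable = value

obj = DynamicExample(10)

# Attributes can be added on the fly...
obj.extra = "added later"
print(obj.extra)              # Output: added later

# ...and deleted again
del obj.extra
print(hasattr(obj, "extra"))  # Output: False
```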

6.3.4 Static Variables

Static variables, also known as class variables, are shared across all instances of a class. They are defined at the class level and are not tied to any specific instance. These variables are accessed using the class name or any instance.

6.3.4.1 Characteristics of Static Variables:

  • Class-wide: Shared among all instances of the class.
  • Defined within the class but outside any methods: Typically declared directly within the class body.
  • Consistent across instances: Changing the value affects all instances.

Example:

class StaticExample:
    static_variable = 42  # This is a static variable

    def __init__(self, value):
        self.instance_variable = value  # This is a dynamic variable

# Creating instances
obj1 = StaticExample(10)
obj2 = StaticExample(20)

print(obj1.static_variable)  # Output: 42
print(obj2.static_variable)  # Output: 42

# Updating static variable through class
StaticExample.static_variable = 100

print(obj1.static_variable)  # Output: 100
print(obj2.static_variable)  # Output: 100

# Assigning through an instance creates a new instance attribute
# that shadows the class variable for that instance only
obj1.static_variable = 200
print(obj1.static_variable)  # Output: 200
print(obj2.static_variable)  # Output: 100

# The class variable itself remains unchanged
print(StaticExample.static_variable)  # Output: 100

In this example:

  • static_variable is a class variable (static), shared across all instances.
  • instance_variable is an instance variable (dynamic), unique to each instance.

6.3.5 Summary

  • Dynamic Variables: Instance-specific, created within methods, and can vary between instances.
  • Static Variables: Shared across all instances, defined at the class level, and maintain consistent values unless explicitly changed.

6.4 Pandas Exercises

import pandas as pd
import numpy as np

6.4.1 Question

Write a Pandas program to create and display a one-dimensional array-like object (a Series) from a list of data.

import pandas as pd

# Create a Pandas Series
data = [10, 20, 30, 40, 50]
series = pd.Series(data)

# Display the Series
print(series)
## 0    10
## 1    20
## 2    30
## 3    40
## 4    50
## dtype: int64

6.4.2 Question

Write a Pandas program to convert a Pandas Series to a Python list and print its type.

import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])

# Convert the Series to a Python list

data_list = list(data)      # general Python approach

data_list = data.to_list()  # Pandas method


print(type(data_list))
## <class 'list'>

6.4.3 Question

Write a Pandas program to add, subtract, multiply, and divide two Pandas Series.

series_a = pd.Series([2, 4, 6, 8, 10])
series_b = pd.Series([1, 3, 5, 7, 9])

# addition
print('addition\n', series_a + series_b)
## addition
##  0     3
## 1     7
## 2    11
## 3    15
## 4    19
## dtype: int64
# subtraction
print('\nsubtraction\n', series_a - series_b)
## 
## subtraction
##  0    1
## 1    1
## 2    1
## 3    1
## 4    1
## dtype: int64
# multiplication
print('\nmultiplication\n', series_a * series_b)
## 
## multiplication
##  0     2
## 1    12
## 2    30
## 3    56
## 4    90
## dtype: int64
# division
print('\ndivision\n', series_a / series_b)
## 
## division
##  0    2.000000
## 1    1.333333
## 2    1.200000
## 3    1.142857
## 4    1.111111
## dtype: float64

6.4.4 Question

Write a Pandas program to compare the elements of the two Pandas Series.

series_a = pd.Series([2, 4, 6, 8, 10])
series_b = pd.Series([1, 3, 5, 7, 10])

series_a == series_b
## 0    False
## 1    False
## 2    False
## 3    False
## 4     True
## dtype: bool
series_a > series_b
## 0     True
## 1     True
## 2     True
## 3     True
## 4    False
## dtype: bool
series_a < series_b
## 0    False
## 1    False
## 2    False
## 3    False
## 4    False
## dtype: bool
series_a != series_b
## 0     True
## 1     True
## 2     True
## 3     True
## 4    False
## dtype: bool

6.4.5 Question

Write a Pandas program to convert a dictionary to a Pandas series.

dict_a = {"cola": [1, 2, 3]}

pd.Series(dict_a['cola'])
## 0    1
## 1    2
## 2    3
## dtype: int64
dict_a = {"a": 1, "b": 2, "c":3}

pd.Series(dict_a)
## a    1
## b    2
## c    3
## dtype: int64

6.4.6 Sorting with sort_values

sort_values is a method in the pandas library used to sort the values in a DataFrame or Series.

  • Purpose: sort_values is used to sort a DataFrame or Series by one or more columns or by the values in the Series.

  • Syntax:

    DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

6.4.6.1 Examples:

Sample dataframe

import pandas as pd

data = {
   'NAME': ['David', 'Alice', 'Charlie',  'Bob',],
   'AGE': [25, 30,  40, 35],
   'SALARY': [50000, 600000, 55000, 700000]
}

df = pd.DataFrame(data)
df
##       NAME  AGE  SALARY
## 0    David   25   50000
## 1    Alice   30  600000
## 2  Charlie   40   55000
## 3      Bob   35  700000
  1. Sort by a Single Column:
# Sort by 'NAME' in ascending order (the default)
sorted_df = df.sort_values(by='NAME')
print(sorted_df)
##       NAME  AGE  SALARY
## 1    Alice   30  600000
## 3      Bob   35  700000
## 2  Charlie   40   55000
## 0    David   25   50000
  2. Sort by Multiple Columns:
# Sort by 'AGE' in ascending order, then by 'SALARY' in descending order

sorted_df = df.sort_values(by=['AGE', 'SALARY'], ascending=[True, False])

print(sorted_df)
##       NAME  AGE  SALARY
## 0    David   25   50000
## 1    Alice   30  600000
## 3      Bob   35  700000
## 2  Charlie   40   55000
  3. Handling NaN Values:
data = {
   'NAME': ['Alice', 'Bob', 'Charlie', 'David'],
   'AGE': [25, 30, None, 40]
}

df = pd.DataFrame(data)

# Sort by 'AGE' and place NaN values at the start

sorted_df = df.sort_values(by='AGE', na_position='first')
print(sorted_df)
##       NAME   AGE
## 2  Charlie   NaN
## 0    Alice  25.0
## 1      Bob  30.0
## 3    David  40.0

6.4.6.2 Key Points:

  • axis: Axis to sort along (0 for index, 1 for columns). Default is 0.

  • ignore_index: If True, the resulting index will be labeled 0, 1, 2, …, n - 1. Default is False.

  • inplace: If True, perform the operation in place and return None. Default is False.
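These options can be sketched on a small made-up frame (the column name x is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'x': [3, 1, 2]})

# ignore_index=True relabels the sorted result 0, 1, ..., n-1
out = df.sort_values(by='x', ignore_index=True)
print(out.index.tolist())   # [0, 1, 2]

# inplace=True sorts df itself and returns None
result = df.sort_values(by='x', inplace=True)
print(result)               # None
print(df['x'].tolist())     # [1, 2, 3]
```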

6.4.7 groupby operation

The groupby operation in pandas is a powerful tool for aggregating and transforming data. It allows you to split your DataFrame into groups based on one or more columns, apply functions to each group, and then combine the results back into a DataFrame or Series.

import pandas as pd

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 75000, 65000],
    'bonus': [5000, 7000, 4000, 6000, 9000, 8500, 7500, 6500]
}

df_company = pd.DataFrame(data)
df_company
##   department employee  salary  bonus
## 0      Sales    Alice   70000   5000
## 1      Sales      Bob   80000   7000
## 2         HR  Charlie   50000   4000
## 3         HR    David   60000   6000
## 4         IT      Eve   90000   9000
## 5         IT    Frank   85000   8500
## 6    Finance    Grace   75000   7500
## 7    Finance   Hannah   65000   6500

6.4.7.1 Group by a Single Column

Suppose you want to find the average salary by department.

The result is a Series whose index is the department.

# Group by 'department' and calculate the mean salary
grouped_df = df_company.groupby('department')['salary'].mean()

print(grouped_df)
## department
## Finance    70000.0
## HR         55000.0
## IT         87500.0
## Sales      75000.0
## Name: salary, dtype: float64
print(type(grouped_df))
## <class 'pandas.core.series.Series'>
# Convert the resulting Series into a DataFrame
grouped_df = grouped_df.reset_index()

# Rename the column for clarity
grouped_df.columns = ['department', 'average_salary']

grouped_df
##   department  average_salary
## 0    Finance         70000.0
## 1         HR         55000.0
## 2         IT         87500.0
## 3      Sales         75000.0

6.4.7.2 Group by Multiple Columns:

You can group by multiple columns.

For example, find the total compensation (salary + bonus) for each employee in each department.

# Group by 'department' and 'employee', and calculate total compensation

df_company['total_compensation'] = df_company['salary'] + df_company['bonus']

grouped_df = df_company.groupby(['department', 'employee'])['total_compensation'].sum()

print(grouped_df)
## department  employee
## Finance     Grace       82500
##             Hannah      71500
## HR          Charlie     54000
##             David       66000
## IT          Eve         99000
##             Frank       93500
## Sales       Alice       75000
##             Bob         87000
## Name: total_compensation, dtype: int64
print(type(grouped_df))
## <class 'pandas.core.series.Series'>

6.4.7.3 Aggregation Functions:

You can use multiple aggregation functions on the grouped data.

For example, find the sum and mean of the salary and bonus for each department.

# Group by 'department' and calculate sum and mean of 'salary' and 'bonus'

agg_df = df_company.groupby('department').agg({ 'salary': ['sum', 'mean'],
                                                'bonus': ['sum', 'mean']
                                              })

print(agg_df)
##             salary           bonus        
##                sum     mean    sum    mean
## department                                
## Finance     140000  70000.0  14000  7000.0
## HR          110000  55000.0  10000  5000.0
## IT          175000  87500.0  17500  8750.0
## Sales       150000  75000.0  12000  6000.0
print(type(agg_df))
## <class 'pandas.core.frame.DataFrame'>

6.4.7.4 groupby with un-nested (flat) output columns

agg_df = df_company.groupby('department').agg(
              total_salary = pd.NamedAgg(column='salary', aggfunc='sum'),
              avg_salary = pd.NamedAgg(column='salary', aggfunc='mean'),
              total_bonus = pd.NamedAgg(column='bonus', aggfunc='sum'),
              avg_bonus = pd.NamedAgg(column='bonus', aggfunc='mean')
)

print(agg_df)
##             total_salary  avg_salary  total_bonus  avg_bonus
## department                                                  
## Finance           140000     70000.0        14000     7000.0
## HR                110000     55000.0        10000     5000.0
## IT                175000     87500.0        17500     8750.0
## Sales             150000     75000.0        12000     6000.0
agg_df = df_company.groupby('department').agg({
                                            'salary': 'sum', 'bonus': 'sum'
                                            }).rename(columns={
                                              'salary': 'total_salary', 
                                              'bonus': 'total_bonus'
                                              })

agg_df['avg_salary'] = df_company.groupby('department')['salary'].mean()
agg_df['avg_bonus'] = df_company.groupby('department')['bonus'].mean()

print(agg_df)
##             total_salary  total_bonus  avg_salary  avg_bonus
## department                                                  
## Finance           140000        14000     70000.0     7000.0
## HR                110000        10000     55000.0     5000.0
## IT                175000        17500     87500.0     8750.0
## Sales             150000        12000     75000.0     6000.0

6.4.7.5 Transformation

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)

# Group by 'department' and calculate sum of 'salary' with transform

df['total_salary'] = df.groupby('department')['salary'].transform('sum')

print(df)
##   department employee  salary  total_salary
## 0      Sales    Alice   70000        150000
## 1      Sales      Bob   80000        150000
## 2         HR  Charlie   50000        110000
## 3         HR    David   60000        110000
## 4         IT      Eve   90000        175000
## 5         IT    Frank   85000        175000
## 6    Finance    Grace   95000        160000
## 7    Finance   Hannah   65000        160000
df[df['salary']>80000]
##   department employee  salary  total_salary
## 4         IT      Eve   90000        175000
## 5         IT    Frank   85000        175000
## 6    Finance    Grace   95000        160000

6.4.7.6 Example

Let’s demean the salary column (subtract the group mean) within each group defined by the department column, the first step of a standardization.

# Demean 'salary' within each department using transform

df2 = df.copy()

df2['demeaned_salary'] = df2.groupby('department')['salary'].transform(lambda x: (x - x.mean()))

print(df2)
##   department employee  salary  total_salary  demeaned_salary
## 0      Sales    Alice   70000        150000          -5000.0
## 1      Sales      Bob   80000        150000           5000.0
## 2         HR  Charlie   50000        110000          -5000.0
## 3         HR    David   60000        110000           5000.0
## 4         IT      Eve   90000        175000           2500.0
## 5         IT    Frank   85000        175000          -2500.0
## 6    Finance    Grace   95000        160000          15000.0
## 7    Finance   Hannah   65000        160000         -15000.0
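Dividing by the group standard deviation as well turns the demeaned values into full z-scores. A sketch on a reduced, made-up frame (the z_salary column name is illustrative):

```python
import pandas as pd

# Reduced, made-up frame for illustration
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'HR', 'HR'],
    'salary': [70000, 80000, 50000, 60000],
})

# z-score within each department: subtract the group mean,
# then divide by the group standard deviation (ddof=1 by default)
df['z_salary'] = df.groupby('department')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

print(df)
```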

6.4.7.7 Filtering Groups:

You can filter out groups that meet a specific condition.

For example, keep only departments with a total salary of more than $150,000.

# Group by 'department' and filter departments with total salary > 150,000

filtered_df = df.groupby('department').filter(lambda x: x['salary'].sum() > 150000)

print(filtered_df)
##   department employee  salary  total_salary
## 4         IT      Eve   90000        175000
## 5         IT    Frank   85000        175000
## 6    Finance    Grace   95000        160000
## 7    Finance   Hannah   65000        160000
df2['total_salary'] = df.groupby('department')['salary'].transform('sum')

df2 = df2[df2['total_salary'] > 150000]  

df2
##   department employee  salary  total_salary  demeaned_salary
## 4         IT      Eve   90000        175000           2500.0
## 5         IT    Frank   85000        175000          -2500.0
## 6    Finance    Grace   95000        160000          15000.0
## 7    Finance   Hannah   65000        160000         -15000.0

6.4.7.8 Examples

df2 = df.copy()
df2 = df2.groupby('department').agg({'salary': ['sum', 'mean', 'count', 'min', 'max']})

df2
##             salary                             
##                sum     mean count    min    max
## department                                     
## Finance     160000  80000.0     2  65000  95000
## HR          110000  55000.0     2  50000  60000
## IT          175000  87500.0     2  85000  90000
## Sales       150000  75000.0     2  70000  80000
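The nested column labels produced by this style of agg form a MultiIndex; they can be flattened into single-level names by joining each (column, statistic) pair. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'HR', 'HR'],
    'salary': [70000, 80000, 50000, 60000],
})

agg_df = df.groupby('department').agg({'salary': ['sum', 'mean']})

# agg_df.columns is a MultiIndex of pairs like ('salary', 'sum');
# join each pair into a flat name such as 'salary_sum'
agg_df.columns = ['_'.join(col) for col in agg_df.columns]

print(agg_df.columns.tolist())   # ['salary_sum', 'salary_mean']
```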


6.4.8 Sample Solutions

These sample solutions are adapted from Interview Query.

6.4.8.1 Question

import pandas as pd

name_list = ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"]
age_list = [19, 20, 21, 20, 23]
color_list = ["red", "yellow", "green", "blue", "green"]
grades = [91, 95, 82, 75, 93]


students = {"name" : name_list,
            "age" : age_list,
            "favorite_color" : color_list,
            "grade" : grades}

students_df = pd.DataFrame(students)

students_df
##               name  age favorite_color  grade
## 0         Tim Voss   19            red     91
## 1   Nicole Johnson   20         yellow     95
## 2    Elsa Williams   21          green     82
## 3       John James   20           blue     75
## 4  Catherine Jones   23          green     93

Write a function named grades_colors to select only the rows where the student’s favorite color is green or red and their grade is above 90.

def grades_colors(df):
  
    df = df[(df['favorite_color'].isin(['green', 'red'])) & (df['grade'] > 90)]
  
    return df

grades_colors(students_df)
##               name  age favorite_color  grade
## 0         Tim Voss   19            red     91
## 4  Catherine Jones   23          green     93

An alternative using the query method:

students_df.query("favorite_color.isin(('green', 'red')) and grade > 90")
##               name  age favorite_color  grade
## 0         Tim Voss   19            red     91
## 4  Catherine Jones   23          green     93

Using the query method with a variable referenced via @:

colors = ["green", "red"]

students_df.query("favorite_color in @colors").query("grade > 90")
##               name  age favorite_color  grade
## 0         Tim Voss   19            red     91
## 4  Catherine Jones   23          green     93

Using loc

students_df.loc[(students_df['favorite_color'].isin(['green', 'red'])) &      
                (students_df['grade'] > 90)
                ]
##               name  age favorite_color  grade
## 0         Tim Voss   19            red     91
## 4  Catherine Jones   23          green     93

Alternative

color = students_df['favorite_color'].isin(['green', 'red'])
grade = students_df['grade'] > 90

students_df.loc[(grade) & (color)]

6.4.8.2 Question

You are given a dataframe with a single column, ‘var’.

Calculate the t-value for the mean of ‘var’ against the null hypothesis that \(\mu = \mu_0\).

Note: You do not have to calculate the p-value of the test or run the test.

var_data = [2,3,4,5,6,7,8,8,10]

df = pd.DataFrame({"var": var_data})

mu_0 = 5

def t_score(mu_0, df):
  
    n = df['var'].count()
    
    sample_mean = df['var'].mean()
    
    sample_std = df['var'].std()
    
    t = (sample_mean - mu_0) / (sample_std / pow(n, 1/2))
    
    return t

t_score(mu_0, df)
## 1.018055620761245
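
As a cross-check (assuming scipy is installed), scipy.stats.ttest_1samp computes the same statistic:

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({"var": [2, 3, 4, 5, 6, 7, 8, 8, 10]})

# Same formula under the hood: (sample_mean - mu_0) / (sample_std / sqrt(n))
t_stat, p_value = stats.ttest_1samp(df['var'], popmean=5)
print(t_stat)  # matches t_score(mu_0, df) above
```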

6.4.8.3 Question

Given a dataframe with three columns: client_id, ranking, value

Write a function to fill the NaN values in the value column with the previous non-NaN value from the same client_id ranked in ascending order.

If there is no previous non-NaN value for the same client_id, leave the value as NaN.

client_id = [1001, 1001, 1001, 1002, 1002, 1002, 1003, 1003]
ranking = [1, 2, 3, 1, 2, 3, 1, 2]
value = [1000, pd.NA, 1200, 1500, 1250, pd.NA, 1100, pd.NA]

clients_df = pd.DataFrame({
                  'client_id': client_id,
                  'ranking': ranking,
                  'value': value
})


def previous_nan_values(clients_df):

    clients_df = clients_df.sort_values(by=['client_id', 'ranking'])

    clients_df['value2'] = clients_df.groupby('client_id')['value'].ffill()

    return clients_df
  
previous_nan_values(clients_df)
##    client_id  ranking value value2
## 0       1001        1  1000   1000
## 1       1001        2  <NA>   1000
## 2       1001        3  1200   1200
## 3       1002        1  1500   1500
## 4       1002        2  1250   1250
## 5       1002        3  <NA>   1250
## 6       1003        1  1100   1100
## 7       1003        2  <NA>   1100

6.4.10 Data Filtering and Selection:

Sample data:

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)
df
##   department employee  salary
## 0      Sales    Alice   70000
## 1      Sales      Bob   80000
## 2         HR  Charlie   50000
## 3         HR    David   60000
## 4         IT      Eve   90000
## 5         IT    Frank   85000
## 6    Finance    Grace   95000
## 7    Finance   Hannah   65000

6.4.10.1 Filtering Rows

df[df['employee'] == 'Eve']
##   department employee  salary
## 4         IT      Eve   90000
df[df['department'].isin(['HR', 'IT'])]
##   department employee  salary
## 2         HR  Charlie   50000
## 3         HR    David   60000
## 4         IT      Eve   90000
## 5         IT    Frank   85000
df[df['salary'] > 80000]
##   department employee  salary
## 4         IT      Eve   90000
## 5         IT    Frank   85000
## 6    Finance    Grace   95000
df[(df['salary'] > 70000) & (df['salary'] < 90000)]
##   department employee  salary
## 1      Sales      Bob   80000
## 5         IT    Frank   85000
df.groupby('department').filter(lambda x: x['salary'].sum() > 150000)
##   department employee  salary
## 4         IT      Eve   90000
## 5         IT    Frank   85000
## 6    Finance    Grace   95000
## 7    Finance   Hannah   65000
df.loc[0:3, :]
##   department employee  salary
## 0      Sales    Alice   70000
## 1      Sales      Bob   80000
## 2         HR  Charlie   50000
## 3         HR    David   60000
df.iloc[0:3, :]
##   department employee  salary
## 0      Sales    Alice   70000
## 1      Sales      Bob   80000
## 2         HR  Charlie   50000
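
Note the difference in the last two results: .loc slices by label and includes the endpoint, while .iloc slices by position and excludes it. A minimal illustration:

```python
import pandas as pd

s = pd.DataFrame({'x': [10, 20, 30, 40]})

# Label-based: endpoint 2 is included -> rows 0, 1, 2
print(len(s.loc[0:2]))   # 3

# Position-based: endpoint 2 is excluded -> rows 0, 1
print(len(s.iloc[0:2]))  # 2
```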

6.4.10.2 Filtering Columns

cols = ['employee', 'salary']

df[cols]
##   employee  salary
## 0    Alice   70000
## 1      Bob   80000
## 2  Charlie   50000
## 3    David   60000
## 4      Eve   90000
## 5    Frank   85000
## 6    Grace   95000
## 7   Hannah   65000
df.loc[0:3, ['employee', 'salary']]
##   employee  salary
## 0    Alice   70000
## 1      Bob   80000
## 2  Charlie   50000
## 3    David   60000
df.iloc[0:3, 1:3]
##   employee  salary
## 0    Alice   70000
## 1      Bob   80000
## 2  Charlie   50000

6.4.11 Aggregation and Grouping:

df.groupby('department')['salary'].mean()
## department
## Finance    80000.0
## HR         55000.0
## IT         87500.0
## Sales      75000.0
## Name: salary, dtype: float64

6.4.12 Joining and Merging Data:

merged_df = pd.merge(df1, df2, on='key')
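
The snippet above assumes two dataframes df1 and df2 that share a key column. A minimal runnable sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical tables sharing the 'key' column
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Default how='inner': only keys present in both frames survive
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)
##   key  left_val  right_val
## 0   b         2         20
## 1   c         3         30
```

Passing how='left', how='right', or how='outer' keeps unmatched keys from one or both sides.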

6.4.13 Recoding

Sample Data

data = {
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 75000, 65000]
}

df = pd.DataFrame(data)

df.head()
##   employee  salary
## 0    Alice   70000
## 1      Bob   80000
## 2  Charlie   50000
## 3    David   60000
## 4      Eve   90000

6.4.13.1 np.where() method

import numpy as np

# this condition creates a boolean Series
condition = df['salary'] > 85000

df['label'] = np.where(condition, 'high', 'low')

df.head()
##   employee  salary label
## 0    Alice   70000   low
## 1      Bob   80000   low
## 2  Charlie   50000   low
## 3    David   60000   low
## 4      Eve   90000  high
condition = df['employee'] == 'David'

df['employee'] = np.where(condition, 'Davut', df['employee'])

df.head()
##   employee  salary label
## 0    Alice   70000   low
## 1      Bob   80000   low
## 2  Charlie   50000   low
## 3    Davut   60000   low
## 4      Eve   90000  high

6.4.13.2 multiple np.where()

condition1 = df['salary'] <= 60000 
condition2 = (df['salary'] > 60000) & (df['salary'] <= 80000)


df['label'] = np.where(condition1, 'Low', 'High')
df['label'] = np.where(condition2, 'Medium', df['label'])


df.head()
##   employee  salary   label
## 0    Alice   70000  Medium
## 1      Bob   80000  Medium
## 2  Charlie   50000     Low
## 3    Davut   60000     Low
## 4      Eve   90000    High
# Use np.where for multiple conditions

condition_1 = df['salary'] <= 60000
condition_2 = df['salary'] <= 80000

df['label2'] = np.where(condition_1, 'Low',
                        np.where(condition_2, 'Medium', 'High'))

print(df)
##   employee  salary   label  label2
## 0    Alice   70000  Medium  Medium
## 1      Bob   80000  Medium  Medium
## 2  Charlie   50000     Low     Low
## 3    Davut   60000     Low     Low
## 4      Eve   90000    High    High
## 5    Frank   85000    High    High
## 6    Grace   75000  Medium  Medium
## 7   Hannah   65000  Medium  Medium

6.4.13.3 pd.cut() method

# Define thresholds for 'low', 'medium', and 'high' earners

bins = [0, 60000, 80000, float('inf')]
labels = ['Low', 'Medium', 'High']

# Use pd.cut to create a new column 'category'
df['category'] = pd.cut(df['salary'], bins=bins, labels=labels)

df.head()
##   employee  salary   label  label2 category
## 0    Alice   70000  Medium  Medium   Medium
## 1      Bob   80000  Medium  Medium   Medium
## 2  Charlie   50000     Low     Low      Low
## 3    Davut   60000     Low     Low      Low
## 4      Eve   90000    High    High     High
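
One detail worth noting: pd.cut uses right-closed intervals by default (right=True), so a salary of exactly 60000 falls in the (0, 60000] bin and is labeled 'Low'. Passing right=False flips the boundary handling:

```python
import pandas as pd

salaries = pd.Series([60000, 60001])
bins = [0, 60000, 80000, float('inf')]
labels = ['Low', 'Medium', 'High']

# Default right=True: bins are (0, 60000], (60000, 80000], (80000, inf]
right_closed = pd.cut(salaries, bins=bins, labels=labels)
print(right_closed.tolist())   # ['Low', 'Medium']

# right=False: bins are [0, 60000), [60000, 80000), [80000, inf)
left_closed = pd.cut(salaries, bins=bins, labels=labels, right=False)
print(left_closed.tolist())    # ['Medium', 'Medium']
```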

6.4.13.4 Alternative to np.where()

6.4.13.4.1 Using pd.Series.apply with a Lambda Function:

Using an inline if/else expression in the lambda:

df['high_low'] = df['salary'].apply(lambda x: 'high' if x > 85000 else 'low')

6.4.13.4.2 Using DataFrame.assign with np.where:

df = df.assign(high_low2 = np.where(df['salary'] > 85000, 'high', 'low'))
df.head()
##   employee  salary   label  label2 category high_low high_low2
## 0    Alice   70000  Medium  Medium   Medium      low       low
## 1      Bob   80000  Medium  Medium   Medium      low       low
## 2  Charlie   50000     Low     Low      Low      low       low
## 3    Davut   60000     Low     Low      Low      low       low
## 4      Eve   90000    High    High     High     high      high
6.4.13.4.3 Using DataFrame.loc:

df['new_label'] = 'low'

df.loc[df['salary'] > 85000, 'new_label'] = 'high'

6.4.13.5 np.select() method

data = {
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 75000, 65000]
}

df = pd.DataFrame(data)
# Define conditions for categorizing salaries
conditions = [
    (df['salary'] <= 60000),
    (df['salary'] > 60000) & (df['salary'] <= 80000),
    (df['salary'] > 80000)
]

# Corresponding choices for each condition
choices = ['Low', 'Medium', 'High']

# Use np.select to create a new column 'category2'
df['category2'] = np.select(conditions, choices)

df.head()
##   employee  salary category2
## 0    Alice   70000    Medium
## 1      Bob   80000    Medium
## 2  Charlie   50000       Low
## 3    David   60000       Low
## 4      Eve   90000      High
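
The conditions above are exhaustive, so every row matches one of them. When they are not, np.select falls back to its default argument (0 unless specified), which would drop a numeric 0 into an otherwise string column; passing an explicit default avoids that. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'salary': [50000, 70000, 90000]})

conditions = [
    df['salary'] <= 60000,
    (df['salary'] > 60000) & (df['salary'] <= 80000),
]
choices = ['Low', 'Medium']

# Rows matching no condition get the default label instead of 0
df['category'] = np.select(conditions, choices, default='High')
print(df['category'].tolist())  # ['Low', 'Medium', 'High']
```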

6.4.13.6 apply() method

# Custom function to categorize salary
def categorize_salary(salary):
    if salary <= 60000:
        return 'Low'
    elif 60000 < salary <= 80000:
        return 'Medium'
    else:
        return 'High'

# Apply the custom function to create a new column 'category3'
df['category3'] = df['salary'].apply(categorize_salary)

df.head()
##   employee  salary category2 category3
## 0    Alice   70000    Medium    Medium
## 1      Bob   80000    Medium    Medium
## 2  Charlie   50000       Low       Low
## 3    David   60000       Low       Low
## 4      Eve   90000      High      High