Chapter 6 Object-Oriented Programming for Data Scientists
6.1 Introduction to OOP
Object-Oriented Programming (OOP) is a programming paradigm centered around the concept of objects, which are instances of classes. OOP is particularly beneficial in data science because it allows for modular, reusable, and maintainable code. By encapsulating data and functions that operate on the data into objects, OOP helps in organizing complex data workflows and creating scalable data science solutions.
Key Concepts of OOP:
Class: A blueprint for creating objects. A class defines a set of attributes and methods that the created objects will have.
Object: An instance of a class. Each object can have unique attribute values, even if they share the same class.
Encapsulation: Bundling of data (attributes) and methods (functions) that operate on the data into a single unit (class).
Inheritance: A mechanism where one class can inherit attributes and methods from another class.
Polymorphism: The ability to use a single interface to represent different underlying data types.
Abstraction: Hiding the complex implementation details and showing only the necessary features of an object.
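The inheritance and polymorphism concepts above can be sketched in a few lines (the class names here are illustrative, not from any library):

```python
class Model:
    """Base class defining the shared interface."""
    def predict(self, x):
        raise NotImplementedError

class MeanModel(Model):  # inheritance: MeanModel is a Model
    def __init__(self, mean):
        self.mean = mean  # encapsulated state

    def predict(self, x):  # overriding the inherited method
        return self.mean

class IdentityModel(Model):
    def predict(self, x):
        return x

# Polymorphism: one interface, different underlying behavior
for model in (MeanModel(5), IdentityModel()):
    print(model.predict(10))  # prints 5, then 10
```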
6.1.1 What is Object-Oriented Programming?
Object-Oriented Programming is a way of designing software by defining data structures as objects that can contain both data and functions. These objects interact with one another to perform tasks and solve problems. OOP concepts are particularly useful in data science for creating custom data structures, implementing machine learning models, and managing data pipelines.
6.1.2 Advantages of OOP
Modularity: The source code for an object can be written and maintained independently of the source code for other objects.
Reusability: Once an object is created, it can be reused in different programs.
Scalability: OOP makes it easier to manage and scale large codebases.
Maintainability: Code is easier to maintain and modify over time.
6.1.3 OOP vs. Procedural Programming
Procedural programming focuses on functions, or procedures, that perform operations on data. In contrast, OOP focuses on objects that encapsulate data and the functions that operate on the data.
Procedural Programming Example:
# Procedural approach to calculating the area of a rectangle
def calculate_area(length, width):
    return length * width

length = 5
width = 3
area = calculate_area(length, width)
print(f"The area of the rectangle is {area}")
Object-Oriented Programming Example:
# OOP approach to calculating the area of a rectangle
class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def calculate_area(self):
        return self.length * self.width

rect = Rectangle(5, 3)
area = rect.calculate_area()
print(f"The area of the rectangle is {area}")
In the OOP example, the Rectangle class encapsulates both the data (length and width) and the behavior (calculate_area) in one place. This encapsulation makes the code more modular and easier to manage.
6.1.4 Practical Example in Data Science
Let’s consider a practical example in data science where we create a class to represent a dataset and perform basic operations on it.
import pandas as pd

class DataSet:
    def __init__(self, data):
        self.data = pd.DataFrame(data)

    def get_summary(self):
        return self.data.describe()

    def add_column(self, column_name, data):
        self.data[column_name] = data

    def get_column(self, column_name):
        return self.data[column_name]

# Creating a dataset instance
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1]
}
dataset = DataSet(data)
print(dataset.get_summary())

# Adding a new column
dataset.add_column('C', [10, 20, 30, 40, 50])
print(dataset.get_column('C'))
In this example, the DataSet class encapsulates the data in a Pandas DataFrame and provides methods to get a summary of the data, add a new column, and retrieve a specific column. This approach makes the code more organized and reusable, highlighting the advantages of OOP in data science.
6.2 Classes and Objects
In Object-Oriented Programming (OOP), a class is a blueprint for creating objects (instances of the class). A class defines a set of attributes and methods that the created objects will have. Understanding classes and objects is fundamental to leveraging OOP in your data science projects.
6.2.1 Understanding Classes
A class is a template for creating objects. It defines a set of attributes that will characterize any object created from the class and the methods that can be performed on these objects.
Basic Structure of a Class:
class MyClass:
    # Class attribute
    class_variable = "I am a class variable"

    # Constructor
    def __init__(self, instance_variable):
        self.instance_variable = instance_variable

    # Instance method
    def display(self):
        print(f"Class Variable: {self.class_variable}")
        print(f"Instance Variable: {self.instance_variable}")

# Creating an instance of the class
obj = MyClass("I am an instance variable")
obj.display()
In this example, MyClass has a class attribute class_variable, an instance attribute instance_variable, and an instance method display.
6.2.2 Creating Objects
An object is an instance of a class. When a class is defined, no memory is allocated until an object of that class is created. Each object can have different attribute values, even if they share the same class.
Creating and Using Objects:
class DataPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def display_point(self):
        print(f"Point({self.x}, {self.y})")

# Creating objects
point1 = DataPoint(1, 2)
point2 = DataPoint(3, 4)

point1.display_point()  # Output: Point(1, 2)
point2.display_point()  # Output: Point(3, 4)
In this example, DataPoint is a class with attributes x and y and a method display_point. We create two instances of DataPoint, each with different values for x and y.
6.2.3 The self Parameter
In Python, the self parameter is a reference to the current instance of the class. It is used to access attributes and methods that belong to the instance. It must be the first parameter of any instance method in the class.
Using self:
class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def display_employee(self):
        print(f"Name: {self.name}, Salary: {self.salary}")

# Creating an instance
emp1 = Employee("John", 50000)
emp1.display_employee()  # Output: Name: John, Salary: 50000
Here, self.name and self.salary refer to the name and salary attributes of the instance emp1.
6.2.4 Real-world Examples in Data Science
Let’s create a class to represent a simple linear regression model.
Simple Linear Regression Class:
import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, X, y):
        X_mean = np.mean(X)
        y_mean = np.mean(y)
        self.coefficient = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
        self.intercept = y_mean - self.coefficient * X_mean

    def predict(self, X):
        return self.coefficient * X + self.intercept

# Sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 6, 5])

# Creating an instance and fitting the model
model = SimpleLinearRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(X)
print("Predictions:", predictions)
In this example:
- The SimpleLinearRegression class encapsulates the linear regression logic.
- The fit method calculates the coefficient and intercept based on the input data.
- The predict method uses the fitted model to make predictions on new data.
This approach makes the linear regression model reusable and easy to integrate into larger data science workflows.
6.3 Attributes and Methods
In Object-Oriented Programming (OOP), attributes and methods are key components of classes. Attributes are variables that hold data, while methods are functions that operate on this data. Understanding how to define and use attributes and methods is crucial for building effective data science applications using OOP.
6.3.1 Instance Attributes
Instance attributes are variables that hold data specific to an instance of a class. They are defined within the __init__
method and are prefixed with self
to indicate that they belong to the instance.
Defining and Using Instance Attributes:
class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

    def display_student(self):
        print(f"Name: {self.name}, Grade: {self.grade}")

# Creating instances
student1 = Student("Alice", "A")
student2 = Student("Bob", "B")

student1.display_student()  # Output: Name: Alice, Grade: A
student2.display_student()  # Output: Name: Bob, Grade: B
In this example, name and grade are instance attributes that store data specific to each Student instance.
6.3.1.1 Class Attributes
Class attributes are variables that are shared across all instances of a class. They are defined directly within the class body, outside any methods.
Defining and Using Class Attributes:
class School:
    school_name = "Greenwood High"  # Class attribute

    def __init__(self, student_name):
        self.student_name = student_name  # Instance attribute

    def display_student(self):
        print(f"Student: {self.student_name}, School: {School.school_name}")

# Creating instances
student1 = School("Alice")
student2 = School("Bob")

student1.display_student()  # Output: Student: Alice, School: Greenwood High
student2.display_student()  # Output: Student: Bob, School: Greenwood High
In this example, school_name is a class attribute shared by all instances of the School class, while student_name is an instance attribute unique to each instance.
6.3.1.2 Instance Methods
Instance methods are functions defined within a class that operate on instance attributes. They must include self as their first parameter to access instance attributes and other methods.
Defining and Using Instance Methods:
class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width

    def calculate_area(self):
        return self.length * self.width

    def display_area(self):
        print(f"Area: {self.calculate_area()}")

# Creating an instance
rect = Rectangle(5, 3)
rect.display_area()  # Output: Area: 15
In this example, calculate_area and display_area are instance methods that operate on the length and width attributes of the Rectangle instance.
6.3.1.3 Class Methods and Static Methods
Class methods and static methods are two special types of methods in Python classes. Class methods operate on class attributes, while static methods are utility methods that do not operate on instance or class attributes.
Class Methods:
Class methods are defined using the @classmethod decorator and take cls (the class itself) as the first parameter.
class Circle:
    pi = 3.14159  # Class attribute

    def __init__(self, radius):
        self.radius = radius

    @classmethod
    def from_diameter(cls, diameter):
        radius = diameter / 2
        return cls(radius)

    def calculate_area(self):
        return Circle.pi * (self.radius ** 2)

# Creating an instance using the class method
circle = Circle.from_diameter(10)
print(f"Radius: {circle.radius}, Area: {circle.calculate_area()}")  # Output: Radius: 5.0, Area: 78.53975
Static Methods:
Static methods are defined using the @staticmethod decorator and do not take self or cls as a parameter.
class MathUtils:
    @staticmethod
    def add(a, b):
        return a + b

# Using the static method
result = MathUtils.add(5, 3)
print(f"Result: {result}")  # Output: Result: 8
In this example, add is a static method that performs addition without accessing any class or instance attributes.
6.3.2 Practical Examples with Data Science Models
Let’s create a class to represent a simple linear regression model with attributes and methods tailored to data science.
Linear Regression Class with Attributes and Methods:
import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        self.coefficient = None
        self.intercept = None

    def fit(self, X, y):
        X_mean = np.mean(X)
        y_mean = np.mean(y)
        self.coefficient = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
        self.intercept = y_mean - self.coefficient * X_mean

    def predict(self, X):
        return self.coefficient * X + self.intercept

    def score(self, X, y):
        y_pred = self.predict(X)
        ss_total = np.sum((y - np.mean(y)) ** 2)
        ss_residual = np.sum((y - y_pred) ** 2)
        r2_score = 1 - (ss_residual / ss_total)
        return r2_score

# Sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 5, 6, 5])

# Creating an instance and fitting the model
model = SimpleLinearRegression()
model.fit(X, y)

# Making predictions
predictions = model.predict(X)
print("Predictions:", predictions)

# Calculating R-squared score
r2 = model.score(X, y)
print("R-squared score:", r2)
In this example:
- The SimpleLinearRegression class encapsulates the logic for fitting a linear regression model, making predictions, and calculating the R-squared score.
- The fit method calculates the coefficient and intercept.
- The predict method uses the fitted model to make predictions on new data.
- The score method calculates the R-squared score to evaluate the model's performance.
This approach demonstrates the power of using attributes and methods in a class to organize and encapsulate the functionality of a data science model.
6.3.3 Dynamic Variables
Dynamic variables are typically instance variables in the context of classes. They are created and managed at runtime, usually within the methods of a class. Each instance (or object) of a class can have different values for these variables.
6.3.3.1 Characteristics of Dynamic Variables:
- Instance-specific: Each instance of a class can have its own unique values for these variables.
- Defined within methods: Typically created and accessed using the
self
keyword within instance methods. - Dynamic in nature: Can be added, modified, or deleted at runtime.
Example:
class DynamicExample:
    def __init__(self, value):
        self.dynamic_variable = value

    def update_value(self, new_value):
        self.dynamic_variable = new_value

# Creating instances
obj1 = DynamicExample(10)
obj2 = DynamicExample(20)

print(obj1.dynamic_variable)  # Output: 10
print(obj2.dynamic_variable)  # Output: 20

# Updating dynamic variable
obj1.update_value(30)
print(obj1.dynamic_variable)  # Output: 30
6.3.4 Static Variables
Static variables, also known as class variables, are shared across all instances of a class. They are defined at the class level and are not tied to any specific instance. These variables are accessed using the class name or any instance.
6.3.4.1 Characteristics of Static Variables:
- Class-wide: Shared among all instances of the class.
- Defined within the class but outside any methods: Typically declared directly within the class body.
- Consistent across instances: Changing the value affects all instances.
Example:
class StaticExample:
    static_variable = 42  # This is a static (class) variable

    def __init__(self, value):
        self.instance_variable = value  # This is a dynamic (instance) variable

# Creating instances
obj1 = StaticExample(10)
obj2 = StaticExample(20)

print(obj1.static_variable)  # Output: 42
print(obj2.static_variable)  # Output: 42

# Updating the static variable through the class affects all instances
StaticExample.static_variable = 100
print(obj1.static_variable)  # Output: 100
print(obj2.static_variable)  # Output: 100

# Assigning through an instance creates an instance attribute on obj1
# that shadows the class variable; obj2 and the class are unaffected
obj1.static_variable = 200
print(obj1.static_variable)  # Output: 200
print(obj2.static_variable)  # Output: 100
print(StaticExample.static_variable)  # Output: 100
In this example:
- static_variable is a class variable (static), shared across all instances.
- instance_variable is an instance variable (dynamic), unique to each instance.
6.4 Pandas Exercises
6.4.1 Question
Write a Pandas program to create and display a one-dimensional array-like object (a Pandas Series) containing an array of data.
import pandas as pd
# Create a Pandas Series
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
# Display the Series
print(series)
## 0 10
## 1 20
## 2 30
## 3 40
## 4 50
## dtype: int64
6.4.2 Question
Write a Pandas program to convert a Pandas Series to a Python list and print its type.
import pandas as pd
data = pd.Series([1, 2, 3, 4, 5])
# Convert series to python list
data_list = list(data) ### general solution
data_list = data.to_list() ### pandas optimized
print(type(data_list))
## <class 'list'>
6.4.3 Question
Write a Pandas program to add, subtract, multiply and divide two Pandas Series.
import pandas as pd

series_a = pd.Series([2, 4, 6, 8, 10])
series_b = pd.Series([1, 3, 5, 7, 9])

# addition
print('addition\n', series_a + series_b)
# subtraction
print('\nsubtraction\n', series_a - series_b)
# multiplication
print('\nmultiplication\n', series_a * series_b)
# division
print('\ndivision\n', series_a / series_b)
## addition
## 0 3
## 1 7
## 2 11
## 3 15
## 4 19
## dtype: int64
##
## subtraction
## 0 1
## 1 1
## 2 1
## 3 1
## 4 1
## dtype: int64
##
## multiplication
## 0 2
## 1 12
## 2 30
## 3 56
## 4 90
## dtype: int64
##
## division
## 0 2.000000
## 1 1.333333
## 2 1.200000
## 3 1.142857
## 4 1.111111
## dtype: float64
6.4.4 Question
Write a Pandas program to compare the elements of the two Pandas Series.
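The expected output below can be reproduced with element-wise comparison operators. The input values here are an assumption inferred from the output:

```python
import pandas as pd

# Assumed inputs (inferred from the expected output below)
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 10])

print(ds1 == ds2)  # element-wise equality
print(ds1 > ds2)   # element-wise greater-than
print(ds1 < ds2)   # element-wise less-than
print(ds1 != ds2)  # element-wise inequality
```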
## 0 False
## 1 False
## 2 False
## 3 False
## 4 True
## dtype: bool
## 0 True
## 1 True
## 2 True
## 3 True
## 4 False
## dtype: bool
## 0 False
## 1 False
## 2 False
## 3 False
## 4 False
## dtype: bool
## 0 True
## 1 True
## 2 True
## 3 True
## 4 False
## dtype: bool
6.4.5 Question
Write a Pandas program to convert a dictionary to a Pandas series.
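One solution consistent with the two output blocks below; the dictionary contents are an assumption inferred from the output:

```python
import pandas as pd

# Hypothetical input dictionary (inferred from the expected output below)
d = {'a': 1, 'b': 2, 'c': 3}

# Converting only the values: the keys are lost and the index becomes 0, 1, 2
s_values = pd.Series(list(d.values()))
print(s_values)

# Converting the dictionary directly: the keys become the index
s_dict = pd.Series(d)
print(s_dict)
```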
## 0 1
## 1 2
## 2 3
## dtype: int64
## a 1
## b 2
## c 3
## dtype: int64
6.4.6 sorting
sort_values is a method in the pandas library used to sort the values in a DataFrame or Series.
Purpose: sort_values sorts a DataFrame by one or more columns, or a Series by its values.
Syntax: DataFrame.sort_values(by, ascending=True, inplace=False, na_position='last')
6.4.6.1 Examples:
Sample dataframe
import pandas as pd

data = {
    'NAME': ['David', 'Alice', 'Charlie', 'Bob'],
    'AGE': [25, 30, 40, 35],
    'SALARY': [50000, 600000, 55000, 700000]
}
df = pd.DataFrame(data)
df
## NAME AGE SALARY
## 0 David 25 50000
## 1 Alice 30 600000
## 2 Charlie 40 55000
## 3 Bob 35 700000
- Sort by a Single Column:
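A sketch consistent with the output below, assuming the sort key is the NAME column (the original code was not shown):

```python
import pandas as pd

# Sample dataframe from above
data = {
    'NAME': ['David', 'Alice', 'Charlie', 'Bob'],
    'AGE': [25, 30, 40, 35],
    'SALARY': [50000, 600000, 55000, 700000]
}
df = pd.DataFrame(data)

# Sort by the 'NAME' column (ascending by default)
sorted_df = df.sort_values(by='NAME')
print(sorted_df)
```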
## NAME AGE SALARY
## 1 Alice 30 600000
## 3 Bob 35 700000
## 2 Charlie 40 55000
## 0 David 25 50000
- Sort by Multiple Columns:
# Sort by 'AGE' in ascending order, then by 'SALARY' in descending order
sorted_df = df.sort_values(by=['AGE', 'SALARY'], ascending=[True, False])
print(sorted_df)
## NAME AGE SALARY
## 0 David 25 50000
## 1 Alice 30 600000
## 3 Bob 35 700000
## 2 Charlie 40 55000
- Handling NaN Values:
data = {
    'NAME': ['Alice', 'Bob', 'Charlie', 'David'],
    'AGE': [25, 30, None, 40]
}
df = pd.DataFrame(data)

# Sort by 'AGE' and place NaN values at the start
sorted_df = df.sort_values(by='AGE', na_position='first')
print(sorted_df)
## NAME AGE
## 2 Charlie NaN
## 0 Alice 25.0
## 1 Bob 30.0
## 3 David 40.0
6.4.7 groupby operation
The groupby operation in pandas is a powerful tool for aggregating and transforming data. It allows you to split your DataFrame into groups based on one or more columns, apply functions to each group, and then combine the results back into a DataFrame or Series.
import pandas as pd

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 75000, 65000],
    'bonus': [5000, 7000, 4000, 6000, 9000, 8500, 7500, 6500]
}
df_company = pd.DataFrame(data)
df_company
## department employee salary bonus
## 0 Sales Alice 70000 5000
## 1 Sales Bob 80000 7000
## 2 HR Charlie 50000 4000
## 3 HR David 60000 6000
## 4 IT Eve 90000 9000
## 5 IT Frank 85000 8500
## 6 Finance Grace 75000 7500
## 7 Finance Hannah 65000 6500
6.4.7.1 Group by a Single Column
Suppose you want to find the average salary by department.
The result is a Series indexed by department.
# Group by 'department' and calculate the mean salary
grouped_df = df_company.groupby('department')['salary'].mean()
print(grouped_df)
print(type(grouped_df))
## department
## Finance 70000.0
## HR 55000.0
## IT 87500.0
## Sales 75000.0
## Name: salary, dtype: float64
## <class 'pandas.core.series.Series'>
# Convert the resulting Series into a DataFrame
grouped_df = grouped_df.reset_index()
# Rename the column for clarity
grouped_df.columns = ['department', 'average_salary']
grouped_df
## department average_salary
## 0 Finance 70000.0
## 1 HR 55000.0
## 2 IT 87500.0
## 3 Sales 75000.0
6.4.7.2 Group by Multiple Columns:
You can group by multiple columns.
For example, find the total compensation (salary + bonus) for each employee in each department.
# Group by 'department' and 'employee', and calculate total compensation
df_company['total_compensation'] = df_company['salary'] + df_company['bonus']
grouped_df = df_company.groupby(['department', 'employee'])['total_compensation'].sum()
print(grouped_df)
print(type(grouped_df))
## department employee
## Finance Grace 82500
## Hannah 71500
## HR Charlie 54000
## David 66000
## IT Eve 99000
## Frank 93500
## Sales Alice 75000
## Bob 87000
## Name: total_compensation, dtype: int64
## <class 'pandas.core.series.Series'>
6.4.7.3 Aggregation Functions:
You can use multiple aggregation functions on the grouped data.
For example, find the sum and mean of the salary and bonus for each department.
# Group by 'department' and calculate sum and mean of 'salary' and 'bonus'
agg_df = df_company.groupby('department').agg({
    'salary': ['sum', 'mean'],
    'bonus': ['sum', 'mean']
})
print(agg_df)
print(type(agg_df))
## salary bonus
## sum mean sum mean
## department
## Finance 140000 70000.0 14000 7000.0
## HR 110000 55000.0 10000 5000.0
## IT 175000 87500.0 17500 8750.0
## Sales 150000 75000.0 12000 6000.0
## <class 'pandas.core.frame.DataFrame'>
6.4.7.4 groupby and unnested data
agg_df = df_company.groupby('department').agg(
    total_salary=pd.NamedAgg(column='salary', aggfunc='sum'),
    avg_salary=pd.NamedAgg(column='salary', aggfunc='mean'),
    total_bonus=pd.NamedAgg(column='bonus', aggfunc='sum'),
    avg_bonus=pd.NamedAgg(column='bonus', aggfunc='mean')
)
print(agg_df)
## total_salary avg_salary total_bonus avg_bonus
## department
## Finance 140000 70000.0 14000 7000.0
## HR 110000 55000.0 10000 5000.0
## IT 175000 87500.0 17500 8750.0
## Sales 150000 75000.0 12000 6000.0
agg_df = df_company.groupby('department').agg({
    'salary': 'sum', 'bonus': 'sum'
}).rename(columns={
    'salary': 'total_salary',
    'bonus': 'total_bonus'
})
agg_df['avg_salary'] = df_company.groupby('department')['salary'].mean()
agg_df['avg_bonus'] = df_company.groupby('department')['bonus'].mean()
print(agg_df)
## total_salary total_bonus avg_salary avg_bonus
## department
## Finance 140000 14000 70000.0 7000.0
## HR 110000 10000 55000.0 5000.0
## IT 175000 17500 87500.0 8750.0
## Sales 150000 12000 75000.0 6000.0
6.4.7.5 Transformation
data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)

# Group by 'department' and calculate sum of 'salary' with transform
df['total_salary'] = df.groupby('department')['salary'].transform('sum')
print(df)
## department employee salary total_salary
## 0 Sales Alice 70000 150000
## 1 Sales Bob 80000 150000
## 2 HR Charlie 50000 110000
## 3 HR David 60000 110000
## 4 IT Eve 90000 175000
## 5 IT Frank 85000 175000
## 6 Finance Grace 95000 160000
## 7 Finance Hannah 65000 160000
# Keep only the rows with salary above 80000
df[df['salary'] > 80000]
## department employee salary total_salary
## 4 IT Eve 90000 175000
## 5 IT Frank 85000 175000
## 6 Finance Grace 95000 160000
6.4.7.6 Example
Let's de-mean the salary column within each group defined by the department column.
# Group by 'department' and subtract the group mean from each salary
df2 = df.copy()
df2['demeaned_salary'] = df2.groupby('department')['salary'].transform(lambda x: x - x.mean())
print(df2)
## department employee salary total_salary demeaned_salary
## 0 Sales Alice 70000 150000 -5000.0
## 1 Sales Bob 80000 150000 5000.0
## 2 HR Charlie 50000 110000 -5000.0
## 3 HR David 60000 110000 5000.0
## 4 IT Eve 90000 175000 2500.0
## 5 IT Frank 85000 175000 -2500.0
## 6 Finance Grace 95000 160000 15000.0
## 7 Finance Hannah 65000 160000 -15000.0
6.4.7.7 Filtering Groups:
You can filter out groups that meet a specific condition.
For example, keep only departments with a total salary of more than $150,000.
# Group by 'department' and filter departments with total salary > 150,000
filtered_df = df.groupby('department').filter(lambda x: x['salary'].sum() > 150000)
print(filtered_df)
## department employee salary total_salary
## 4 IT Eve 90000 175000
## 5 IT Frank 85000 175000
## 6 Finance Grace 95000 160000
## 7 Finance Hannah 65000 160000
df2['total_salary'] = df.groupby('department')['salary'].transform('sum')
df2 = df2[df2['total_salary'] > 150000]
df2
## department employee salary total_salary demeaned_salary
## 4 IT Eve 90000 175000 2500.0
## 5 IT Frank 85000 175000 -2500.0
## 6 Finance Grace 95000 160000 15000.0
## 7 Finance Hannah 65000 160000 -15000.0
6.4.7.8 Examples
df2 = df.copy()
df2 = df2.groupby('department').agg({'salary': ['sum', 'mean', 'count', 'min', 'max']})
df2
## salary
## sum mean count min max
## department
## Finance 160000 80000.0 2 65000 95000
## HR 110000 55000.0 2 50000 60000
## IT 175000 87500.0 2 85000 90000
## Sales 150000 75000.0 2 70000 80000
6.4.8 Sample Solutions
These sample solutions are adapted from Interview Query.
6.4.8.1 Question
import pandas as pd

name_list = ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"]
age_list = [19, 20, 21, 20, 23]
color_list = ["red", "yellow", "green", "blue", "green"]
grades = [91, 95, 82, 75, 93]

students = {"name": name_list,
            "age": age_list,
            "favorite_color": color_list,
            "grade": grades}

students_df = pd.DataFrame(students)
students_df
## name age favorite_color grade
## 0 Tim Voss 19 red 91
## 1 Nicole Johnson 20 yellow 95
## 2 Elsa Williams 21 green 82
## 3 John James 20 blue 75
## 4 Catherine Jones 23 green 93
Write a function named grades_colors to select only the rows where the student's favorite color is green or red and their grade is above 90.
def grades_colors(df):
    df = df[(df['favorite_color'].isin(['green', 'red'])) & (df['grade'] > 90)]
    return df

grades_colors(students_df)
## name age favorite_color grade
## 0 Tim Voss 19 red 91
## 4 Catherine Jones 23 green 93
Alternative
## name age favorite_color grade
## 0 Tim Voss 19 red 91
## 4 Catherine Jones 23 green 93
Using the query method
students_df.query("favorite_color in ['green', 'red'] and grade > 90")
## name age favorite_color grade
## 0 Tim Voss 19 red 91
## 4 Catherine Jones 23 green 93
Using loc
students_df.loc[(students_df['favorite_color'].isin(['green', 'red'])) &
                (students_df['grade'] > 90)]
## name age favorite_color grade
## 0 Tim Voss 19 red 91
## 4 Catherine Jones 23 green 93
6.4.8.2 Question
You are given a dataframe with a single column, ‘var’.
Calculate the t-value for the mean of 'var' against a null hypothesis that \(\mu = \mu_0\).
Note: You do not have to calculate the p-value of the test or run the test.
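The statistic computed below is the one-sample t-statistic

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the number of observations.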
var_data = [2, 3, 4, 5, 6, 7, 8, 8, 10]
df = pd.DataFrame({"var": var_data})
mu_0 = 5

def t_score(mu_0, df):
    n = df['var'].count()
    sample_mean = df['var'].mean()
    sample_std = df['var'].std()
    t = (sample_mean - mu_0) / (sample_std / pow(n, 1/2))
    return t

t_score(mu_0, df)
## 1.018055620761245
6.4.9 Question
Given a dataframe with three columns: client_id, ranking, value.
Write a function to fill the NaN values in the value column with the previous non-NaN value from the same client_id, ranked in ascending order.
If there is no previous non-NaN value for that client_id, leave the value as NaN.
client_id = [1001, 1001, 1001, 1002, 1002, 1002, 1003, 1003]
ranking = [1, 2, 3, 1, 2, 3, 1, 2]
value = [1000, pd.NA, 1200, 1500, 1250, pd.NA, 1100, pd.NA]

clients_df = pd.DataFrame({
    'client_id': client_id,
    'ranking': ranking,
    'value': value
})

def previous_nan_values(clients_df):
    clients_df = clients_df.sort_values(by=['client_id', 'ranking'])
    clients_df['value2'] = clients_df.groupby('client_id')['value'].ffill()
    return clients_df

previous_nan_values(clients_df)
## client_id ranking value value2
## 0 1001 1 1000 1000
## 1 1001 2 <NA> 1000
## 2 1001 3 1200 1200
## 3 1002 1 1500 1500
## 4 1002 2 1250 1250
## 5 1002 3 <NA> 1250
## 6 1003 1 1100 1100
## 7 1003 2 <NA> 1100
6.4.10 Data Filtering and Selection:
sample data
data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)
df
## department employee salary
## 0 Sales Alice 70000
## 1 Sales Bob 80000
## 2 HR Charlie 50000
## 3 HR David 60000
## 4 IT Eve 90000
## 5 IT Frank 85000
## 6 Finance Grace 95000
## 7 Finance Hannah 65000
6.4.10.1 filtering rows
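The filtering expressions for the outputs below were not preserved; the boolean-mask sketches here reproduce the first three output blocks (the exact conditions are assumptions inferred from those outputs):

```python
import pandas as pd

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)

# Single condition: one employee
print(df[df['employee'] == 'Eve'])

# Membership test: rows whose department is HR or IT
print(df[df['department'].isin(['HR', 'IT'])])

# Numeric threshold: salaries above 80000
print(df[df['salary'] > 80000])
```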
## department employee salary
## 4 IT Eve 90000
## department employee salary
## 2 HR Charlie 50000
## 3 HR David 60000
## 4 IT Eve 90000
## 5 IT Frank 85000
## department employee salary
## 4 IT Eve 90000
## 5 IT Frank 85000
## 6 Finance Grace 95000
## department employee salary
## 1 Sales Bob 80000
## 5 IT Frank 85000
## department employee salary
## 4 IT Eve 90000
## 5 IT Frank 85000
## 6 Finance Grace 95000
## 7 Finance Hannah 65000
## department employee salary
## 0 Sales Alice 70000
## 1 Sales Bob 80000
## 2 HR Charlie 50000
## 3 HR David 60000
## department employee salary
## 0 Sales Alice 70000
## 1 Sales Bob 80000
## 2 HR Charlie 50000
6.4.10.2 filtering columns
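The column-selection code for the outputs below was not preserved; a sketch consistent with them (the exact expressions are assumptions):

```python
import pandas as pd

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)

# Select a subset of columns by name
subset = df[['employee', 'salary']]
print(subset)

# Combine column selection with row slicing
print(subset.head(4))

# First three rows, all columns
print(df.head(3))
```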
## employee salary
## 0 Alice 70000
## 1 Bob 80000
## 2 Charlie 50000
## 3 David 60000
## 4 Eve 90000
## 5 Frank 85000
## 6 Grace 95000
## 7 Hannah 65000
## employee salary
## 0 Alice 70000
## 1 Bob 80000
## 2 Charlie 50000
## 3 David 60000
## department employee salary
## 0 Sales Alice 70000
## 1 Sales Bob 80000
## 2 HR Charlie 50000
6.4.11 Aggregation and Grouping:
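The output below matches a groupby mean over the sample data from the previous section (a sketch, since the original code was not shown):

```python
import pandas as pd

data = {
    'department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 95000, 65000],
}
df = pd.DataFrame(data)

# Average salary per department
avg_salary = df.groupby('department')['salary'].mean()
print(avg_salary)
```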
## department
## Finance 80000.0
## HR 55000.0
## IT 87500.0
## Sales 75000.0
## Name: salary, dtype: float64
6.4.13 recoding
Sample Data
data = {
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 75000, 65000]
}
df = pd.DataFrame(data)
df.head()
## employee salary
## 0 Alice 70000
## 1 Bob 80000
## 2 Charlie 50000
## 3 David 60000
## 4 Eve 90000
6.4.13.1 np.where() method
import numpy as np

# this condition creates a series of booleans
condition = df['salary'] > 85000
df['label'] = np.where(condition, 'high', 'low')
df.head()
## employee salary label
## 0 Alice 70000 low
## 1 Bob 80000 low
## 2 Charlie 50000 low
## 3 David 60000 low
## 4 Eve 90000 high
condition = df['employee'] == 'David'
df['employee'] = np.where(condition, 'Davut', df['employee'])
df.head()
## employee salary label
## 0 Alice 70000 low
## 1 Bob 80000 low
## 2 Charlie 50000 low
## 3 Davut 60000 low
## 4 Eve 90000 high
6.4.13.2 multiple np.where()
condition1 = df['salary'] <= 60000
condition2 = (df['salary'] > 60000) & (df['salary'] <= 80000)
df['label'] = np.where(condition1, 'Low', 'High')
df['label'] = np.where(condition2, 'Medium', df['label'])
df.head()
## employee salary label
## 0 Alice 70000 Medium
## 1 Bob 80000 Medium
## 2 Charlie 50000 Low
## 3 Davut 60000 Low
## 4 Eve 90000 High
# Use np.where for multiple conditions
condition_1 = df['salary'] <= 60000
condition_2 = df['salary'] <= 80000
df['label2'] = np.where(condition_1, 'Low',
                        np.where(condition_2, 'Medium', 'High'))
print(df)
## employee salary label label2
## 0 Alice 70000 Medium Medium
## 1 Bob 80000 Medium Medium
## 2 Charlie 50000 Low Low
## 3 Davut 60000 Low Low
## 4 Eve 90000 High High
## 5 Frank 85000 High High
## 6 Grace 75000 Medium Medium
## 7 Hannah 65000 Medium Medium
6.4.13.3 pd.cut() method
# Define thresholds for 'low', 'medium', and 'high' earners
bins = [0, 60000, 80000, float('inf')]
labels = ['Low', 'Medium', 'High']
# Use pd.cut to create a new column 'category'
df['category'] = pd.cut(df['salary'], bins=bins, labels=labels)
df.head()
## employee salary label label2 category
## 0 Alice 70000 Medium Medium Medium
## 1 Bob 80000 Medium Medium Medium
## 2 Charlie 50000 Low Low Low
## 3 Davut 60000 Low Low Low
## 4 Eve 90000 High High High
6.4.13.4 Alternative to np.where()
6.4.13.4.2 Using DataFrame.assign with np.where:
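A sketch of DataFrame.assign combined with np.where, consistent with the high_low column in the output below (the 85000 threshold and the column name are assumptions inferred from that output):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie', 'Davut', 'Eve'],
    'salary': [70000, 80000, 50000, 60000, 90000]
})

# assign returns a new DataFrame with the extra column added
df = df.assign(high_low=np.where(df['salary'] > 85000, 'high', 'low'))
print(df)
```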
## employee salary label label2 category high_low high_low2
## 0 Alice 70000 Medium Medium Medium low low
## 1 Bob 80000 Medium Medium Medium low low
## 2 Charlie 50000 Low Low Low low low
## 3 Davut 60000 Low Low Low low low
## 4 Eve 90000 High High High high high
6.4.13.5 np.select() method
data = {
    'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Hannah'],
    'salary': [70000, 80000, 50000, 60000, 90000, 85000, 75000, 65000]
}
df = pd.DataFrame(data)

# Define conditions for categorizing salaries
conditions = [
    (df['salary'] <= 60000),
    (df['salary'] > 60000) & (df['salary'] <= 80000),
    (df['salary'] > 80000)
]

# Corresponding choices for each condition
choices = ['Low', 'Medium', 'High']

# Use np.select to create a new column 'category2'
df['category2'] = np.select(conditions, choices)
df.head()
## employee salary category2
## 0 Alice 70000 Medium
## 1 Bob 80000 Medium
## 2 Charlie 50000 Low
## 3 David 60000 Low
## 4 Eve 90000 High
6.4.13.6 apply() method
# Custom function to categorize salary
def categorize_salary(salary):
    if salary <= 60000:
        return 'Low'
    elif 60000 < salary <= 80000:
        return 'Medium'
    else:
        return 'High'

# Apply the custom function to create a new column 'category3'
df['category3'] = df['salary'].apply(categorize_salary)
df.head()
## employee salary category2 category3
## 0 Alice 70000 Medium Medium
## 1 Bob 80000 Medium Medium
## 2 Charlie 50000 Low Low
## 3 David 60000 Low Low
## 4 Eve 90000 High High