Chapter 4 Pandas Library
run pip3 install pandas
or
run !pip install pandas
on rstudio terminal or mac terminal or jupyter notebook
4.1 Data Frames
4.1.1 Data frame creation
Sometimes you may need to create a data frame
## Empty DataFrame
## Columns: []
## Index: []
## Empty DataFrame
## Columns: [A]
## Index: []
Is df2 empty?
There is column name but it is still empty.
## True
How to create a sample data frame?
From a dictionary
my_dict = {'Col_1': [1,2,3,4],
'Col_2': ['a', 'b', 'c', 'd'],
'Col_3': 1984}
my_df = pd.DataFrame(my_dict)
my_df[:5]
## Col_1 Col_2 Col_3
## 0 1 a 1984
## 1 2 b 1984
## 2 3 c 1984
## 3 4 d 1984
a_list = [1,2,3,4]
b_list = ['a', 'b', 'c', 'd']
##
df3 = pd.DataFrame({'var1': a_list,
'var2': b_list})
df3
## var1 var2
## 0 1 a
## 1 2 b
## 2 3 c
## 3 4 d
4.1.2 Read csv files
## [Errno 2] No such file or directory: 'data/college.csv'
## name 'df_csv' is not defined
4.1.3 Data attributes
4.1.3.2 Columns
How to get the column names?
–> Column names are stored in columns
attribute.
## name 'df_csv' is not defined
## name 'df_csv' is not defined
## name 'df_csv' is not defined
## name 'df_csv' is not defined
4.1.3.4 index of the data frame
## name 'df_csv' is not defined
## name 'df_csv' is not defined
## name 'df_csv' is not defined
4.1.3.5 assign a new index
## name 'df_csv' is not defined
## name 'df' is not defined
## name 'df' is not defined
4.1.3.6 index to column
## name 'df_csv' is not defined
## name 'df' is not defined
## name 'df' is not defined
- calling
keys()
function: output is similar to columns attributes
## name 'df_csv' is not defined
column.values
method returns an array of index.
## name 'df_csv' is not defined
Using tolist()
method with values with given the list of columns.
## name 'df_csv' is not defined
Using sorted()
method : sorted()
method will return the list of columns sorted in alphabetical order.
## ['var1', 'var2']
4.1.4 Create or add a new column
You can create a new column in a pandas DataFrame based on the values of existing columns using various methods, such as assignment, the apply()
function, or vectorized operations.
Here are some examples:
4.1.6 rename/assign new column names
Renaming or assigning new column names in pandas DataFrame is a common operation, and there are several ways to accomplish it. Here are some examples with explanations:
4.1.6.1 Using the rename() method:
The rename()
method allows you to rename columns by specifying a dictionary where keys are the current column names and values are the new column names.
## name 'df_csv' is not defined
## name 'df' is not defined
## name 'df' is not defined
Setting inplace=True will modify the DataFrame in place, otherwise a new DataFrame with the renamed columns will be returned.
4.1.6.2 Direct assignment with dictionary:
You can directly assign new column names to the columns attribute of the DataFrame using a dictionary.
Note: you need to specify all the column names in the dictionary.
## name 'df_csv' is not defined
## name 'df' is not defined
## name 'df' is not defined
4.1.6.3 Using list assignment:
You can directly assign a list of column names to the columns attribute of the DataFrame.
Note: Make sure the length of the list matches the number of columns in the DataFrame.
## name 'df_csv' is not defined
## name 'df' is not defined
## name 'df' is not defined
4.1.6.4 Assigning new column names during DataFrame creation:
You can provide the columns parameter when creating a DataFrame to specify the column names.
df = pd.DataFrame(data, columns=['column1', 'column2', 'column3'])
4.1.6.5 to uppercase all column names
## name 'df_csv' is not defined
## name 'df' is not defined
## name 'df' is not defined
4.1.6.6 rename
Renaming column names is a standard procedure most of the time. We may need to standardize column names, mostly make them clean.
There is a nice link here.
Lets create a sample data frame
df = pd.DataFrame({'A?la': [1,2,3,4],
'PYTHON': ['a', 'a', 'a', 'a'],
'col 3': ['1','1','b','b']})
df
## A?la PYTHON col 3
## 0 1 a 1
## 1 2 a 1
## 2 3 a b
## 3 4 a b
Method 1: Rename Specific Columns
## A?la Python col 3
## 0 1 a 1
## 1 2 a 1
## 2 3 a b
## 3 4 a b
## ['A?la', 'PYTHON', 'col 3']
Method 2: Rename All Columns
## ['new_name1', 'new_name2', 'new_name3']
Method 3: Replace Specific Characters in Columns
### set problematic names
problem_names = ['VAR1$$$', 'var2','Var3###']
df.columns = problem_names
### removing bad characters
df.columns = df.columns.str.replace('$', '', regex = True)
df.columns = df.columns.str.replace('#', '', regex = True)
list(df)
## ['VAR1', 'var2', 'Var3']
clean column names
Generally we expect clean columns to be: 1. short 2. meaningful 3. no space 4. no special character 5. probably lowercase
### set problematic names
problem_names = ['VAR $1', 'var #2','Var ? 3 ']
df.columns = problem_names
## Column names: remove white spaces and convert to lower case
df.columns= df.columns.str.strip()
## Column names: convert to lower case
df.columns = df.columns.str.lower()
## Column names: convert to lower case
df = df.rename(columns = str.lower)
## removing bad characters
df.columns = df.columns.str.replace('$', '', regex = True)
df.columns = df.columns.str.replace('#', '', regex = True)
df.columns = df.columns.str.replace('?', '', regex = True)
df.columns = df.columns.str.replace(' ', '', regex = True)
list(df)
## ['var1', 'var2', 'var3']
- sometimes we need them to be uppercase
## ['VAR1', 'VAR2', 'VAR3']
4.1.7 Subsetting
4.1.7.2 Label based: df.loc[row_label, column_label]
Selecting Rows and Columns by Labels:
You can use df.loc[] to select rows and columns of a DataFrame using labels. The syntax is df.loc[row_label, column_label].
You can specify a single label or a list of labels for both rows and columns.
row label is the index.
## name 'df_csv' is not defined
## name 'df_csv' is not defined
## name 'df_csv' is not defined
## name 'df_csv' is not defined
## name 'df_csv' is not defined
4.1.8 copy
When you use loc/iloc to create a subset of the DataFrame, you are also creating a view into the original DataFrame. Modifications to the subset will reflect in the original DataFrame, and vice versa.
subset = df.loc['row_label1':'row_label5', 'column_labelA':'column_labelB']
subset = df.iloc[0:5, 1:3] # Select rows 0 to 4 and columns 1 to 2
To create an independent copy of the DataFrame, you can use the copy() method. This ensures that modifications to the copied DataFrame do not affect the original DataFrame.
4.2 Recoding Columns in a Pandas DataFrame
Data manipulation is a crucial part of any data science project. One common task is recoding columns in a DataFrame, which involves transforming the values in a column based on certain rules or mapping schemes. This can be necessary for various reasons, such as preparing data for analysis, converting categorical variables, or cleaning data. In this section, we will explore different methods to recode columns in a Pandas DataFrame.
4.2.1 Using map
and replace
The map
and replace
methods are convenient for simple value transformations and mappings.
Using map
for Recoding:
The map
method is useful for replacing values in a Series (a single column in a DataFrame) according to a mapping dictionary.
import pandas as pd
# Sample DataFrame
data = {
'Category': ['A', 'B', 'C', 'A', 'B', 'C']
}
df = pd.DataFrame(data)
# Mapping dictionary
category_map = {
'A': 'Alpha',
'B': 'Beta',
'C': 'Gamma'
}
# Recoding using map
df['Category_Recoded'] = df['Category'].map(category_map)
print(df)
Output:
Category Category_Recoded
0 A Alpha
1 B Beta
2 C Gamma
3 A Alpha
4 B Beta
5 C Gamma
In this example, we created a mapping dictionary category_map
and used the map
method to create a new column Category_Recoded
with the recoded values.
Using replace
for Recoding:
The replace
method can be used for more flexible replacements, including partial string replacements and handling missing values.
# Sample DataFrame with NaN values
data = {
'Category': ['A', 'B', 'C', 'A', 'B', 'C', None]
}
df = pd.DataFrame(data)
# Recoding using replace
df['Category_Recoded'] = df['Category'].replace({
'A': 'Alpha',
'B': 'Beta',
'C': 'Gamma',
None: 'Unknown'
})
print(df)
Output:
Category Category_Recoded
0 A Alpha
1 B Beta
2 C Gamma
3 A Alpha
4 B Beta
5 C Gamma
6 None Unknown
In this example, we used the replace
method to handle missing values (None
) and recode the Category
column.
4.2.1.1 2. Using apply
for Custom Functions
For more complex recoding logic, you can use the apply
method with a custom function.
# Sample DataFrame
data = {
'Score': [85, 92, 78, 88, 95, 70]
}
df = pd.DataFrame(data)
# Custom function to recode scores
def recode_score(score):
if score >= 90:
return 'A'
elif score >= 80:
return 'B'
elif score >= 70:
return 'C'
else:
return 'D'
# Recoding using apply
df['Grade'] = df['Score'].apply(recode_score)
print(df)
Output:
Score Grade
0 85 B
1 92 A
2 78 C
3 88 B
4 95 A
5 70 C
In this example, we defined a custom function recode_score
that assigns letter grades based on numerical scores. We then used the apply
method to apply this function to the Score
column and create a new Grade
column.
4.2.1.2 3. Using Conditional Logic with np.where
The np.where
function from the NumPy library allows for vectorized conditional operations, which can be very efficient for large DataFrames.
import numpy as np
# Sample DataFrame
data = {
'Age': [15, 22, 45, 35, 63, 25]
}
df = pd.DataFrame(data)
# Recoding using np.where
df['Age_Group'] = np.where(df['Age'] < 18, 'Child',
np.where(df['Age'] < 65, 'Adult', 'Senior'))
print(df)
Output:
Age Age_Group
0 15 Child
1 22 Adult
2 45 Adult
3 35 Adult
4 63 Adult
5 25 Adult
In this example, we used np.where
to classify individuals into age groups based on their age. The first condition checks if the age is less than 18, classifying the individual as a “Child”. If not, the second condition checks if the age is less than 65, classifying the individual as an “Adult”. Otherwise, the individual is classified as a “Senior”.
4.2.2 Summary
Recoding columns in a Pandas DataFrame is a common task in data preparation and transformation. This section covered various methods for recoding:
- Using the map
method for straightforward value replacements.
- Using the replace
method for flexible value replacements.
- Using the apply
method with custom functions for complex recoding logic.
- Using np.where
for efficient conditional recoding.
By mastering these techniques, you can effectively clean and transform your data to prepare it for analysis and modeling.
Let’s explore what a.empty
, a.bool()
, a.item()
, a.any()
, and a.all()
do in Python, particularly when used with pandas Series or DataFrames.
4.2.3 1. a.empty
- Purpose: Checks if a DataFrame or Series is empty.
- Usage:
- Returns
True
if the DataFrame or Series has no elements (i.e., zero rows or zero columns). - Otherwise, it returns
False
.
- Returns
4.2.3.1 Example:
import pandas as pd
# Empty DataFrame
df_empty = pd.DataFrame()
print(df_empty.empty) # Output: True
# Non-empty DataFrame
df_non_empty = pd.DataFrame({'A': [1, 2]})
print(df_non_empty.empty) # Output: False
# Empty Series
s_empty = pd.Series([])
print(s_empty.empty) # Output: True
# Non-empty Series
s_non_empty = pd.Series([1, 2, 3])
print(s_non_empty.empty) # Output: False
4.2.4 2. a.bool()
- Purpose: Returns the boolean value of a Series or DataFrame.
- Usage:
- Can be used only if the Series or DataFrame contains exactly one element.
- If there is more than one element, it raises a
ValueError
.
4.2.5 3. a.item()
- Purpose: Returns the single item from a Series.
- Usage:
- Can be used only if the Series has exactly one element.
- Raises a
ValueError
if there are multiple elements.
4.2.6 4. a.any()
- Purpose: Checks if any element in a Series or DataFrame is
True
. - Usage:
- Returns
True
if at least one element isTrue
; otherwise, returnsFalse
. - Can be applied to both Series and DataFrames.
- Returns
4.2.6.1 Example:
import pandas as pd
# Example Series
s = pd.Series([False, True, False])
print(s.any()) # Output: True
# Example DataFrame
df = pd.DataFrame({'A': [0, 1, 0], 'B': [False, False, False]})
print(df.any()) # Output:
# A True
# B False
# dtype: bool
# To check if any value is True in the entire DataFrame
print(df.any().any()) # Output: True
4.2.7 5. a.all()
- Purpose: Checks if all elements in a Series or DataFrame are
True
. - Usage:
- Returns
True
if all elements areTrue
; otherwise, returnsFalse
. - Can be applied to both Series and DataFrames.
- Returns
4.2.7.1 Example:
import pandas as pd
# Example Series
s = pd.Series([True, True, True])
print(s.all()) # Output: True
# Example DataFrame
df = pd.DataFrame({'A': [1, 1, 1], 'B': [True, True, True]})
print(df.all()) # Output:
# A True
# B True
# dtype: bool
# Checking if all values are True in the entire DataFrame
print(df.all().all()) # Output: True
4.2.8 Summary:
a.empty
: Checks if the DataFrame or Series is empty.a.bool()
: Converts a Series or DataFrame with one element to a boolean.a.item()
: Retrieves a single element from a Series (if it has only one element).a.any()
: Checks if any element isTrue
.a.all()
: Checks if all elements areTrue
.
These functions help in checking conditions or accessing data in pandas objects and are useful in various data analysis tasks.
4.3 apply method
Using apply
in NumPy and Pandas is a common practice to perform element-wise operations or to apply custom functions across data structures.
Here are some examples:
4.3.1 Using apply
in Pandas:
Pandas’ apply
function is very powerful for applying custom functions along either axis (rows or columns) of a DataFrame or across a Series.
4.3.1.1 Applying a Function to a DataFrame Column
Imagine you have a DataFrame df
with a column salary
, and you want to calculate a 10% increase for all salaries.
4.3.1.2 Applying a Function Across Rows
You have a DataFrame with columns x
and y
, and you want to create a new column z
which is the sum of x
and y
.
4.3.2 Using apply
in NumPy:
NumPy’s apply_along_axis
function is used to apply a function along a particular axis of a NumPy array.
4.3.2.1 Applying a Function to Each Row of a 2D NumPy Array
Suppose you have a 2D array and want to calculate the sum of each row.
4.3.3 Key Points to Mention in an Interview:
Pandas’
apply
function is very versatile and is generally used to apply custom functions row-wise or column-wise in a DataFrame or to a Series.NumPy’s
apply_along_axis
is more specific and used for applying a function along a specific axis (rows or columns) of an array.These functions are particularly useful when built-in functions are not sufficient or when custom operations are needed.
While
apply
can be convenient, it’s worth mentioning that it can be slower than vectorized operations in Pandas and NumPy, so it should be used judiciously.
These examples should provide you with a solid foundation to demonstrate your understanding of apply
in NumPy and Pandas during a data science interview.