Chapter 4 Pandas Library

Run pip3 install pandas in a terminal (e.g., the RStudio or macOS terminal), or !pip install pandas inside a Jupyter notebook.

import pandas as pd 

pd.set_option('display.max_columns', None)

4.1 Data Frames

4.1.1 Data frame creation

Sometimes you may need to create a data frame from scratch.

## empty data frame
df1 = pd.DataFrame(dtype = 'float64')
df1
## Empty DataFrame
## Columns: []
## Index: []
df2 = pd.DataFrame({'A' : []})
df2
## Empty DataFrame
## Columns: [A]
## Index: []

Is df2 empty?

It has a column name, but it contains no rows, so it is still empty.

df2.empty
## True

How to create a sample data frame?

From a dictionary

my_dict = {'Col_1': [1,2,3,4], 
           'Col_2': ['a', 'b', 'c', 'd'],
           'Col_3': 1984}
           
my_df = pd.DataFrame(my_dict)

my_df[:5]
##    Col_1 Col_2  Col_3
## 0      1     a   1984
## 1      2     b   1984
## 2      3     c   1984
## 3      4     d   1984
a_list = [1,2,3,4]
b_list = ['a', 'b', 'c', 'd']
##
df3 = pd.DataFrame({'var1': a_list, 
                    'var2': b_list})
df3
##    var1 var2
## 0     1    a
## 1     2    b
## 2     3    c
## 3     4    d

4.1.2 Read csv files

df_csv = pd.read_csv('data/college.csv').iloc[:100, :5]
df_csv[:5]
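If data/college.csv is not available, a small stand-in frame with the columns used later in this chapter (id, name, city, state) keeps the remaining snippets runnable. All values below are invented.

```python
import pandas as pd

# Hypothetical stand-in for data/college.csv (invented values)
df_csv = pd.DataFrame({
    'id':    [100654, 100663, 100690, 100706, 100724],
    'name':  ['College A', 'College B', 'College C', 'College D', 'College E'],
    'city':  ['Normal', 'Birmingham', 'Montgomery', 'Huntsville', 'Montgomery'],
    'state': ['AL', 'AL', 'AL', 'AL', 'AL'],
})
print(df_csv.shape)
```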

4.1.3 Data attributes

4.1.3.1 shape of data frame

df_csv.shape
print("number of rows:", df_csv.shape[0],
      "\nnumber of columns:", df_csv.shape[1])

4.1.3.2 Columns

How to get the column names?

Column names are stored in the columns attribute.

df_csv.columns
list(df_csv.columns)
df_csv.columns.to_list()
for cols in df_csv.columns:
  print(cols)

4.1.3.3 data types in dataframe

df_csv.dtypes
[types for types in df_csv.dtypes]

4.1.3.4 index of the data frame

df_csv[:5]
df_csv.index
for i in df_csv.index:
  if i < 5:
    print(i)

4.1.3.5 assign a new index

df = df_csv.copy()
df.index = df.state
df[:5]

4.1.3.6 index to column

df = df_csv.copy()
df['index_column'] = df.index
df[:5]
  • Calling the keys() function: the output is similar to the columns attribute.
df_csv.keys()
  • The columns.values attribute returns an array of column names.
list(df_csv.columns.values)

Using the tolist() method on values returns the column names as a list (wrapping it in list() again would be redundant):

df_csv.columns.values.tolist()

Using the sorted() function: sorted() returns the list of column names in alphabetical order.

sorted(df3)
## ['var1', 'var2']

4.1.4 Create or add a new column

You can create a new column in a pandas DataFrame based on the values of existing columns using various methods, such as assignment, the apply() function, or vectorized operations.

Here are some examples:
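A minimal sketch of the three approaches on a made-up frame (the column names price, qty, total, label, and discounted are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0, 30.0],
                   'qty':   [2, 5, 3]})

# 1. Direct assignment with a vectorized operation
df['total'] = df['price'] * df['qty']

# 2. apply() with a custom function, row-wise (axis=1)
df['label'] = df.apply(
    lambda row: 'big' if row['total'] > 50 else 'small', axis=1)

# 3. assign() returns a new DataFrame with the added column
df2 = df.assign(discounted=df['total'] * 0.9)
print(df2)
```

Vectorized assignment (method 1) is usually the fastest; apply() is a fallback for logic that is hard to vectorize.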

4.1.5 Explore Data

#recipes.head()
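The recipes frame is not defined in this chapter; as a sketch, these are the usual first-look methods on any DataFrame (the sample data below is invented):

```python
import pandas as pd

df = pd.DataFrame({'Col_1': [1, 2, 3, 4],
                   'Col_2': ['a', 'b', 'c', 'd']})

print(df.head(2))      # first rows
print(df.describe())   # summary statistics for numeric columns
df.info()              # dtypes, non-null counts, memory usage
```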

4.1.6 rename/assign new column names

Renaming or assigning new column names in pandas DataFrame is a common operation, and there are several ways to accomplish it. Here are some examples with explanations:

4.1.6.1 Using the rename() method:

The rename() method allows you to rename columns by specifying a dictionary where keys are the current column names and values are the new column names.

df = df_csv.copy()
df.rename(columns={'id': 'NEW_ID'}, inplace=True)
df.columns.to_list()

Setting inplace=True will modify the DataFrame in place, otherwise a new DataFrame with the renamed columns will be returned.
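A small sketch of the returned-copy behavior on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b']})

# inplace=False (the default) leaves df untouched and returns a new frame
renamed = df.rename(columns={'id': 'NEW_ID'})

print(df.columns.to_list())       # original unchanged
print(renamed.columns.to_list())  # new frame carries the new name
```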

4.1.6.2 Direct assignment with dictionary:

You can directly assign new column names to the columns attribute of the DataFrame using a dictionary.

Note: iterating over a dictionary yields its keys, so the dictionary's keys become the new column names; you need to list every column in the dictionary.

df = df_csv[['name', 'state', 'id']].copy()
df.columns = {'MY_NAME': 'name', 'MY_STATE': 'state', 'id':'id'}
df.columns.to_list()

4.1.6.3 Using list assignment:

You can directly assign a list of column names to the columns attribute of the DataFrame.

Note: Make sure the length of the list matches the number of columns in the DataFrame.

df = df_csv[['name', 'state']].copy()
df.columns = ['MY_NAME', 'MY_STATE']
df.columns.to_list()

4.1.6.4 Assigning new column names during DataFrame creation:

You can provide the columns parameter when creating a DataFrame to specify the column names.

data = {'column1': [1, 2], 'column2': [3, 4], 'column3': [5, 6]}
df = pd.DataFrame(data, columns=['column1', 'column2', 'column3'])

4.1.6.5 to uppercase all column names

df = df_csv.copy()
df.columns = [x.upper() for x in df.columns]
df.columns.to_list()

4.1.6.6 rename

Renaming columns is a routine part of data preparation: we often need to standardize column names and, above all, make them clean.


Let's create a sample data frame:

df = pd.DataFrame({'A?la': [1,2,3,4],
                   'PYTHON': ['a', 'a', 'a', 'a'],
                   'col 3': ['1','1','b','b']})
df
##    A?la PYTHON col 3
## 0     1      a     1
## 1     2      a     1
## 2     3      a     b
## 3     4      a     b

Method 1: Rename Specific Columns

df.rename(columns = {'PYTHON':'Python'}, inplace = False)
##    A?la Python col 3
## 0     1      a     1
## 1     2      a     1
## 2     3      a     b
## 3     4      a     b
### list column names
list(df)
## ['A?la', 'PYTHON', 'col 3']

Method 2: Rename All Columns

new_names = ['new_name1', 'new_name2','new_name3']
df.columns = new_names
list(df)
## ['new_name1', 'new_name2', 'new_name3']

Method 3: Replace Specific Characters in Columns

### set problematic names
problem_names = ['VAR1$$$', 'var2','Var3###']
df.columns = problem_names

### removing bad characters: regex = False treats '$' and '#' literally
### (with regex = True, '$' is an end-of-string anchor and removes nothing)
df.columns = df.columns.str.replace('$', '', regex = False)
df.columns = df.columns.str.replace('#', '', regex = False)
list(df)
## ['VAR1', 'var2', 'Var3']

Clean column names

Generally we expect clean column names to be:

  1. short
  2. meaningful
  3. no spaces
  4. no special characters
  5. probably lowercase

### set problematic names
problem_names = ['VAR  $1', 'var #2','Var ? 3 ']
df.columns = problem_names

## Column names: remove leading/trailing white space
df.columns = df.columns.str.strip()

## Column names: convert to lower case
df.columns = df.columns.str.lower()

## equivalently, lower-casing via rename
df = df.rename(columns = str.lower)

## removing bad characters (regex = False so '$' and '?' are treated
## literally, not as regex metacharacters)
df.columns = df.columns.str.replace('$', '', regex = False)
df.columns = df.columns.str.replace('#', '', regex = False)
df.columns = df.columns.str.replace('?', '', regex = False)
df.columns = df.columns.str.replace(' ', '', regex = False)

list(df)
## ['var1', 'var2', 'var3']
  • sometimes we need them to be uppercase
## Column names: convert to upper case
df.columns = df.columns.str.upper()

list(df)
## ['VAR1', 'VAR2', 'VAR3']

4.1.7 Subsetting

4.1.7.1 by columns

selected_cols = ['id', 'name', 'city']

df_csv[selected_cols]

4.1.7.2 Label based: df.loc[row_label, column_label]

Selecting Rows and Columns by Labels:

You can use df.loc[] to select rows and columns of a DataFrame using labels. The syntax is df.loc[row_label, column_label].

You can specify a single label or a list of labels for both rows and columns.

The row label is the index value (not the positional row number).

# Select a single row and column
df_csv.loc[1, 'name']
# Select multiple rows and columns
df_csv.loc[[0, 2, 4], ['name', 'state']]
# Select rows from 'row_label_start' to 'row_label_end'
# Note: unlike positional slicing, loc slicing includes the end label
df_csv.loc[0:4]

## all same
# df_csv.loc[0:4, ]
# df_csv.loc[0:4, :]
# Select rows where a condition is True
df_csv.loc[df_csv['id'] < 101000]
df = df_csv.copy()
# Set a value for a specific row and column
df.loc[10, 'name'] = 'My University'

4.1.7.3 Filtering Rows Based on a Single Condition:

df = df_csv.copy()
# Select rows where the 'column_name' equals a specific value
df[df['name'] == 'Auburn University']
df[df.name == 'Auburn University']

4.1.7.4 Filtering Rows Based on Multiple Conditions (AND):

# Select rows where 'column1' equals 'value1' and 'column2' equals 'value2'
df[(df.city == 'Birmingham') & (df.state == 'AL')]
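Since the college data is unavailable here, a minimal sketch on made-up rows (the city and state values are invented):

```python
import pandas as pd

df = pd.DataFrame({'name':  ['Univ A', 'Univ B', 'Univ C'],
                   'city':  ['Birmingham', 'Auburn', 'Birmingham'],
                   'state': ['AL', 'AL', 'MI']})

# Single condition
in_auburn = df[df['city'] == 'Auburn']

# AND: both conditions must hold (note the parentheses around each)
al_bham = df[(df['city'] == 'Birmingham') & (df['state'] == 'AL')]

# OR: either condition may hold
either = df[(df['city'] == 'Auburn') | (df['state'] == 'MI')]
print(al_bham)
```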

4.1.8 copy

When you use loc/iloc to create a subset of a DataFrame, pandas may return a view into the original data rather than an independent copy. When it does, modifications to the subset can propagate to the original DataFrame (and vice versa), often with a SettingWithCopyWarning; because view-versus-copy depends on the data's layout, you should not rely on either behavior.

subset = df.loc['row_label1':'row_label5', 'column_labelA':'column_labelB']

subset = df.iloc[0:5, 1:3]  # Select rows 0 to 4 and columns 1 to 2

To create an independent copy of the DataFrame, you can use the copy() method. This ensures that modifications to the copied DataFrame do not affect the original DataFrame.

copy_df = df.copy()
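A small sketch of the difference: modifying an explicit copy leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# An explicit, independent copy
copy_df = df.copy()
copy_df.loc[0, 'a'] = 99

print(df.loc[0, 'a'])       # original value is preserved
print(copy_df.loc[0, 'a'])  # only the copy changed
```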

4.2 Recoding Columns in a Pandas DataFrame

Data manipulation is a crucial part of any data science project. One common task is recoding columns in a DataFrame, which involves transforming the values in a column based on certain rules or mapping schemes. This can be necessary for various reasons, such as preparing data for analysis, converting categorical variables, or cleaning data. In this section, we will explore different methods to recode columns in a Pandas DataFrame.

4.2.1 Using map and replace

The map and replace methods are convenient for simple value transformations and mappings.

Using map for Recoding:

The map method is useful for replacing values in a Series (a single column in a DataFrame) according to a mapping dictionary.

import pandas as pd

# Sample DataFrame
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C']
}
df = pd.DataFrame(data)

# Mapping dictionary
category_map = {
    'A': 'Alpha',
    'B': 'Beta',
    'C': 'Gamma'
}

# Recoding using map
df['Category_Recoded'] = df['Category'].map(category_map)
print(df)

Output:

  Category Category_Recoded
0        A            Alpha
1        B             Beta
2        C            Gamma
3        A            Alpha
4        B             Beta
5        C            Gamma

In this example, we created a mapping dictionary category_map and used the map method to create a new column Category_Recoded with the recoded values.

Using replace for Recoding:

The replace method can be used for more flexible replacements, including partial string replacements and handling missing values.

# Sample DataFrame with NaN values
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', None]
}
df = pd.DataFrame(data)

# Recoding using replace
df['Category_Recoded'] = df['Category'].replace({
    'A': 'Alpha',
    'B': 'Beta',
    'C': 'Gamma',
    None: 'Unknown'
})
print(df)

Output:

  Category Category_Recoded
0        A            Alpha
1        B             Beta
2        C            Gamma
3        A            Alpha
4        B             Beta
5        C            Gamma
6     None          Unknown

In this example, we used the replace method to handle missing values (None) and recode the Category column.

4.2.1.1 2. Using apply for Custom Functions

For more complex recoding logic, you can use the apply method with a custom function.

# Sample DataFrame
data = {
    'Score': [85, 92, 78, 88, 95, 70]
}
df = pd.DataFrame(data)

# Custom function to recode scores
def recode_score(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    else:
        return 'D'

# Recoding using apply
df['Grade'] = df['Score'].apply(recode_score)
print(df)

Output:

   Score Grade
0     85     B
1     92     A
2     78     C
3     88     B
4     95     A
5     70     C

In this example, we defined a custom function recode_score that assigns letter grades based on numerical scores. We then used the apply method to apply this function to the Score column and create a new Grade column.

4.2.1.2 3. Using Conditional Logic with np.where

The np.where function from the NumPy library allows for vectorized conditional operations, which can be very efficient for large DataFrames.

import numpy as np

# Sample DataFrame
data = {
    'Age': [15, 22, 45, 35, 63, 25]
}
df = pd.DataFrame(data)

# Recoding using np.where
df['Age_Group'] = np.where(df['Age'] < 18, 'Child',
                           np.where(df['Age'] < 65, 'Adult', 'Senior'))
print(df)

Output:

   Age Age_Group
0   15     Child
1   22     Adult
2   45     Adult
3   35     Adult
4   63     Adult
5   25     Adult

In this example, we used np.where to classify individuals into age groups based on their age. The first condition checks if the age is less than 18, classifying the individual as a “Child”. If not, the second condition checks if the age is less than 65, classifying the individual as an “Adult”. Otherwise, the individual is classified as a “Senior”.

4.2.2 Summary

Recoding columns in a Pandas DataFrame is a common task in data preparation and transformation. This section covered various methods for recoding: - Using the map method for straightforward value replacements. - Using the replace method for flexible value replacements. - Using the apply method with custom functions for complex recoding logic. - Using np.where for efficient conditional recoding.

By mastering these techniques, you can effectively clean and transform your data to prepare it for analysis and modeling.

Let’s explore what a.empty, a.bool(), a.item(), a.any(), and a.all() do in Python, particularly when used with pandas Series or DataFrames.

4.2.3 1. a.empty

  • Purpose: Checks if a DataFrame or Series is empty.
  • Usage:
    • Returns True if the DataFrame or Series has no elements (i.e., zero rows or zero columns).
    • Otherwise, it returns False.

4.2.3.1 Example:

import pandas as pd

# Empty DataFrame
df_empty = pd.DataFrame()
print(df_empty.empty)  # Output: True

# Non-empty DataFrame
df_non_empty = pd.DataFrame({'A': [1, 2]})
print(df_non_empty.empty)  # Output: False

# Empty Series
s_empty = pd.Series([], dtype='float64')  # explicit dtype avoids a dtype warning
print(s_empty.empty)  # Output: True

# Non-empty Series
s_non_empty = pd.Series([1, 2, 3])
print(s_non_empty.empty)  # Output: False

4.2.4 2. a.bool()

  • Purpose: Returns the boolean value of a Series or DataFrame.
  • Usage:
    • Can be used only if the Series or DataFrame contains exactly one element.
    • If there is more than one element, it raises a ValueError.
    • Note: bool() is deprecated in recent pandas releases; s.item() or bool(s.iloc[0]) are the forward-compatible alternatives.

4.2.4.1 Example:

import pandas as pd

# Single-element Series
s = pd.Series([True])
print(s.bool())  # Output: True

# Multi-element Series
s_multi = pd.Series([True, False])
# print(s_multi.bool())  # Raises ValueError: The truth value of a Series is ambiguous. 

4.2.5 3. a.item()

  • Purpose: Returns the single item from a Series.
  • Usage:
    • Can be used only if the Series has exactly one element.
    • Raises a ValueError if there are multiple elements.

4.2.5.1 Example:

import pandas as pd

# Single-element Series
s_single = pd.Series([42])
print(s_single.item())  # Output: 42

# Multi-element Series
s_multi = pd.Series([42, 43])
# print(s_multi.item())  # Raises ValueError: can only convert an array of size 1 to a Python scalar

4.2.6 4. a.any()

  • Purpose: Checks if any element in a Series or DataFrame is True.
  • Usage:
    • Returns True if at least one element is True; otherwise, returns False.
    • Can be applied to both Series and DataFrames.

4.2.6.1 Example:

import pandas as pd

# Example Series
s = pd.Series([False, True, False])
print(s.any())  # Output: True

# Example DataFrame
df = pd.DataFrame({'A': [0, 1, 0], 'B': [False, False, False]})
print(df.any())  # Output: 
# A    True
# B    False
# dtype: bool

# To check if any value is True in the entire DataFrame
print(df.any().any())  # Output: True

4.2.7 5. a.all()

  • Purpose: Checks if all elements in a Series or DataFrame are True.
  • Usage:
    • Returns True if all elements are True; otherwise, returns False.
    • Can be applied to both Series and DataFrames.

4.2.7.1 Example:

import pandas as pd

# Example Series
s = pd.Series([True, True, True])
print(s.all())  # Output: True

# Example DataFrame
df = pd.DataFrame({'A': [1, 1, 1], 'B': [True, True, True]})
print(df.all())  # Output:
# A    True
# B    True
# dtype: bool

# Checking if all values are True in the entire DataFrame
print(df.all().all())  # Output: True

4.2.8 Summary:

  • a.empty: Checks if the DataFrame or Series is empty.
  • a.bool(): Converts a Series or DataFrame with one element to a boolean.
  • a.item(): Retrieves a single element from a Series (if it has only one element).
  • a.any(): Checks if any element is True.
  • a.all(): Checks if all elements are True.

These functions help in checking conditions or accessing data in pandas objects and are useful in various data analysis tasks.

4.3 apply method

Using apply in NumPy and Pandas is a common practice to perform element-wise operations or to apply custom functions across data structures.

Here are some examples:

4.3.1 Using apply in Pandas:

Pandas’ apply function is very powerful for applying custom functions along either axis (rows or columns) of a DataFrame or across a Series.

4.3.1.1 Applying a Function to a DataFrame Column

Imagine you have a DataFrame df with a column salary, and you want to calculate a 10% increase for all salaries.

import pandas as pd

# Sample DataFrame
data = {'employee': ['Alice', 'Bob', 'Charlie'],
        'salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)

# Apply a function to increase salary by 10%
df['new_salary'] = df['salary'].apply(lambda x: x * 1.10)
print(df)

4.3.1.2 Applying a Function Across Rows

You have a DataFrame with columns x and y, and you want to create a new column z which is the sum of x and y.

# Sample DataFrame
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# Apply a function row-wise
df['z'] = df.apply(lambda row: row['x'] + row['y'], axis=1)
print(df)

4.3.1.3 Using apply with a Custom Function

You can also use apply to apply a custom function defined outside the apply call.

# Custom function to categorize salary
def categorize_salary(salary):
    if salary > 60000:
        return 'High'
    else:
        return 'Low'

# Apply the custom function to the 'salary' column
df['salary_category'] = df['salary'].apply(categorize_salary)
print(df)

4.3.2 Using apply in NumPy:

NumPy’s apply_along_axis function is used to apply a function along a particular axis of a NumPy array.

4.3.2.1 Applying a Function to Each Row of a 2D NumPy Array

Suppose you have a 2D array and want to calculate the sum of each row.

import numpy as np

# Sample 2D array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Function to calculate sum of an array
def row_sum(row):
    return np.sum(row)

# Apply the function along the axis 1 (rows)
result = np.apply_along_axis(row_sum, axis=1, arr=arr)
print(result)  # Output: [ 6 15 24]

4.3.2.2 Applying a Function to Each Column of a 2D NumPy Array

Similar to the previous example, but applied along the columns.

# Function to calculate the mean of an array
def column_mean(column):
    return np.mean(column)

# Apply the function along the axis 0 (columns)
result = np.apply_along_axis(column_mean, axis=0, arr=arr)
print(result)  # Output: [4. 5. 6.]

4.3.3 Key Points to Mention in an Interview:

  • Pandas’ apply function is very versatile and is generally used to apply custom functions row-wise or column-wise in a DataFrame or to a Series.

  • NumPy’s apply_along_axis is more specific and used for applying a function along a specific axis (rows or columns) of an array.

  • These functions are particularly useful when built-in functions are not sufficient or when custom operations are needed.

  • While apply can be convenient, it’s worth mentioning that it can be slower than vectorized operations in Pandas and NumPy, so it should be used judiciously.
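The last point can be illustrated directly: row-wise apply and the vectorized form give identical results, but apply makes a Python-level function call per row, so it is much slower on large data (sizes here are illustrative).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(10_000), 'y': np.arange(10_000)})

# Row-wise apply: one Python function call per row
z_apply = df.apply(lambda row: row['x'] + row['y'], axis=1)

# Vectorized: a single operation over whole columns
z_vec = df['x'] + df['y']

print((z_apply == z_vec).all())  # same result, very different cost
```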

These examples should provide you with a solid foundation to demonstrate your understanding of apply in NumPy and Pandas during a data science interview.