Chapter 2 Pandas Library

run pip3 install pandas or run !pip install pandas on rstudio terminal or mac terminal or jupyter notebook

import pandas as pd 
pd.set_option('display.max_columns', None)

2.1 Data Frames

2.1.1 Data frame creation

Sometimes you may need to create an data frame

## empty data frame
df1 = pd.DataFrame()
df1
## Empty DataFrame
## Columns: []
## Index: []
df2 = pd.DataFrame({'A' : []})
df2
## Empty DataFrame
## Columns: [A]
## Index: []

Is df2 empty?

There is column name but it is still empty.

df2.empty
## True

How to create a sample data frame?

From a dictionary

my_dict = {'Col_1': [1,2,3,4], 
           'Col_2': ['a', 'b', 'c', 'd'],
           'Col_3': 1984}
           
my_df = pd.DataFrame(my_dict)

my_df[:5]
##    Col_1 Col_2  Col_3
## 0      1     a   1984
## 1      2     b   1984
## 2      3     c   1984
## 3      4     d   1984
a_list = [1,2,3,4]
b_list = ['a', 'b', 'c', 'd']
##
df3 = pd.DataFrame({'var1': a_list, 
                    'var2': b_list})
df3
##    var1 var2
## 0     1    a
## 1     2    b
## 2     3    c
## 3     4    d

2.1.2 Read csv files

df_csv = pd.read_csv('data/college.csv').iloc[:100, :5]
df_csv[:5]
##        id                             name        city state region
## 0  102669        Alaska Pacific University   Anchorage    AK   West
## 1  101648        Marion Military Institute      Marion    AL  South
## 2  100830  Auburn University at Montgomery  Montgomery    AL  South
## 3  101879      University of North Alabama    Florence    AL  South
## 4  100858                Auburn University      Auburn    AL  South

2.1.3 Data attributes

2.1.3.1 shape of data frame

df_csv.shape
## (100, 5)
print("number of rows:", df_csv.shape[0],
      "\nnumber of columns:", df_csv.shape[1])
## number of rows: 100 
## number of columns: 5

2.1.3.2 Columns

How to get the column names?

–> Column names are stored in columns attribute.

df_csv.columns
## Index(['id', 'name', 'city', 'state', 'region'], dtype='object')
list(df_csv.columns)
## ['id', 'name', 'city', 'state', 'region']
df_csv.columns.to_list()
## ['id', 'name', 'city', 'state', 'region']
for cols in df_csv.columns:
  print(cols)
## id
## name
## city
## state
## region

2.1.3.3 data types in dataframe

df_csv.dtypes
## id         int64
## name      object
## city      object
## state     object
## region    object
## dtype: object
[types for types in df_csv.dtypes]
## [dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O')]

2.1.3.4 index of the data frame

df_csv[:5]
##        id                             name        city state region
## 0  102669        Alaska Pacific University   Anchorage    AK   West
## 1  101648        Marion Military Institute      Marion    AL  South
## 2  100830  Auburn University at Montgomery  Montgomery    AL  South
## 3  101879      University of North Alabama    Florence    AL  South
## 4  100858                Auburn University      Auburn    AL  South
df_csv.index
## RangeIndex(start=0, stop=100, step=1)
for i in df_csv.index:
  if i < 5:
    print(i)
## 0
## 1
## 2
## 3
## 4

2.1.3.5 assign a new index

df = df_csv.copy()
df.index = df.state

df[:5]
##            id                             name        city state region
## state                                                                  
## AK     102669        Alaska Pacific University   Anchorage    AK   West
## AL     101648        Marion Military Institute      Marion    AL  South
## AL     100830  Auburn University at Montgomery  Montgomery    AL  South
## AL     101879      University of North Alabama    Florence    AL  South
## AL     100858                Auburn University      Auburn    AL  South

2.1.3.6 index to column

df = df_csv.copy()
df['index_column'] = df.index

df[:5]
##        id                             name        city state region  \
## 0  102669        Alaska Pacific University   Anchorage    AK   West   
## 1  101648        Marion Military Institute      Marion    AL  South   
## 2  100830  Auburn University at Montgomery  Montgomery    AL  South   
## 3  101879      University of North Alabama    Florence    AL  South   
## 4  100858                Auburn University      Auburn    AL  South   
## 
##    index_column  
## 0             0  
## 1             1  
## 2             2  
## 3             3  
## 4             4
  • calling keys() function: output is similar to columns attributes
df_csv.keys()
## Index(['id', 'name', 'city', 'state', 'region'], dtype='object')
  • column.values method returns an array of index.
list(df_csv.columns.values)
## ['id', 'name', 'city', 'state', 'region']

Using tolist() method with values with given the list of columns.

list(df_csv.columns.values.tolist())
## ['id', 'name', 'city', 'state', 'region']

Using sorted() method : sorted() method will return the list of columns sorted in alphabetical order.

sorted(df3)
## ['var1', 'var2']

2.1.4 Create or add a new column

You can create a new column in a pandas DataFrame based on the values of existing columns using various methods, such as assignment, the apply() function, or vectorized operations.

Here are some examples:

2.1.5 Explore Data

#recipes.head()

2.2 read data

2.2.1 csv file

IBM sample data: I could not run with “https” because I did not have a certificate installed. So, I go on with “http” and it worked.

data_link = "http://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/data/recipes.csv"

recipes = pd.read_csv(data_link)

2.2.2 xlsx file

pandas.read_excel(io = path, sheet_name = 0, header = 0, names = None, index_col = None, usecols = None)

This returns dataframe object.

df = pd.read_excel(io = "./data/segmentation.xlsx",
                  sheet_name = "sheet")
                  
df.head(n=5)
##    respondent_id  variety_of_choice  electronics  furniture  \
## 0              1                  8            6          6   
## 1              2                  6            3          1   
## 2              3                  6            1          2   
## 3              4                  8            3          3   
## 4              5                  4            6          3   
## 
##    quality_of_service  low_prices  return_policy  professional  income  age  
## 0                   3           2              2             1      40   45  
## 1                   4           7              8             0      20   41  
## 2                   4           9              6             0      20   31  
## 3                   4           8              7             1      30   37  
## 4                   9           2              5             1      45   56

2.2.3 rename/assign new column names

Renaming or assigning new column names in pandas DataFrame is a common operation, and there are several ways to accomplish it. Here are some examples with explanations:

2.2.3.1 Using the rename() method:

The rename() method allows you to rename columns by specifying a dictionary where keys are the current column names and values are the new column names.

df = df_csv.copy()
df.rename(columns={'id': 'NEW_ID'}, inplace=True)
df.columns.to_list()
## ['NEW_ID', 'name', 'city', 'state', 'region']

Setting inplace=True will modify the DataFrame in place, otherwise a new DataFrame with the renamed columns will be returned.

2.2.3.2 Direct assignment with dictionary:

You can directly assign new column names to the columns attribute of the DataFrame using a dictionary.

Note: you need to specify all the column names in the dictionary.

df = df_csv[['name', 'state', 'id']].copy()
df.columns = {'MY_NAME': 'name', 'MY_STATE': 'state', 'id':'id'}
df.columns.to_list()
## ['MY_NAME', 'MY_STATE', 'id']

2.2.3.3 Using list assignment:

You can directly assign a list of column names to the columns attribute of the DataFrame.

Note: Make sure the length of the list matches the number of columns in the DataFrame.

df = df_csv[['name', 'state']].copy()
df.columns = ['MY_NAME', 'MY_STATE']
df.columns.to_list()
## ['MY_NAME', 'MY_STATE']

2.2.3.4 Assigning new column names during DataFrame creation:

You can provide the columns parameter when creating a DataFrame to specify the column names.

df = pd.DataFrame(data, columns=['column1', 'column2', 'column3'])

2.2.3.5 to uppercase all column names

df = df_csv.copy()
df.columns = [x.upper() for x in df.columns]
df.columns.to_list()
## ['ID', 'NAME', 'CITY', 'STATE', 'REGION']

2.2.3.6 rename

Renaming column names is a standard procedure most of the time. We may need to standardize column names, mostly make them clean.

There is a nice link here.

Lets create a sample data frame

df = pd.DataFrame({'A?la': [1,2,3,4],
                    'PYTHON': ['a', 'a', 'a', 'a'],
                    'col 3': ['1','1','b','b']})
df
##    A?la PYTHON col 3
## 0     1      a     1
## 1     2      a     1
## 2     3      a     b
## 3     4      a     b

Method 1: Rename Specific Columns

df.rename(columns = {'PYTHON':'Python'}, inplace = False)
##    A?la Python col 3
## 0     1      a     1
## 1     2      a     1
## 2     3      a     b
## 3     4      a     b
### list column names
list(df)
## ['A?la', 'PYTHON', 'col 3']

Method 2: Rename All Columns

new_names = ['new_name1', 'new_name2','new_name3']
df.columns = new_names
list(df)
## ['new_name1', 'new_name2', 'new_name3']

Method 3: Replace Specific Characters in Columns

### set problematic names
problem_names = ['VAR1$$$', 'var2','Var3###']
df.columns = problem_names

### removing bad characters
df.columns = df.columns.str.replace('$', '', regex = True)
df.columns = df.columns.str.replace('#', '', regex = True)
list(df)
## ['VAR1', 'var2', 'Var3']

clean column names

Generally we expect clean columns to be: 1. short 2. meaningful 3. no space 4. no special character 5. probably lowercase

### set problematic names
problem_names = ['VAR  $1', 'var #2','Var ? 3 ']
df.columns = problem_names

## Column names: remove white spaces and convert to lower case
df.columns= df.columns.str.strip()

## Column names: convert to lower case
df.columns = df.columns.str.lower()

## Column names: convert to lower case
df = df.rename(columns = str.lower)

## removing bad characters
df.columns = df.columns.str.replace('$', '', regex = True)
df.columns = df.columns.str.replace('#', '', regex = True)
df.columns = df.columns.str.replace('?', '', regex = True)
df.columns = df.columns.str.replace(' ', '', regex = True)

list(df)
## ['var1', 'var2', 'var3']
  • sometimes we need them to be uppercase
## Column names: convert to upper case
df.columns = df.columns.str.upper()

list(df)
## ['VAR1', 'VAR2', 'VAR3']

2.2.4 Subsetting

2.2.4.1 by columns

selected_cols = ['id', 'name', 'city']

df_csv[selected_cols]
##         id                                        name           city
## 0   102669                   Alaska Pacific University      Anchorage
## 1   101648                   Marion Military Institute         Marion
## 2   100830             Auburn University at Montgomery     Montgomery
## 3   101879                 University of North Alabama       Florence
## 4   100858                           Auburn University         Auburn
## ..     ...                                         ...            ...
## 95  118888                               Mills College        Oakland
## 96  110097                            Biola University      La Mirada
## 97  117140                      University of La Verne       La Verne
## 98  125727                            Westmont College  Santa Barbara
## 99  123651  Vanguard University of Southern California     Costa Mesa
## 
## [100 rows x 3 columns]

2.2.4.2 Label based: df.loc[row_label, column_label]

Selecting Rows and Columns by Labels:

You can use df.loc[] to select rows and columns of a DataFrame using labels. The syntax is df.loc[row_label, column_label].

You can specify a single label or a list of labels for both rows and columns.

row label is the index.

# Select a single row and column
df_csv.loc[1, 'name']
## 'Marion Military Institute'
# Select multiple rows and columns
df_csv.loc[[0, 2, 4], ['name', 'state']]
##                               name state
## 0        Alaska Pacific University    AK
## 2  Auburn University at Montgomery    AL
## 4                Auburn University    AL
# Select rows from 'row_label_start' to 'row_label_end'
df_csv.loc[0:4]
##        id                             name        city state region
## 0  102669        Alaska Pacific University   Anchorage    AK   West
## 1  101648        Marion Military Institute      Marion    AL  South
## 2  100830  Auburn University at Montgomery  Montgomery    AL  South
## 3  101879      University of North Alabama    Florence    AL  South
## 4  100858                Auburn University      Auburn    AL  South

## all same
# df_csv.loc[0:4, ]
# df_csv.loc[0:4, :]
# Select rows where a condition is True
df_csv.loc[df_csv['id'] < 101000]
##         id                                 name        city state region
## 2   100830      Auburn University at Montgomery  Montgomery    AL  South
## 4   100858                    Auburn University      Auburn    AL  South
## 5   100663  University of Alabama at Birmingham  Birmingham    AL  South
## 9   100751            The University of Alabama  Tuscaloosa    AL  South
## 11  100706  University of Alabama in Huntsville  Huntsville    AL  South
## 18  100937          Birmingham Southern College  Birmingham    AL  South
## 21  100724             Alabama State University  Montgomery    AL  South
## 23  100654             Alabama A & M University      Normal    AL  South
df = df_csv.copy()
# Set a value for a specific row and column
df.loc[10, 'name'] = 'My University'

2.2.4.3 Filtering Rows Based on a Single Condition:

df = df_csv.copy()
# Select rows where the 'column_name' equals a specific value
df[df['name'] == 'Auburn University']
##        id               name    city state region
## 4  100858  Auburn University  Auburn    AL  South
df[df.name == 'Auburn University']
##        id               name    city state region
## 4  100858  Auburn University  Auburn    AL  South

2.2.4.4 Filtering Rows Based on Multiple Conditions (AND):

# Select rows where 'column1' equals 'value1' and 'column2' equals 'value2'
df[(df.city == 'Birmingham') & (df.state == 'AL')]
##         id                                 name        city state region
## 5   100663  University of Alabama at Birmingham  Birmingham    AL  South
## 7   102049                   Samford University  Birmingham    AL  South
## 10  102261           Southeastern Bible College  Birmingham    AL  South
## 18  100937          Birmingham Southern College  Birmingham    AL  South

2.2.5 copy

When you use loc/iloc to create a subset of the DataFrame, you are also creating a view into the original DataFrame. Modifications to the subset will reflect in the original DataFrame, and vice versa.

subset = df.loc['row_label1':'row_label5', 'column_labelA':'column_labelB']

subset = df.iloc[0:5, 1:3]  # Select rows 0 to 4 and columns 1 to 2

To create an independent copy of the DataFrame, you can use the copy() method. This ensures that modifications to the copied DataFrame do not affect the original DataFrame.

copy_df = df.copy()