Chapter 2 Pandas Library
run pip3 install pandas
or
run !pip install pandas
on rstudio terminal or mac terminal or jupyter notebook
2.1 Data Frames
2.1.1 Data frame creation
Sometimes you may need to create an data frame
## Empty DataFrame
## Columns: []
## Index: []
## Empty DataFrame
## Columns: [A]
## Index: []
Is df2 empty?
There is column name but it is still empty.
## True
How to create a sample data frame?
From a dictionary
my_dict = {'Col_1': [1,2,3,4],
'Col_2': ['a', 'b', 'c', 'd'],
'Col_3': 1984}
my_df = pd.DataFrame(my_dict)
my_df[:5]
## Col_1 Col_2 Col_3
## 0 1 a 1984
## 1 2 b 1984
## 2 3 c 1984
## 3 4 d 1984
a_list = [1,2,3,4]
b_list = ['a', 'b', 'c', 'd']
##
df3 = pd.DataFrame({'var1': a_list,
'var2': b_list})
df3
## var1 var2
## 0 1 a
## 1 2 b
## 2 3 c
## 3 4 d
2.1.2 Read csv files
## id name city state region
## 0 102669 Alaska Pacific University Anchorage AK West
## 1 101648 Marion Military Institute Marion AL South
## 2 100830 Auburn University at Montgomery Montgomery AL South
## 3 101879 University of North Alabama Florence AL South
## 4 100858 Auburn University Auburn AL South
2.1.3 Data attributes
2.1.3.2 Columns
How to get the column names?
–> Column names are stored in columns
attribute.
## Index(['id', 'name', 'city', 'state', 'region'], dtype='object')
## ['id', 'name', 'city', 'state', 'region']
## ['id', 'name', 'city', 'state', 'region']
## id
## name
## city
## state
## region
2.1.3.3 data types in dataframe
## id int64
## name object
## city object
## state object
## region object
## dtype: object
## [dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O')]
2.1.3.4 index of the data frame
## id name city state region
## 0 102669 Alaska Pacific University Anchorage AK West
## 1 101648 Marion Military Institute Marion AL South
## 2 100830 Auburn University at Montgomery Montgomery AL South
## 3 101879 University of North Alabama Florence AL South
## 4 100858 Auburn University Auburn AL South
## RangeIndex(start=0, stop=100, step=1)
## 0
## 1
## 2
## 3
## 4
2.1.3.5 assign a new index
## id name city state region
## state
## AK 102669 Alaska Pacific University Anchorage AK West
## AL 101648 Marion Military Institute Marion AL South
## AL 100830 Auburn University at Montgomery Montgomery AL South
## AL 101879 University of North Alabama Florence AL South
## AL 100858 Auburn University Auburn AL South
2.1.3.6 index to column
## id name city state region \
## 0 102669 Alaska Pacific University Anchorage AK West
## 1 101648 Marion Military Institute Marion AL South
## 2 100830 Auburn University at Montgomery Montgomery AL South
## 3 101879 University of North Alabama Florence AL South
## 4 100858 Auburn University Auburn AL South
##
## index_column
## 0 0
## 1 1
## 2 2
## 3 3
## 4 4
- calling
keys()
function: output is similar to columns attributes
## Index(['id', 'name', 'city', 'state', 'region'], dtype='object')
column.values
method returns an array of index.
## ['id', 'name', 'city', 'state', 'region']
Using tolist()
method with values with given the list of columns.
## ['id', 'name', 'city', 'state', 'region']
Using sorted()
method : sorted()
method will return the list of columns sorted in alphabetical order.
## ['var1', 'var2']
2.2 read data
2.2.1 csv file
IBM sample data: I could not run with “https” because I did not have a certificate installed. So, I go on with “http” and it worked.
2.2.2 xlsx file
pandas.read_excel(io = path, sheet_name = 0, header = 0, names = None, index_col = None, usecols = None)
This returns dataframe object.
## respondent_id variety_of_choice electronics furniture \
## 0 1 8 6 6
## 1 2 6 3 1
## 2 3 6 1 2
## 3 4 8 3 3
## 4 5 4 6 3
##
## quality_of_service low_prices return_policy professional income age
## 0 3 2 2 1 40 45
## 1 4 7 8 0 20 41
## 2 4 9 6 0 20 31
## 3 4 8 7 1 30 37
## 4 9 2 5 1 45 56
2.2.3 rename/assign new column names
Renaming or assigning new column names in pandas DataFrame is a common operation, and there are several ways to accomplish it. Here are some examples with explanations:
2.2.3.1 Using the rename() method:
The rename()
method allows you to rename columns by specifying a dictionary where keys are the current column names and values are the new column names.
## ['NEW_ID', 'name', 'city', 'state', 'region']
Setting inplace=True will modify the DataFrame in place, otherwise a new DataFrame with the renamed columns will be returned.
2.2.3.2 Direct assignment with dictionary:
You can directly assign new column names to the columns attribute of the DataFrame using a dictionary.
Note: you need to specify all the column names in the dictionary.
df = df_csv[['name', 'state', 'id']].copy()
df.columns = {'MY_NAME': 'name', 'MY_STATE': 'state', 'id':'id'}
df.columns.to_list()
## ['MY_NAME', 'MY_STATE', 'id']
2.2.3.3 Using list assignment:
You can directly assign a list of column names to the columns attribute of the DataFrame.
Note: Make sure the length of the list matches the number of columns in the DataFrame.
## ['MY_NAME', 'MY_STATE']
2.2.3.4 Assigning new column names during DataFrame creation:
You can provide the columns parameter when creating a DataFrame to specify the column names.
df = pd.DataFrame(data, columns=['column1', 'column2', 'column3'])
2.2.3.6 rename
Renaming column names is a standard procedure most of the time. We may need to standardize column names, mostly make them clean.
There is a nice link here.
Lets create a sample data frame
df = pd.DataFrame({'A?la': [1,2,3,4],
'PYTHON': ['a', 'a', 'a', 'a'],
'col 3': ['1','1','b','b']})
df
## A?la PYTHON col 3
## 0 1 a 1
## 1 2 a 1
## 2 3 a b
## 3 4 a b
Method 1: Rename Specific Columns
## A?la Python col 3
## 0 1 a 1
## 1 2 a 1
## 2 3 a b
## 3 4 a b
## ['A?la', 'PYTHON', 'col 3']
Method 2: Rename All Columns
## ['new_name1', 'new_name2', 'new_name3']
Method 3: Replace Specific Characters in Columns
### set problematic names
problem_names = ['VAR1$$$', 'var2','Var3###']
df.columns = problem_names
### removing bad characters
df.columns = df.columns.str.replace('$', '', regex = True)
df.columns = df.columns.str.replace('#', '', regex = True)
list(df)
## ['VAR1', 'var2', 'Var3']
clean column names
Generally we expect clean columns to be: 1. short 2. meaningful 3. no space 4. no special character 5. probably lowercase
### set problematic names
problem_names = ['VAR $1', 'var #2','Var ? 3 ']
df.columns = problem_names
## Column names: remove white spaces and convert to lower case
df.columns= df.columns.str.strip()
## Column names: convert to lower case
df.columns = df.columns.str.lower()
## Column names: convert to lower case
df = df.rename(columns = str.lower)
## removing bad characters
df.columns = df.columns.str.replace('$', '', regex = True)
df.columns = df.columns.str.replace('#', '', regex = True)
df.columns = df.columns.str.replace('?', '', regex = True)
df.columns = df.columns.str.replace(' ', '', regex = True)
list(df)
## ['var1', 'var2', 'var3']
- sometimes we need them to be uppercase
## ['VAR1', 'VAR2', 'VAR3']
2.2.4 Subsetting
2.2.4.1 by columns
## id name city
## 0 102669 Alaska Pacific University Anchorage
## 1 101648 Marion Military Institute Marion
## 2 100830 Auburn University at Montgomery Montgomery
## 3 101879 University of North Alabama Florence
## 4 100858 Auburn University Auburn
## .. ... ... ...
## 95 118888 Mills College Oakland
## 96 110097 Biola University La Mirada
## 97 117140 University of La Verne La Verne
## 98 125727 Westmont College Santa Barbara
## 99 123651 Vanguard University of Southern California Costa Mesa
##
## [100 rows x 3 columns]
2.2.4.2 Label based: df.loc[row_label, column_label]
Selecting Rows and Columns by Labels:
You can use df.loc[] to select rows and columns of a DataFrame using labels. The syntax is df.loc[row_label, column_label].
You can specify a single label or a list of labels for both rows and columns.
row label is the index.
## 'Marion Military Institute'
## name state
## 0 Alaska Pacific University AK
## 2 Auburn University at Montgomery AL
## 4 Auburn University AL
## id name city state region
## 0 102669 Alaska Pacific University Anchorage AK West
## 1 101648 Marion Military Institute Marion AL South
## 2 100830 Auburn University at Montgomery Montgomery AL South
## 3 101879 University of North Alabama Florence AL South
## 4 100858 Auburn University Auburn AL South
## id name city state region
## 2 100830 Auburn University at Montgomery Montgomery AL South
## 4 100858 Auburn University Auburn AL South
## 5 100663 University of Alabama at Birmingham Birmingham AL South
## 9 100751 The University of Alabama Tuscaloosa AL South
## 11 100706 University of Alabama in Huntsville Huntsville AL South
## 18 100937 Birmingham Southern College Birmingham AL South
## 21 100724 Alabama State University Montgomery AL South
## 23 100654 Alabama A & M University Normal AL South
2.2.4.3 Filtering Rows Based on a Single Condition:
df = df_csv.copy()
# Select rows where the 'column_name' equals a specific value
df[df['name'] == 'Auburn University']
## id name city state region
## 4 100858 Auburn University Auburn AL South
## id name city state region
## 4 100858 Auburn University Auburn AL South
2.2.4.4 Filtering Rows Based on Multiple Conditions (AND):
# Select rows where 'column1' equals 'value1' and 'column2' equals 'value2'
df[(df.city == 'Birmingham') & (df.state == 'AL')]
## id name city state region
## 5 100663 University of Alabama at Birmingham Birmingham AL South
## 7 102049 Samford University Birmingham AL South
## 10 102261 Southeastern Bible College Birmingham AL South
## 18 100937 Birmingham Southern College Birmingham AL South
2.2.5 copy
When you use loc/iloc to create a subset of the DataFrame, you are also creating a view into the original DataFrame. Modifications to the subset will reflect in the original DataFrame, and vice versa.
subset = df.loc['row_label1':'row_label5', 'column_labelA':'column_labelB']
subset = df.iloc[0:5, 1:3] # Select rows 0 to 4 and columns 1 to 2
To create an independent copy of the DataFrame, you can use the copy() method. This ensures that modifications to the copied DataFrame do not affect the original DataFrame.