Chapter 3 Pandas Library

Install pandas by running pip3 install pandas in the RStudio terminal or the Mac terminal, or by running !pip install pandas in a Jupyter notebook cell.

import pandas as pd

# Show every column when printing a DataFrame instead of truncating wide output
pd.set_option('display.max_columns', None)

3.0.1 sorting

import pandas as pd

# Create a Pandas Series
series = pd.Series([3, 1, 4, 1, 5, 9, 2])

# Sort the Series by values
sorted_series = series.sort_values()
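sort_values returns a new Series and leaves the original untouched. A minimal sketch of the common variations, reusing the same sample values (the variable names are only illustrative):

```python
import pandas as pd

series = pd.Series([3, 1, 4, 1, 5, 9, 2])

# Ascending order is the default
ascending = series.sort_values()

# Pass ascending=False for descending order
descending = series.sort_values(ascending=False)

# sort_index restores the original row order, because the index labels travel with the values
by_index = descending.sort_index()

print(ascending.tolist())
print(descending.tolist())
print(by_index.tolist())
```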

3.0.2 query method

The query method of a pandas DataFrame filters and selects rows using a string expression. Here is an overview of the method in the context of data science:

Purpose of the query Method:

  • Simplifies Data Filtering: The query method allows you to filter data using a string expression, which is often more intuitive and readable than traditional boolean indexing.

  • Improves Readability: By using query, complex filtering conditions can be written in a way that resembles SQL, making the code easier to understand and maintain.

Key Features and Advantages:

  1. Readability and Simplicity:
  • The query method lets you filter a DataFrame with an expression written as a string, in which column names are used directly.

For example:

df.query("column_name > value")

  • This is easier to read than:

df[df['column_name'] > value]

  2. Support for Local Variables:
  • You can reference local Python variables inside the query expression by prefixing them with @. This is useful when the filtering criteria are dynamic or based on external conditions.

threshold = 90

df.query("grade > @threshold")

  3. Chaining Queries:
  • The query method can be chained to apply multiple filters sequentially, which can be more readable than combining multiple conditions using & or |.

df.query("grade > 90").query("favorite_color == 'red'")

  4. Avoiding Complex Boolean Indexing:
  • When several conditions need to be applied, boolean indexing can become cumbersome. The query method simplifies this by allowing the conditions to be combined with and / or in a single expression.

df.query("age > 20 and score < 80")
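The four points above can be tried on a small DataFrame. This is a minimal sketch: the data and the name column are invented for illustration, and only the column names grade, favorite_color, age and score come from the snippets above.

```python
import pandas as pd

# Hypothetical data, made up for this example
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cho", "Dev"],
    "grade": [95, 88, 92, 79],
    "favorite_color": ["red", "blue", "blue", "red"],
    "age": [22, 19, 25, 21],
    "score": [75, 85, 60, 70],
})

# 1. query versus boolean indexing: both select the same rows
threshold = 90
high = df.query("grade > @threshold")      # 2. @ refers to the local variable
high_bool = df[df["grade"] > threshold]

# 3. Chained queries apply the filters one after another
high_red = df.query("grade > 90").query("favorite_color == 'red'")

# 4. Several conditions combined in one expression
combined = df.query("age > 20 and score < 80")

print(high)
print(high_red)
print(combined)
```

Each call returns a new DataFrame, so df itself is not modified by any of the filters.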

3.0.2.1 Considerations:

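  • Performance: query has to parse the expression string on every call, so on small DataFrames it can be slightly slower than boolean indexing. On large DataFrames (the pandas documentation mentions roughly 100,000 rows or more) it can be faster when the optional numexpr package is installed. In most data science applications the difference is negligible.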

  • Syntax Limitations: The query method only supports a subset of Python syntax, so certain complex operations may still require traditional methods.
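As an illustration of the syntax point, the sketch below shows two cases. A column name that is not a valid Python identifier has to be wrapped in backticks inside the expression, and a condition that depends on an arbitrary Python function is often simpler to write as a boolean mask. The data and the helper ends_with_n are hypothetical, and exactly which expressions query accepts can vary with the pandas version.

```python
import pandas as pd

# Hypothetical data; note the space in the first column name
df = pd.DataFrame({
    "student name": ["Ann", "Ben", "Cho"],
    "grade": [95, 88, 92],
})

# Backticks quote a column name that contains a space
ann_rows = df.query("`student name` == 'Ann'")

# Traditional fallback: build the boolean mask with ordinary Python code
def ends_with_n(name):
    """Hypothetical condition used only for this example."""
    return name.endswith("n")

mask = df["student name"].apply(ends_with_n)
fallback = df[mask]

print(ann_rows)
print(fallback)
```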

3.1 read data

3.1.1 csv file

IBM sample data: the "https" link did not work for me because I did not have a certificate installed, so I used "http" instead, which worked.

data_link = "http://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/data/recipes.csv"

recipes = pd.read_csv(data_link)
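read_csv accepts a local path, a URL, or a file-like object. To try it without a network connection, a minimal sketch can read from an in-memory string. The column names and values below are invented for illustration and are not taken from the recipes file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file
csv_text = """cuisine,rice,garlic
japanese,1,0
italian,0,1
"""

# io.StringIO wraps the string so read_csv can treat it like an open file
recipes_demo = pd.read_csv(io.StringIO(csv_text))

print(recipes_demo.shape)
print(recipes_demo.head())
```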

3.1.2 xlsx file

pandas.read_excel(io = path, sheet_name = 0, header = 0, names = None, index_col = None, usecols = None)

This returns a DataFrame object (or a dict of DataFrames when sheet_name is a list or None).

# The path is relative to the current working directory, so the file must exist there
df = pd.read_excel(io = "./data/segmentation.xlsx",
                  sheet_name = "sheet")

df.head(n=5)
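read_excel raises FileNotFoundError when the path does not exist relative to the current working directory. One way to guard against that is to check the path first. This is a minimal sketch, and read_excel_if_present is a hypothetical helper, not part of pandas. Reading .xlsx files also assumes the optional openpyxl package is installed.

```python
from pathlib import Path

import pandas as pd


def read_excel_if_present(path, sheet_name=0):
    """Hypothetical helper: return a DataFrame, or None if the file is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        # Print the absolute path so a wrong working directory is easy to spot
        print(f"File not found: {file_path.resolve()}")
        return None
    return pd.read_excel(file_path, sheet_name=sheet_name)


df = read_excel_if_present("./data/segmentation.xlsx", sheet_name="sheet")
if df is not None:
    print(df.head(n=5))
```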