Chapter 1 PySpark

1.1 Statistics

1.1.1 Correlation

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

from pyspark.conf import SparkConf
from pyspark.sql.types import * 
import pyspark.sql.functions as F
from pyspark.sql.functions import col, asc,desc
from pyspark.mllib.stat import Statistics
from pyspark.sql.functions import udf
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler,StandardScaler
from pyspark.ml import Pipeline

# Create a Spark session
spark=SparkSession.builder \
    .master ("local[*]")\
    .appName("correlation")\
    .getOrCreate()

# Get the Spark context
sc = spark.sparkContext

# Create a SQL context
sqlContext = SQLContext(sc)
## /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pyspark/sql/context.py:112: FutureWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
##   warnings.warn(
data_path = 'data/diabetes.csv'
df_diabetes=spark.read \
      .option("header","True")\
      .option("inferSchema","True")\
      .option("sep",",")\
      .csv(data_path)
#####################################################
print("There are", df_diabetes.count(), "rows", len(df_diabetes.columns),
      "columns in the data.") 
## There are 768 rows 9 columns in the data.

Exploring the Data

Before diving into correlation analysis, let’s take a quick look at the dataset:

my_columns = ["Outcome", "Glucose", "BloodPressure","BMI"]

df_diabetes.select(my_columns).show(5)
## +-------+-------+-------------+----+
## |Outcome|Glucose|BloodPressure| BMI|
## +-------+-------+-------------+----+
## |      1|    148|           72|33.6|
## |      0|     85|           66|26.6|
## |      1|    183|           64|23.3|
## |      0|     89|           66|28.1|
## |      1|    137|           40|43.1|
## +-------+-------+-------------+----+
## only showing top 5 rows

Pearson Correlation

Pearson correlation, named after Karl Pearson, is a statistical measure that quantifies the strength and direction of a linear relationship between two continuous variables. It assesses how well the relationship between the two variables can be described by a straight line. The Pearson correlation coefficient, often denoted by \(r\), ranges from -1 to 1, where:

  • \(r = 1\): Perfect positive linear correlation.
  • \(r = -1\): Perfect negative (inverse) linear correlation.
  • \(r = 0\): No linear correlation.

**Key Points:*

  1. Linearity: Pearson correlation measures the linear relationship between two variables. If the relationship is nonlinear, Pearson correlation may not accurately reflect the strength and direction of the association.

  2. Calculation: The Pearson correlation coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations:

    \[ r = \frac{\text{cov}(X, Y)}{\sigma_X \cdot \sigma_Y} \]

    where \(\text{cov}(X, Y)\) is the covariance between variables \(X\) and \(Y\), and \(\sigma_X\) and \(\sigma_Y\) are their respective standard deviations.

  3. Strength of Correlation: The absolute value of \(r\) indicates the strength of the correlation. A value close to 1 suggests a strong linear correlation, while a value close to 0 indicates a weak or no linear correlation.

  4. Direction of Correlation: The sign of \(r\) (+ or -) indicates the direction of the correlation. If \(r > 0\), it indicates a positive correlation (as one variable increases, the other tends to increase). If \(r < 0\), it indicates a negative correlation (as one variable increases, the other tends to decrease).

  5. Assumptions: Pearson correlation assumes that the relationship between the variables is linear and that the data is normally distributed. If these assumptions are violated, the reliability of Pearson correlation results may be compromised.

  6. Use Cases: Pearson correlation is widely used in various fields, including economics, biology, psychology, and many others, to analyze the degree and direction of association between two continuous variables.

It’s important to note that while Pearson correlation is a powerful tool for linear relationships, it might not capture complex or nonlinear associations. In such cases, other correlation measures like Spearman correlation (for monotonic relationships) or more advanced techniques may be considered.

from pyspark.ml.feature import VectorAssembler

# Selecting relevant features
selected_features = ["Glucose", "BloodPressure", "BMI"]

# Assemble features into a vector
assembler = VectorAssembler(inputCols=selected_features, outputCol="features")
df_diabetes_vectorized = assembler.transform(df_diabetes.select(selected_features))

Calculate Correlation Coefficients

Now, let’s calculate the Pearson correlation matrix using PySpark:

from pyspark.ml.stat import Correlation

# Calculate Pearson correlation matrix
pearson_corr_matrix = Correlation.corr(df_diabetes_vectorized, 
                                      "features", 
                                      method="pearson").head()

pearson_corr_matrix
## Row(pearson(features)=DenseMatrix(3, 3, [1.0, 0.1526, 0.2211, 0.1526, 1.0, 0.2818, 0.2211, 0.2818, 1.0], False))
Correlation.corr(df_diabetes_vectorized,
                "features", 
                method="pearson").head()[0]
## DenseMatrix(3, 3, [1.0, 0.1526, 0.2211, 0.1526, 1.0, 0.2818, 0.2211, 0.2818, 1.0], False)
# Extract the correlation matrix as a DenseMatrix
corr_values = pearson_corr_matrix[0].toArray()
corr_values
## array([[1.        , 0.15258959, 0.22107107],
##        [0.15258959, 1.        , 0.28180529],
##        [0.22107107, 0.28180529, 1.        ]])

Spearman Correlation

Spearman correlation, named after Charles Spearman, is a statistical measure that assesses the strength and direction of monotonic relationships between two variables. Unlike Pearson correlation, which measures linear relationships, Spearman correlation focuses on the monotonicity of the relationship, whether it is increasing or decreasing.

Key points about Spearman correlation:

  1. Monotonic Relationship: Spearman correlation is based on the ranks of the data rather than the actual values. It looks at whether, when one variable increases, the other tends to also increase (and vice versa) in a monotonic fashion. It does not assume a linear relationship.

  2. Ranking: The data for each variable is ranked, and the correlation is calculated based on the ranks. If the variables have a perfect monotonic relationship, the Spearman correlation will be 1. If one variable tends to increase as the other decreases, the correlation will be -1. Calculation: The Spearman correlation coefficient (\(\rho\)) is calculated as follows:

\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\] where \(d_i\) is the difference between the ranks of corresponding values for the two variables, and \(n\) is the number of data points.

  1. Strength of Correlation: As with Pearson correlation, the absolute value of the Spearman correlation coefficient indicates the strength of the relationship. A coefficient close to 1 or -1 suggests a strong monotonic relationship, while a coefficient close to 0 suggests a weak or no monotonic relationship.

  2. Use Cases: Spearman correlation is particularly useful when dealing with ordinal or ranked data, as it doesn’t rely on the assumption of normality. It’s commonly used in social sciences, finance, and any other fields where the relationship between variables might be nonlinear.

  3. Ranking Ties: In cases where there are tied values (two or more identical values) in the data, special methods are used to handle ties in the ranking process.

  4. Spearman correlation is a valuable tool when the assumptions of linear relationships are not met, or when dealing with data where the exact values are less important than their relative order. It provides a robust measure of association between variables without requiring a specific functional form for the relationship.

1.1.1.1 Custom Correlation Method

def correlation_df(df, target_var, feature_cols, method):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    # Assemble features into a vector
    target_var = [target_var]
    feature_cols = feature_cols
    df_cor = df.select(target_var + feature_cols)
    assembler = VectorAssembler(inputCols=target_var+feature_cols, outputCol="features")
    df_cor = assembler.transform(df_cor)

    # Calculate correlation matrix
    correlation_matrix = Correlation.corr(df_cor, "features", method=method).head()

    # Extract the correlation coefficient between target and each feature
    target_corr_list = [correlation_matrix[0][i, 0] for i in range(len(feature_cols)+1)][1:]

    # Create a DataFrame with target variable, feature names, and correlation coefficients
    correlation_data = [(feature_cols[i], float(target_corr_list[i])) for i in range(len(feature_cols))]

    correlation_df = spark.createDataFrame(correlation_data, ["feature", "correlation"])
    # Print the result
    return correlation_df
target = 'Outcome'
inputs = ["Glucose", "BloodPressure","BMI"]

df = correlation_df(df = df_diabetes, 
                    target_var = target, 
                    feature_cols = inputs,
                    method = 'spearman')
df.show()
## +-------------+-------------------+
## |      feature|        correlation|
## +-------------+-------------------+
## |      Glucose|0.47577630645832697|
## |BloodPressure| 0.1429206751899304|
## |          BMI|0.30970674263113884|
## +-------------+-------------------+
target = 'Outcome'
inputs = ["Glucose", "BloodPressure","BMI"]

df_p = correlation_df(df = df_diabetes, 
                    target_var = target, 
                    feature_cols = inputs,
                    method = 'pearson')
                    
df_p.show()
## +-------------+-------------------+
## |      feature|        correlation|
## +-------------+-------------------+
## |      Glucose| 0.4665813983068757|
## |BloodPressure| 0.0650683595503333|
## |          BMI|0.29269466264444677|
## +-------------+-------------------+
# Convert Spark DataFrame to Pandas DataFrame and sort by 'cola' in descending order
pandas_df = df_p.toPandas().sort_values(by='correlation', ascending=False)
pandas_df
##          feature  correlation
## 0        Glucose     0.466581
## 2            BMI     0.292695
## 1  BloodPressure     0.065068