Week 6: Correlation and Regressions in Python¶
This week, you will learn how to conduct correlation and linear regression analysis in Python.
1. Linking Datasets in Python¶
In Environmental Sciences, we often want to explore the relationship between two sets of measurements (e.g., how is air quality correlated with temperature?). The first step of an exploratory correlation analysis is to link the two datasets by matching the measurements taken at the same time and place. Matching datasets can be a pain in Excel, but with the indexing functionality in Pandas it is convenient to link datasets.
#Getting all the packages we need:
import pandas as pd
import numpy as np
Here we use an example to explore the correlation between ozone air quality and temperature at the Millbrook site in New York. We have two datasets: (1) daily ozone measurements at this site (https://www.epa.gov/outdoor-air-quality-data/download-daily-data); (2) NOAA weather station data at the same site (https://www.ncdc.noaa.gov/data-access/land-based-station-data).
First, let's read in these two datasets using the Pandas read_csv function:
df_AQS = pd.read_csv('Millbrook_NY_daily_ozone_2022.csv', parse_dates = ['Date'])
df_AQS.head()
Date | Source | Site ID | POC | Daily Max 8-hour Ozone Concentration | Units | Daily AQI Value | Local Site Name | Daily Obs Count | Percent Complete | ... | AQS Parameter Description | Method Code | CBSA Code | CBSA Name | State FIPS Code | State | County FIPS Code | County | Site Latitude | Site Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-01-15 | AQS | 360270007 | 1 | 0.034 | ppm | 31 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
1 | 2022-01-16 | AQS | 360270007 | 1 | 0.037 | ppm | 34 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
2 | 2022-01-17 | AQS | 360270007 | 1 | 0.038 | ppm | 35 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
3 | 2022-01-18 | AQS | 360270007 | 1 | 0.042 | ppm | 39 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
4 | 2022-01-19 | AQS | 360270007 | 1 | 0.029 | ppm | 27 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
5 rows × 21 columns
df_weather = pd.read_csv('Millbrook_NY_daily_weather.csv', parse_dates = ['LST_DATE'])
df_weather.head()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 rows × 29 columns
df_weather.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_TYPE', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
Next, we need to merge the two dataframes by matching the date and location. Since this dataset covers only one site, we don't need to match the location, so we just match the date. Set the date column of each dataframe as its index:
df_AQS = df_AQS.set_index('Date')
df_weather = df_weather.set_index('LST_DATE')
If we only need to add selected columns from df_weather to df_AQS, we just assign the new columns to df_AQS. For a single column:
df_AQS['T_DAILY_MEAN'] = df_weather['T_DAILY_MEAN']
For multiple columns:
df_AQS[['T_DAILY_MAX','T_DAILY_MIN','RH_DAILY_MAX']] = df_weather[['T_DAILY_MAX','T_DAILY_MIN','RH_DAILY_MAX']]
Now you should see that df_AQS has four new columns: three with temperature and one with relative humidity. We've merged these two datasets!
df_AQS
Source | Site ID | POC | Daily Max 8-hour Ozone Concentration | Units | Daily AQI Value | Local Site Name | Daily Obs Count | Percent Complete | AQS Parameter Code | ... | State FIPS Code | State | County FIPS Code | County | Site Latitude | Site Longitude | T_DAILY_MEAN | T_DAILY_MAX | T_DAILY_MIN | RH_DAILY_MAX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||||||||||||
2022-01-15 | AQS | 360270007 | 1 | 0.034 | ppm | 31 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -14.7 | -11.6 | -17.9 | 60.0 |
2022-01-16 | AQS | 360270007 | 1 | 0.037 | ppm | 34 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -10.4 | -1.1 | -19.7 | 87.1 |
2022-01-17 | AQS | 360270007 | 1 | 0.038 | ppm | 35 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 2.0 | 5.2 | -1.1 | 87.1 |
2022-01-18 | AQS | 360270007 | 1 | 0.042 | ppm | 39 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -4.8 | 0.3 | -9.9 | 66.3 |
2022-01-19 | AQS | 360270007 | 1 | 0.029 | ppm | 27 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -2.4 | 5.8 | -10.6 | 76.8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-26 | AQS | 360270007 | 1 | 0.033 | ppm | 31 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -6.8 | -2.4 | -11.1 | 70.9 |
2022-12-27 | AQS | 360270007 | 1 | 0.030 | ppm | 28 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -4.4 | -0.8 | -8.0 | 81.9 |
2022-12-28 | AQS | 360270007 | 1 | 0.027 | ppm | 25 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 0.7 | 7.4 | -6.1 | 80.8 |
2022-12-29 | AQS | 360270007 | 1 | 0.030 | ppm | 28 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 4.4 | 10.7 | -1.8 | 74.3 |
2022-12-30 | AQS | 360270007 | 1 | 0.028 | ppm | 26 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 10.7 | 16.6 | 4.9 | 68.5 |
339 rows × 24 columns
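As an alternative to assigning columns one by one, Pandas' join method matches rows on a shared index in one call. Below is a minimal sketch with tiny made-up frames standing in for df_AQS and df_weather (the values and the column name ozone_ppm are invented for illustration):

```python
import pandas as pd

# Two tiny frames standing in for df_AQS and df_weather (synthetic values,
# not the real Millbrook data), both indexed by date
dates = pd.to_datetime(['2022-01-15', '2022-01-16', '2022-01-17'])
ozone = pd.DataFrame({'ozone_ppm': [0.034, 0.037, 0.038]}, index=dates)
weather = pd.DataFrame({'T_DAILY_MEAN': [-14.7, -10.4, 2.0],
                        'RH_DAILY_MAX': [60.0, 87.1, 87.1]},
                       index=dates)

# join() aligns rows on the shared index, like the column assignment above
merged = ozone.join(weather[['T_DAILY_MEAN']])
print(merged)
```

Rows whose dates appear in only one frame would get NaN in the joined columns, which is the same alignment behavior as direct column assignment.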
Exercise 1.1: Combine the two dataframes by adding 'Daily Max 8-hour Ozone Concentration' from df_AQS to df_weather.¶
Exercise 1.2: Use describe to get the basic statistics of the combined datasets df_AQS and df_weather. Can you tell what the difference is between these two dataframes?¶
2. Correlation Analysis¶
2.1 Scatter Plots¶
Scatter plots allow you to quickly see if two variables are positively, negatively, or not correlated at all. You can directly use the plotting functions in Pandas:
df_AQS.plot.scatter('T_DAILY_MEAN','Daily Max 8-hour Ozone Concentration')
<Axes: xlabel='T_DAILY_MEAN', ylabel='Daily Max 8-hour Ozone Concentration'>
2.2 Pearson Correlation Coefficient¶
The Pearson correlation coefficient (often denoted as r) is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables.
The Pearson correlation coefficient r is calculated using the following formula:
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where $x_i$ and $y_i$ are individual data points for variables $x$ and $y$; $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$; $n$ is the number of data points.
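To make the formula concrete, here it is applied to a tiny made-up dataset of four points (the numbers are invented purely for illustration, not taken from the ozone data):

```python
import numpy as np

# Worked example of the Pearson formula on four invented points
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Numerator: sum of products of deviations from the means
num = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator: square root of the product of the sums of squared deviations
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r = num / den
print(r)  # 0.8
```

Working through the arithmetic by hand: the numerator is 4.0, the denominator is sqrt(5 × 5) = 5, so r = 0.8.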
Exercise 2: Use the above formula to calculate the Pearson correlation coefficient r between 'Daily Max 8-hour Ozone Concentration' and 'T_DAILY_MEAN' in df_AQS.¶
You may find it a pain to calculate the r value by hand, but you only need to do it once! In Pandas, you can directly calculate the Pearson correlation coefficient using the corr function.
pearson_corr = df_AQS['T_DAILY_MEAN'].corr(df_AQS['Daily Max 8-hour Ozone Concentration'], method='pearson')
pearson_corr
np.float64(0.3047365963742261)
Check if the value is the same as the r value you calculated in Exercise 2.
2.3 Spearman Rank Correlation Coefficient¶
In Week 2, we learned that rank-based statistics are more appropriate for censored datasets with non-detects.
The Spearman rank correlation coefficient (denoted as ρ) is a non-parametric measure of the strength and direction of the monotonic relationship between two ranked variables. Spearman correlation works on the ranks of the data, not the actual values. Each data point is assigned a rank, and the correlation is calculated based on these ranks. Like Pearson correlation, the Spearman correlation coefficient ranges from -1 to 1:
- ρ=1: Perfect positive monotonic relationship. As one variable increases, the other increases as well.
- ρ=−1: Perfect negative monotonic relationship. As one variable increases, the other decreases.
- ρ=0: No monotonic relationship between the variables.
The Spearman rank correlation coefficient does not assume normality or linearity, and it can be applied to ordinal, interval, or ratio data. Because it uses ranks, it is more robust to outliers and non-detects than the Pearson correlation.
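To see this robustness concretely, here is a small sketch with made-up numbers: a perfectly monotonic relationship plus one extreme outlier in y. All values are invented for illustration.

```python
import pandas as pd

# Synthetic illustration: y increases with x at every step,
# but the last y value is an extreme outlier
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18, 200])

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')
# Spearman is exactly 1.0 (the ranks are perfectly monotonic);
# Pearson (about 0.59) is dragged down by the single outlier
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```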
To calculate the Spearman rank correlation coefficient in Pandas, we simply change the method argument to 'spearman':
spearman_corr = df_AQS['T_DAILY_MEAN'].corr(df_AQS['Daily Max 8-hour Ozone Concentration'], method='spearman')
spearman_corr
np.float64(0.3031817198225167)
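Pandas reports only the coefficient itself. If you also want a significance test, SciPy (assuming it is installed in your environment) provides scipy.stats.pearsonr and scipy.stats.spearmanr, which return both the coefficient and a p-value. The sketch below uses synthetic data; note that these SciPy functions expect arrays with no missing values, so drop NaNs first on real data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for a temperature column and an ozone column
rng = np.random.default_rng(0)
temp = rng.normal(15, 10, 100)
ozone = 0.03 + 0.0005 * temp + rng.normal(0, 0.005, 100)

# Each call returns (coefficient, p-value)
r, p_r = stats.pearsonr(temp, ozone)
rho, p_rho = stats.spearmanr(temp, ozone)
print(f"Pearson  r   = {r:.3f} (p = {p_r:.2e})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.2e})")
```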
2.4 Correlation Matrix¶
A correlation matrix is a table of correlation coefficients (r) showing the strength of the relationship between every pair of variables in your dataset.
# Calculate Pearson correlation matrix for multiple numeric columns
corr_matrix = df_AQS[['T_DAILY_MAX','T_DAILY_MIN','T_DAILY_MEAN','Daily Max 8-hour Ozone Concentration']].corr(method='pearson', numeric_only = True)
# We can visualize the correlation matrix by setting the style of the output.
corr_matrix.style.background_gradient(cmap='Blues')
T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | Daily Max 8-hour Ozone Concentration | |
---|---|---|---|---|
T_DAILY_MAX | 1.000000 | 0.905145 | 0.978251 | 0.379406 |
T_DAILY_MIN | 0.905145 | 1.000000 | 0.973623 | 0.206968 |
T_DAILY_MEAN | 0.978251 | 0.973623 | 1.000000 | 0.304737 |
Daily Max 8-hour Ozone Concentration | 0.379406 | 0.206968 | 0.304737 | 1.000000 |
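With many variables, it can be hard to eyeball the full matrix. One common trick is to keep only the upper triangle (each pair once, without the diagonal of 1s) and rank the pairs by the magnitude of r. A minimal sketch with made-up data (the columns a, b, c are invented):

```python
import numpy as np
import pandas as pd

# Synthetic frame: a and b share a common signal, c is independent noise
rng = np.random.default_rng(4)
base = rng.normal(size=60)
df = pd.DataFrame({
    'a': base + rng.normal(scale=0.1, size=60),
    'b': base + rng.normal(scale=0.5, size=60),
    'c': rng.normal(size=60),
})

corr = df.corr(method='pearson')
# Keep one copy of each pair (strict upper triangle), drop NaNs, rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs.sort_values(key=abs, ascending=False))
```

The pair (a, b) should come out on top, since those two columns were built from the same underlying signal.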
3. Linear Regression¶
There are multiple ways to conduct linear regression in Python. Here we use the statsmodels library, which provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and exploring data.
import statsmodels.api as sm
# Define the independent (X) and dependent (y) variables
X = df_AQS['T_DAILY_MAX'] # Independent variable
y = df_AQS['Daily Max 8-hour Ozone Concentration']*1000. # Dependent variable
# Add a constant (intercept) to X
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()
model.summary()
Dep. Variable: | Daily Max 8-hour Ozone Concentration | R-squared: | 0.144 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.141 |
Method: | Least Squares | F-statistic: | 56.67 |
Date: | Sun, 29 Sep 2024 | Prob (F-statistic): | 4.76e-13 |
Time: | 15:19:23 | Log-Likelihood: | -1226.9 |
No. Observations: | 339 | AIC: | 2458. |
Df Residuals: | 337 | BIC: | 2465. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 32.7306 | 0.903 | 36.251 | 0.000 | 30.955 | 34.507 |
T_DAILY_MAX | 0.3396 | 0.045 | 7.528 | 0.000 | 0.251 | 0.428 |
Omnibus: | 2.910 | Durbin-Watson: | 0.972 |
---|---|---|---|
Prob(Omnibus): | 0.233 | Jarque-Bera (JB): | 2.221 |
Skew: | 0.007 | Prob(JB): | 0.329 |
Kurtosis: | 2.604 | Cond. No. | 36.8 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Extract the intercept and slope
intercept, slope = model.params
# Print the slope and intercept
print(f"Intercept: {intercept}")
print(f"Slope (Coefficient): {slope}")
Intercept: 32.73055575275309
Slope (Coefficient): 0.33960906163846516
Exercise 3: Based on what you learned from the video, explain the linear regression results summary:¶
A. What do the slope and intercept mean here? What are their units?
B. What does R-squared mean here?
C. Using the linear relationship, predict the level of ozone at 30 ˚C.
D. What are the degrees of freedom of this linear model?
4. Multiple Linear Regression¶
If you have multiple independent variables, the process is similar. You just need to include more columns in your X.
# Define the independent (X) and dependent (y) variables
X = df_AQS[['T_DAILY_MAX','RH_DAILY_MAX']] # Independent variables
y = df_AQS['Daily Max 8-hour Ozone Concentration']*1000. # Dependent variable
# Add a constant (intercept) to X
X_with_const = sm.add_constant(X)
# Fit the linear regression model
model = sm.OLS(y, X_with_const).fit()
# Get the regression results summary
model.summary()
Dep. Variable: | Daily Max 8-hour Ozone Concentration | R-squared: | 0.205 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.200 |
Method: | Least Squares | F-statistic: | 43.34 |
Date: | Sun, 29 Sep 2024 | Prob (F-statistic): | 1.80e-17 |
Time: | 15:19:53 | Log-Likelihood: | -1214.4 |
No. Observations: | 339 | AIC: | 2435. |
Df Residuals: | 336 | BIC: | 2446. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 59.2369 | 5.287 | 11.203 | 0.000 | 48.836 | 69.638 |
T_DAILY_MAX | 0.4494 | 0.049 | 9.246 | 0.000 | 0.354 | 0.545 |
RH_DAILY_MAX | -0.3274 | 0.064 | -5.083 | 0.000 | -0.454 | -0.201 |
Omnibus: | 3.131 | Durbin-Watson: | 1.031 |
---|---|---|---|
Prob(Omnibus): | 0.209 | Jarque-Bera (JB): | 2.347 |
Skew: | 0.017 | Prob(JB): | 0.309 |
Kurtosis: | 2.594 | Cond. No. | 988. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exercise 4: Based on what you learned from the video, explain the multiple linear regression results summary:¶
A. What do the coefficients mean here? What are their units?
B. What are the degrees of freedom of this model?
C. What is the difference between R-squared and Adj. R-squared?
5. Assignment: Apply correlation and regression analysis to the datasets you found in Week 2 (or any other datasets that you use).¶
Note: If you are both careful and lucky (i.e., lucky enough to have chosen the right dataset), you should be able to meld this weekly assignment into your final paper.
5.1 In one sentence, explain the purpose of the analysis you will do.¶
5.2 Correlation analysis¶
5.3 Linear regression analysis¶