Week 6: Correlation and Regressions in Python¶
This week, you will learn how to conduct correlation and linear regression analysis in Python.
1. Linking Datasets in Python¶
In Environmental Sciences, we often want to explore the relationship between two sets of measurements (e.g., how is air quality correlated with temperature?). The first step of an exploratory correlation analysis is to link the two datasets by matching the measurements taken at the same time and place. Matching datasets can be a pain in Excel, but with the indexing functionality in Pandas it is convenient to link datasets.
#Getting all the packages we need:
import pandas as pd
import numpy as np
Here we use an example to explore the correlation between ozone air quality and temperature at the Millbrook site in New York. We have two datasets: (1) daily ozone measurements at this site (https://www.epa.gov/outdoor-air-quality-data/download-daily-data); (2) NOAA weather station data at the same site (https://www.ncdc.noaa.gov/data-access/land-based-station-data).
First, let's read in these two datasets using the Pandas read_csv function:
df_AQS = pd.read_csv('Millbrook_NY_daily_ozone_2022.csv', parse_dates = ['Date'])
df_AQS.head()
Date | Source | Site ID | POC | Daily Max 8-hour Ozone Concentration | Units | Daily AQI Value | Local Site Name | Daily Obs Count | Percent Complete | ... | AQS Parameter Description | Method Code | CBSA Code | CBSA Name | State FIPS Code | State | County FIPS Code | County | Site Latitude | Site Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-01-15 | AQS | 360270007 | 1 | 0.034 | ppm | 31 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
1 | 2022-01-16 | AQS | 360270007 | 1 | 0.037 | ppm | 34 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
2 | 2022-01-17 | AQS | 360270007 | 1 | 0.038 | ppm | 35 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
3 | 2022-01-18 | AQS | 360270007 | 1 | 0.042 | ppm | 39 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
4 | 2022-01-19 | AQS | 360270007 | 1 | 0.029 | ppm | 27 | MILLBROOK | 17 | 100.0 | ... | Ozone | 87 | 35620 | New York-Newark-Jersey City, NY-NJ-PA | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 |
5 rows × 21 columns
df_weather = pd.read_csv('Millbrook_NY_daily_weather.csv', parse_dates = ['LST_DATE'])
df_weather.head()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 rows × 29 columns
df_weather.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_TYPE', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
Next, we need to merge the two dataframes by matching the date and location. Since this dataset covers only one site, we don't need to match the location, so we just match the date. Set the date column of each dataframe as its index:
df_AQS = df_AQS.set_index('Date')
df_weather = df_weather.set_index('LST_DATE')
If we only need to add selected columns from df_weather to df_AQS, we just assign the new columns to df_AQS. For a single column:
df_AQS['T_DAILY_MEAN'] = df_weather['T_DAILY_MEAN']
For multiple columns:
df_AQS[['T_DAILY_MAX','T_DAILY_MIN','RH_DAILY_MAX']] = df_weather[['T_DAILY_MAX','T_DAILY_MIN','RH_DAILY_MAX']]
Now you should see that df_AQS has four new columns: three with temperature and one with relative humidity. We've merged these two datasets!
df_AQS
Source | Site ID | POC | Daily Max 8-hour Ozone Concentration | Units | Daily AQI Value | Local Site Name | Daily Obs Count | Percent Complete | AQS Parameter Code | ... | State FIPS Code | State | County FIPS Code | County | Site Latitude | Site Longitude | T_DAILY_MEAN | T_DAILY_MAX | T_DAILY_MIN | RH_DAILY_MAX | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||||||||||||
2022-01-15 | AQS | 360270007 | 1 | 0.034 | ppm | 31 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -14.7 | -11.6 | -17.9 | 60.0 |
2022-01-16 | AQS | 360270007 | 1 | 0.037 | ppm | 34 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -10.4 | -1.1 | -19.7 | 87.1 |
2022-01-17 | AQS | 360270007 | 1 | 0.038 | ppm | 35 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 2.0 | 5.2 | -1.1 | 87.1 |
2022-01-18 | AQS | 360270007 | 1 | 0.042 | ppm | 39 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -4.8 | 0.3 | -9.9 | 66.3 |
2022-01-19 | AQS | 360270007 | 1 | 0.029 | ppm | 27 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -2.4 | 5.8 | -10.6 | 76.8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-26 | AQS | 360270007 | 1 | 0.033 | ppm | 31 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -6.8 | -2.4 | -11.1 | 70.9 |
2022-12-27 | AQS | 360270007 | 1 | 0.030 | ppm | 28 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | -4.4 | -0.8 | -8.0 | 81.9 |
2022-12-28 | AQS | 360270007 | 1 | 0.027 | ppm | 25 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 0.7 | 7.4 | -6.1 | 80.8 |
2022-12-29 | AQS | 360270007 | 1 | 0.030 | ppm | 28 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 4.4 | 10.7 | -1.8 | 74.3 |
2022-12-30 | AQS | 360270007 | 1 | 0.028 | ppm | 26 | MILLBROOK | 17 | 100.0 | 44201 | ... | 36 | New York | 27 | Dutchess | 41.78555 | -73.74136 | 10.7 | 16.6 | 4.9 | 68.5 |
339 rows × 24 columns
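As an alternative to assigning columns one by one, Pandas' join method matches rows on a shared index in one call. Below is a minimal sketch with tiny made-up frames standing in for df_AQS and df_weather (the values and the column name ozone_ppm are invented for illustration):

```python
import pandas as pd

# Two tiny frames standing in for df_AQS and df_weather (synthetic values,
# not the real Millbrook data), both indexed by date
dates = pd.to_datetime(['2022-01-15', '2022-01-16', '2022-01-17'])
ozone = pd.DataFrame({'ozone_ppm': [0.034, 0.037, 0.038]}, index=dates)
weather = pd.DataFrame({'T_DAILY_MEAN': [-14.7, -10.4, 2.0],
                        'RH_DAILY_MAX': [60.0, 87.1, 87.1]},
                       index=dates)

# join() aligns rows on the shared index, like the column assignment above
merged = ozone.join(weather[['T_DAILY_MEAN']])
print(merged)
```

Rows whose dates appear in only one frame would get NaN in the joined columns, which is the same alignment behavior as direct column assignment.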
Exercise 1.1: Combine the two dataframes by adding 'Daily Max 8-hour Ozone Concentration' from df_AQS to df_weather.¶
Exercise 1.2: Use describe to get the basic statistics of the combined datasets df_AQS and df_weather. Can you tell what the difference is between these two dataframes?¶
2. Correlation Analysis¶
2.1 Scatter Plots¶
Scatter plots allow you to quickly see if two variables are positively, negatively, or not correlated at all. You can directly use the plotting functions in Pandas:
df_AQS.plot.scatter('T_DAILY_MEAN','Daily Max 8-hour Ozone Concentration')
<Axes: xlabel='T_DAILY_MEAN', ylabel='Daily Max 8-hour Ozone Concentration'>
2.2 Pearson Correlation Coefficient¶
The Pearson correlation coefficient (often denoted as r) is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables.
The Pearson correlation coefficient r is calculated using the following formula:
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where $x_i$ and $y_i$ are individual data points for variables $x$ and $y$; $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$; $n$ is the number of data points.
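To make the formula concrete, here it is applied to a tiny made-up dataset of four points (the numbers are invented purely for illustration, not taken from the ozone data):

```python
import numpy as np

# Worked example of the Pearson formula on four invented points
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0])

# Numerator: sum of products of deviations from the means
num = np.sum((x - x.mean()) * (y - y.mean()))
# Denominator: square root of the product of the sums of squared deviations
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r = num / den
print(r)  # 0.8
```

Working through the arithmetic by hand: the numerator is 4.0, the denominator is sqrt(5 × 5) = 5, so r = 0.8.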
Exercise 2: Use the above formula to calculate the Pearson correlation coefficient r between 'Daily Max 8-hour Ozone Concentration' and 'T_DAILY_MEAN' in df_AQS.¶
You may find it a pain to calculate the r value by hand, but you only need to do it once! In Pandas, you can directly calculate the Pearson correlation coefficient using the corr function.
pearson_corr = df_AQS['T_DAILY_MEAN'].corr(df_AQS['Daily Max 8-hour Ozone Concentration'], method='pearson')
pearson_corr
np.float64(0.3047365963742261)
Check if the value is the same as the r value you calculated in Exercise 2.
2.3 Spearman Rank Correlation Coefficient¶
In Week 2, we learned that rank-based statistics are more appropriate for censored datasets with non-detects.
The Spearman rank correlation coefficient (denoted as ρ) is a non-parametric measure of the strength and direction of the monotonic relationship between two ranked variables. Spearman correlation works on the ranks of the data, not the actual values. Each data point is assigned a rank, and the correlation is calculated based on these ranks. Like Pearson correlation, the Spearman correlation coefficient ranges from -1 to 1:
- ρ=1: Perfect positive monotonic relationship. As one variable increases, the other increases as well.
- ρ=−1: Perfect negative monotonic relationship. As one variable increases, the other decreases.
- ρ=0: No monotonic relationship between the variables.
The Spearman rank correlation coefficient does not assume normality or linearity, and it can be applied to ordinal, interval, or ratio data. Because it uses ranks, it is more robust to outliers and non-detects than the Pearson correlation.
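To see this robustness concretely, here is a small sketch with made-up numbers: a perfectly monotonic relationship plus one extreme outlier in y. All values are invented for illustration.

```python
import pandas as pd

# Synthetic illustration: y increases with x at every step,
# but the last y value is an extreme outlier
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = pd.Series([2, 4, 6, 8, 10, 12, 14, 16, 18, 200])

pearson = x.corr(y, method='pearson')
spearman = x.corr(y, method='spearman')
# Spearman is exactly 1.0 (the ranks are perfectly monotonic);
# Pearson (about 0.59) is dragged down by the single outlier
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```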
To calculate the Spearman rank correlation coefficient in Pandas, we simply change the method argument to 'spearman':
spearman_corr = df_AQS['T_DAILY_MEAN'].corr(df_AQS['Daily Max 8-hour Ozone Concentration'], method='spearman')
spearman_corr
np.float64(0.3031817198225167)
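Pandas reports only the coefficient itself. If you also want a significance test, SciPy (assuming it is installed in your environment) provides scipy.stats.pearsonr and scipy.stats.spearmanr, which return both the coefficient and a p-value. The sketch below uses synthetic data; note that these SciPy functions expect arrays with no missing values, so drop NaNs first on real data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for a temperature column and an ozone column
rng = np.random.default_rng(0)
temp = rng.normal(15, 10, 100)
ozone = 0.03 + 0.0005 * temp + rng.normal(0, 0.005, 100)

# Each call returns (coefficient, p-value)
r, p_r = stats.pearsonr(temp, ozone)
rho, p_rho = stats.spearmanr(temp, ozone)
print(f"Pearson  r   = {r:.3f} (p = {p_r:.2e})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.2e})")
```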
2.4 Correlation Matrix¶
A correlation matrix is a table of correlation coefficients (r) showing the strength of the relationship between every pair of variables in your dataset.
# Calculate Pearson correlation matrix for multiple numeric columns
corr_matrix = df_AQS[['T_DAILY_MAX','T_DAILY_MIN','T_DAILY_MEAN','Daily Max 8-hour Ozone Concentration']].corr(method='pearson', numeric_only = True)
# We can visualize the correlation matrix by setting the style of the output.
corr_matrix.style.background_gradient(cmap='Blues')
T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | Daily Max 8-hour Ozone Concentration | |
---|---|---|---|---|
T_DAILY_MAX | 1.000000 | 0.905145 | 0.978251 | 0.379406 |
T_DAILY_MIN | 0.905145 | 1.000000 | 0.973623 | 0.206968 |
T_DAILY_MEAN | 0.978251 | 0.973623 | 1.000000 | 0.304737 |
Daily Max 8-hour Ozone Concentration | 0.379406 | 0.206968 | 0.304737 | 1.000000 |
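With many variables, it can be hard to eyeball the full matrix. One common trick is to keep only the upper triangle (each pair once, without the diagonal of 1s) and rank the pairs by the magnitude of r. A minimal sketch with made-up data (the columns a, b, c are invented):

```python
import numpy as np
import pandas as pd

# Synthetic frame: a and b share a common signal, c is independent noise
rng = np.random.default_rng(4)
base = rng.normal(size=60)
df = pd.DataFrame({
    'a': base + rng.normal(scale=0.1, size=60),
    'b': base + rng.normal(scale=0.5, size=60),
    'c': rng.normal(size=60),
})

corr = df.corr(method='pearson')
# Keep one copy of each pair (strict upper triangle), drop NaNs, rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
print(pairs.sort_values(key=abs, ascending=False))
```

The pair (a, b) should come out on top, since those two columns were built from the same underlying signal.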
3. Linear Regression¶
There are multiple ways to conduct linear regression in Python. Here we use the statsmodels library, which provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and exploring data.
import statsmodels.api as sm
# Define the independent (X) and dependent (y) variables
X = df_AQS['T_DAILY_MAX'] # Independent variable
y = df_AQS['Daily Max 8-hour Ozone Concentration']*1000. # Dependent variable
# Add a constant (intercept) to X
X_with_const = sm.add_constant(X)
model = sm.OLS(y, X_with_const).fit()
model.summary()
Dep. Variable: | Daily Max 8-hour Ozone Concentration | R-squared: | 0.144 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.141 |
Method: | Least Squares | F-statistic: | 56.67 |
Date: | Sun, 29 Sep 2024 | Prob (F-statistic): | 4.76e-13 |
Time: | 15:19:23 | Log-Likelihood: | -1226.9 |
No. Observations: | 339 | AIC: | 2458. |
Df Residuals: | 337 | BIC: | 2465. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 32.7306 | 0.903 | 36.251 | 0.000 | 30.955 | 34.507 |
T_DAILY_MAX | 0.3396 | 0.045 | 7.528 | 0.000 | 0.251 | 0.428 |
Omnibus: | 2.910 | Durbin-Watson: | 0.972 |
---|---|---|---|
Prob(Omnibus): | 0.233 | Jarque-Bera (JB): | 2.221 |
Skew: | 0.007 | Prob(JB): | 0.329 |
Kurtosis: | 2.604 | Cond. No. | 36.8 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Extract the intercept and slope
intercept, slope = model.params
# Print the slope and intercept
print(f"Intercept: {intercept}")
print(f"Slope (Coefficient): {slope}")
Intercept: 32.73055575275309
Slope (Coefficient): 0.33960906163846516
Exercise 3: Based on what you learned from the video, explain the linear regression results summary:¶
A. What do the slope and intercept mean here? What are their units?
B. What does R-squared mean here?
C. Using the linear relationship, predict the level of ozone at 30 ˚C.
D. What are the degrees of freedom of this linear model?
4. Multiple Linear Regression¶
If you have multiple independent variables, the process is similar. You just need to include more columns in your X.
# Define the independent (X) and dependent (y) variables
X = df_AQS[['T_DAILY_MAX','RH_DAILY_MAX']] # Independent variables
y = df_AQS['Daily Max 8-hour Ozone Concentration']*1000. # Dependent variable
# Add a constant (intercept) to X
X_with_const = sm.add_constant(X)
# Fit the linear regression model
model = sm.OLS(y, X_with_const).fit()
# Get the regression results summary
model.summary()
Dep. Variable: | Daily Max 8-hour Ozone Concentration | R-squared: | 0.205 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.200 |
Method: | Least Squares | F-statistic: | 43.34 |
Date: | Sun, 29 Sep 2024 | Prob (F-statistic): | 1.80e-17 |
Time: | 15:19:53 | Log-Likelihood: | -1214.4 |
No. Observations: | 339 | AIC: | 2435. |
Df Residuals: | 336 | BIC: | 2446. |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 59.2369 | 5.287 | 11.203 | 0.000 | 48.836 | 69.638 |
T_DAILY_MAX | 0.4494 | 0.049 | 9.246 | 0.000 | 0.354 | 0.545 |
RH_DAILY_MAX | -0.3274 | 0.064 | -5.083 | 0.000 | -0.454 | -0.201 |
Omnibus: | 3.131 | Durbin-Watson: | 1.031 |
---|---|---|---|
Prob(Omnibus): | 0.209 | Jarque-Bera (JB): | 2.347 |
Skew: | 0.017 | Prob(JB): | 0.309 |
Kurtosis: | 2.594 | Cond. No. | 988. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Exercise 4: Based on what you learned from the video, explain the multiple linear regression results summary:¶
A. What do the coefficients mean here? What are their units?
B. What are the degrees of freedom of this model?
C. What is the difference between R-squared and Adj. R-squared?
5. Assignment: Apply correlation and regression analysis to the datasets you found in Week 2 (or any other datasets that you use).¶
Note: If you are both careful and lucky (i.e., lucky enough to have chosen the right dataset), you should be able to meld this weekly assignment into your final paper.
5.1 In one sentence, explain the purpose of the analysis you will do.¶
5.2 Correlation analysis¶
5.3 Linear regression analysis¶