Week 7: Time series analysis¶
Learning Goals:
- Understand datetime objects in Python.
- Analyze the four components of time series: seasonality, cycle, trend and variations.
First, import the Python packages we need.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
1. Working with Datetime objects¶
Like the data we work with in this exercise, many environmental datasets include time-stamped records. The standard datetime library is the primary way of manipulating dates and times in Python, but several third-party packages provide additional support.
A few worth exploring are dateutil, an extension of the datetime library that is useful for parsing timestamps, and pytz, which provides a smooth way of handling time zones.
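For instance, here is a minimal sketch of what these two packages offer (assuming both are installed; neither is required for the rest of this exercise):
from dateutil import parser
import pytz
# dateutil parses timestamps in many common formats automatically
dt = parser.parse("April 5, 2017 2:30 PM")
print(dt)  # 2017-04-05 14:30:00
# pytz attaches a time zone to a naive datetime
eastern = pytz.timezone("US/Eastern")
print(eastern.localize(dt))  # 2017-04-05 14:30:00-04:00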
Though we will not review datetime objects in depth here, it is useful to understand the basics of handling them in Python, as they are a central component of time series data.
Here we will introduce a few pandas functions built on the datetime library to handle datetime objects.
1.1 Setting date range¶
The pd.date_range() function allows you to build a DatetimeIndex with a fixed frequency. This can be done by specifying a start date and an end date as follows:
# Define a range of dates. By default, the frequency is daily.
pd.date_range('4/1/2017','4/30/2017')
DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16', '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20', '2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24', '2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28', '2017-04-29', '2017-04-30'], dtype='datetime64[ns]', freq='D')
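Instead of an end date, you can also specify how many periods to generate (the periods argument is standard pandas); for example:
# Equivalent to the range above: 30 daily timestamps starting 4/1/2017
pd.date_range('4/1/2017', periods=30, freq='D')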
# Specify start and end, monthly (month-end) frequency
pd.date_range('4/1/2017','4/30/2018', freq = 'ME')
DatetimeIndex(['2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='ME')
# Specify start and end, 5min frequency
# datetime64[ns] values are stored as 64-bit integers counting nanoseconds since 1970-01-01T00:00:00 (the Unix epoch)
pd.date_range('4/1/2017','4/30/2018', freq = '5min')
DatetimeIndex(['2017-04-01 00:00:00', '2017-04-01 00:05:00', '2017-04-01 00:10:00', '2017-04-01 00:15:00', '2017-04-01 00:20:00', '2017-04-01 00:25:00', '2017-04-01 00:30:00', '2017-04-01 00:35:00', '2017-04-01 00:40:00', '2017-04-01 00:45:00', ... '2018-04-29 23:15:00', '2018-04-29 23:20:00', '2018-04-29 23:25:00', '2018-04-29 23:30:00', '2018-04-29 23:35:00', '2018-04-29 23:40:00', '2018-04-29 23:45:00', '2018-04-29 23:50:00', '2018-04-29 23:55:00', '2018-04-30 00:00:00'], dtype='datetime64[ns]', length=113473, freq='5min')
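To see this integer representation directly, you can inspect the underlying value of a single timestamp; a quick sketch:
# A pandas Timestamp stores nanoseconds since the epoch as a 64-bit integer
ts = pd.Timestamp('2017-04-01')
print(ts.value)                # 1491004800000000000 nanoseconds since 1970-01-01
print(ts.value / 1e9 / 86400)  # 17257.0 days since the epoch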
1.2 Dealing with existing timestamps¶
Ideally, the tabular data we use already contain date and time information. In pandas, we can convert such a column to the datetime type.
Here we're going to use daily weather data from the Millbrook, NY site. We used this dataset last week.
df = pd.read_csv('Millbrook_NY_daily_weather.csv')
df.head()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 rows × 29 columns
df.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_TYPE', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
Which of the columns is likely to store the date and time information?
df['LST_DATE']
0 2016-01-01 1 2016-01-02 2 2016-01-03 3 2016-01-04 4 2016-01-05 ... 2552 2022-12-27 2553 2022-12-28 2554 2022-12-29 2555 2022-12-30 2556 2022-12-31 Name: LST_DATE, Length: 2557, dtype: object
While the values certainly resemble dates, they are stored in pandas with the "object" dtype, which means pandas does not recognize them as datetimes and cannot take advantage of any date-specific functionality.
Using the pd.to_datetime() function, we can convert this column to datetime objects:
pd.to_datetime(df['LST_DATE'])
0 2016-01-01 1 2016-01-02 2 2016-01-03 3 2016-01-04 4 2016-01-05 ... 2552 2022-12-27 2553 2022-12-28 2554 2022-12-29 2555 2022-12-30 2556 2022-12-31 Name: LST_DATE, Length: 2557, dtype: datetime64[ns]
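If your dates were stored in a less standard format, pd.to_datetime accepts a format argument telling it how to parse the strings; a hypothetical sketch (our LST_DATE values do not need this):
# Hypothetical strings in month/day/year format
pd.to_datetime(pd.Series(['01/31/2016', '02/29/2016']), format='%m/%d/%Y')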
# Convert the LST_DATE column to datetime objects.
df['LST_DATE'] = pd.to_datetime(df['LST_DATE'])
# Alternatively, we can parse LST_DATE as datetime objects by setting parse_dates when reading in the csv data:
df = pd.read_csv('Millbrook_NY_daily_weather.csv', parse_dates = ['LST_DATE'])
# Set the Date column as index:
df = df.set_index('LST_DATE')
# Now it's more intuitive to interpret the data:
df['T_DAILY_MEAN']
LST_DATE 2016-01-01 1.5 2016-01-02 -0.4 2016-01-03 1.6 2016-01-04 -6.9 2016-01-05 -10.3 ... 2022-12-27 -4.4 2022-12-28 0.7 2022-12-29 4.4 2022-12-30 10.7 2022-12-31 7.9 Name: T_DAILY_MEAN, Length: 2557, dtype: float64
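A DatetimeIndex also enables partial string indexing, so we can select records by year or month without spelling out exact dates; a small sketch:
# Select all daily records from June 2017
df['T_DAILY_MEAN'].loc['2017-06']
# Or select a whole year
df['T_DAILY_MEAN'].loc['2017']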
1.3 Plotting time series in Pandas¶
After setting the date as the index, it's very simple to make time series plots in pandas. You don't even need to set the x-axis label: pandas automatically chooses the date tick frequency that best displays the date information.
df['T_DAILY_MEAN'].plot()
<Axes: xlabel='LST_DATE'>
Since we have multiple years of data, Pandas will automatically choose to show the year only.
Exercise 1: Make a time series plot of daily maximum and minimum temperature. You should see two lines.¶
1.4 Access datetime attributes¶
Now that we have a DatetimeIndex, we can access specific attributes of the datetime objects like the year, day, hour, etc. To do this, we access the desired attribute using dot notation: df.index.attribute. For a full list of attributes, see the pd.DatetimeIndex documentation. For example:
# Get the month of each record
df.index.month
Index([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 12, 12, 12, 12, 12, 12, 12, 12, 12, 12], dtype='int32', name='LST_DATE', length=2557)
# Get the year of each record, and assign this to a new column
df['year'] = df.index.year
df
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | 1.69 | ... | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN | 2016 |
2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | 6.25 | ... | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN | 2016 |
2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | 5.69 | ... | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN | 2016 |
2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | 9.17 | ... | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN | 2016 |
2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | 9.34 | ... | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -0.8 | -8.0 | -4.4 | -3.8 | 0.0 | 4.00 | ... | NaN | 0.164 | 0.157 | -0.4 | -0.2 | 0.5 | 2.2 | 4.0 | NaN | 2022 |
2022-12-28 | 64756 | 2.622 | -73.74 | 41.79 | 7.4 | -6.1 | 0.7 | 1.3 | 0.0 | 7.73 | ... | NaN | 0.162 | 0.156 | -0.4 | -0.3 | 0.4 | 2.1 | 3.8 | NaN | 2022 |
2022-12-29 | 64756 | 2.622 | -73.74 | 41.79 | 10.7 | -1.8 | 4.4 | 5.0 | 0.0 | 6.66 | ... | NaN | 0.159 | 0.155 | -0.3 | -0.3 | 0.3 | 1.9 | 3.7 | NaN | 2022 |
2022-12-30 | 64756 | 2.622 | -73.74 | 41.79 | 16.6 | 4.9 | 10.7 | 10.3 | 0.0 | 5.39 | ... | NaN | 0.159 | 0.154 | -0.2 | -0.2 | 0.3 | 1.8 | 3.6 | NaN | 2022 |
2022-12-31 | 64756 | 2.622 | -73.74 | 41.79 | 13.2 | 2.7 | 7.9 | 10.2 | 5.0 | 1.25 | ... | NaN | 0.160 | 0.153 | -0.1 | -0.2 | 0.3 | 1.8 | 3.4 | NaN | 2022 |
2557 rows × 29 columns
# Get the unique year values
df.index.year.unique()
Index([2016, 2017, 2018, 2019, 2020, 2021, 2022], dtype='int32', name='LST_DATE')
# Sometimes you may want to know the day of year:
df.index.dayofyear
Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... 356, 357, 358, 359, 360, 361, 362, 363, 364, 365], dtype='int32', name='LST_DATE', length=2557)
# You can also directly get the day of week (0=Monday, ..., 6=Sunday), which would be difficult to compute on our own.
df.index.dayofweek
Index([4, 5, 6, 0, 1, 2, 3, 4, 5, 6, ... 3, 4, 5, 6, 0, 1, 2, 3, 4, 5], dtype='int32', name='LST_DATE', length=2557)
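These attributes are also handy for filtering records. For example, a small sketch selecting only the summer months (June through August):
# Build a boolean mask from the month attribute
summer = df[df.index.month.isin([6, 7, 8])]
summer['T_DAILY_MEAN'].plot()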
Exercise 2: Create a column in df that represents the month of the date.¶
2. Temporal resampling¶
Another common operation we apply to time series is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample function. The resample periods are specified using pandas offset alias syntax (e.g., 'D' for daily, 'ME' for month end, 'YE' for year end).
2.1 Resample function¶
Below we resample the dataframe by taking the mean over each month.
# First, we need to remove the column that does not have numerical values.
df = df.drop(columns = 'SUR_TEMP_DAILY_TYPE')
df.resample('ME').mean().plot(y='T_DAILY_MEAN', marker='o')
<Axes: xlabel='LST_DATE'>
Resampling can be applied to the entire dataframe.
df_mm = df.resample('ME').mean()
df_mm[['T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_MAX']].plot()
<Axes: xlabel='LST_DATE'>
Just like with groupby, we can apply any aggregation function to our resample operation.
# Resample by maximum values.
df.resample('ME').max().plot(y='T_DAILY_MAX', marker='o')
<Axes: xlabel='LST_DATE'>
2.2 Rolling Operations¶
Rolling mean (or moving average) computes the average of data points in a sliding window over the series. It's typically used to smooth out short-term fluctuations and highlight longer-term trends.
# Set the window size to 30 observations. center=True means the window is centered at the given index. For example, on 10/15/2022, it will return the mean from 10/01 to 10/30.
df['T_DAILY_MEAN'].rolling(30, center=True).mean().plot()
<Axes: xlabel='LST_DATE'>
You may notice that there are some breaks in the time series. If there are missing values in the rolling window, pandas will return NaN for the rolling mean.
# We find 12 observations with missing values for T_DAILY_MEAN.
df.loc[(df.T_DAILY_MEAN.isnull())]
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2017-10-04 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2017 |
2018-10-13 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | 4.6 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2018 |
2019-08-10 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-11 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-12 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-13 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-14 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-15 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-16 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-09-25 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2021-01-04 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
2021-11-16 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
12 rows × 28 columns
Instead of setting the number of observations, we can set the time span of each window. Each window will then have a variable size, based on the observations that fall within the time period.
# Note that days with missing values will be skipped.
df['T_DAILY_MEAN'].rolling('30D', center=True).mean().plot()
<Axes: xlabel='LST_DATE'>
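If you would rather tolerate a few missing days inside a window, rolling also accepts a min_periods argument: the mean is computed whenever at least that many valid observations fall in the window. A sketch with an arbitrary threshold of 25:
# Compute the 30-day rolling mean whenever at least 25 of the 30 values are valid
df['T_DAILY_MEAN'].rolling(30, center=True, min_periods=25).mean().plot()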
Exercise 3: Calculate and plot the 10-day rolling mean of daily maximum temperature T_DAILY_MAX.¶
2.3 Filling missing data¶
Sometimes you may want to fill in the missing values. There are several ways to fill missing data in Pandas.
# Let's first extract a subset of the data
df_sub = df.loc['2017-01-01':'2017-12-31']
# Note there are data breaks in February.
df_sub['SOIL_MOISTURE_10_DAILY'].plot()
<Axes: xlabel='LST_DATE'>
# Fill values forward using ffill function
df_sub.ffill()['SOIL_MOISTURE_10_DAILY'].plot(label = 'Forward')
# Fill values backward using bfill function
df_sub.bfill()['SOIL_MOISTURE_10_DAILY'].plot(label = 'Backward')
# Using interpolate function:
df_sub['SOIL_MOISTURE_10_DAILY'].interpolate('linear').plot(label = 'Linear Interpolation')
plt.legend()
<matplotlib.legend.Legend at 0x3318f06d0>
Exercise 4: Based on the above figure, explain the difference among ffill, bfill, and interpolate.¶
3. Analyze the seasonality and cyclic patterns of time series¶
Recall the four major components of time series: seasonality, cycle, trend, and variations. Many environmental datasets we deal with (e.g., temperature, precipitation, air quality) vary seasonally. A common way to analyze such data is to create a "climatology," which contains the average values in each month or day of the year. We can do this easily with groupby. Recall that df.index is a pandas DatetimeIndex object.
monthly_climatology = df.groupby(df.index.month).mean(numeric_only=True)
monthly_climatology
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
1 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.442130 | 2.217130 | 5.655139 | ... | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN | 2019.000000 |
2 | 64756.0 | 2.564424 | -73.74 | 41.79 | 4.866162 | -5.901010 | -0.519192 | -0.220202 | 3.449495 | 8.359444 | ... | 0.219145 | 0.163520 | 0.165425 | 0.557071 | 0.518687 | 0.707071 | 1.256566 | 2.156061 | NaN | 2018.989899 |
3 | 64756.0 | 2.564857 | -73.74 | 41.79 | 9.015668 | -2.634101 | 3.184793 | 3.370046 | 2.426728 | 12.813917 | ... | 0.224130 | 0.168618 | 0.166180 | 3.385714 | 3.270507 | 3.157604 | 3.182488 | 3.337788 | NaN | 2019.000000 |
4 | 64756.0 | 2.564857 | -73.74 | 41.79 | 14.930476 | 1.948095 | 8.438095 | 8.733810 | 3.217143 | 14.977381 | ... | 0.235376 | 0.165395 | 0.165433 | 9.685714 | 9.460476 | 8.950000 | 8.119524 | 7.199524 | NaN | 2019.000000 |
5 | 64756.0 | 2.564857 | -73.74 | 41.79 | 20.996774 | 8.094009 | 14.548848 | 14.772811 | 3.182949 | 17.912673 | ... | 0.221042 | 0.156848 | 0.161083 | 16.721198 | 16.471889 | 15.606019 | 14.184793 | 12.509217 | NaN | 2019.000000 |
6 | 64756.0 | 2.564857 | -73.74 | 41.79 | 25.874286 | 12.033810 | 18.952857 | 19.285714 | 2.290000 | 21.610095 | ... | 0.162114 | 0.136386 | 0.153190 | 22.399524 | 22.177619 | 21.112381 | 19.530476 | 17.661429 | NaN | 2019.000000 |
7 | 64756.0 | 2.564857 | -73.74 | 41.79 | 29.040092 | 15.894470 | 22.465899 | 22.322120 | 3.880184 | 20.864101 | ... | 0.113350 | 0.118535 | 0.139206 | 25.527189 | 25.406912 | 24.316204 | 22.814286 | 21.129630 | NaN | 2019.000000 |
8 | 64756.0 | 2.297069 | -73.74 | 41.79 | 28.139048 | 15.543810 | 21.838571 | 21.615714 | 3.969048 | 18.131429 | ... | 0.128214 | 0.122652 | 0.136627 | 24.912857 | 24.918095 | 24.244762 | 23.358571 | 22.286190 | NaN | 2019.000000 |
9 | 64756.0 | 2.564857 | -73.74 | 41.79 | 23.720096 | 11.202871 | 17.460287 | 17.344019 | 3.948804 | 14.033301 | ... | 0.151890 | 0.127550 | 0.141841 | 20.640191 | 20.736364 | 20.594258 | 20.666986 | 20.554067 | NaN | 2019.000000 |
10 | 64756.0 | 2.590664 | -73.74 | 41.79 | 17.773488 | 5.738140 | 11.753023 | 11.740465 | 3.637963 | 9.193674 | ... | 0.187633 | 0.141386 | 0.144750 | 14.663256 | 14.799070 | 15.074419 | 15.892558 | 16.645581 | NaN | 2019.000000 |
11 | 64756.0 | 2.593429 | -73.74 | 41.79 | 10.377512 | -1.277990 | 4.547368 | 4.720574 | 3.301905 | 6.567895 | ... | 0.237167 | 0.166742 | 0.161895 | 7.128708 | 7.302392 | 7.955288 | 9.385167 | 10.990909 | NaN | 2019.000000 |
12 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.055760 | 3.121198 | 4.639954 | ... | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN | 2019.000000 |
12 rows × 28 columns
Each row in this new dataframe represents the average values for that month (1=January, 2=February, etc.)
We can apply more customized aggregations, as with any groupby operation. Below we keep the mean of the mean, max of the max, and min of the min for the temperature measurements.
monthly_T_climatology = df.groupby(df.index.month).aggregate({'T_DAILY_MEAN': 'mean',
'T_DAILY_MAX': 'max',
'T_DAILY_MIN': 'min'})
monthly_T_climatology
T_DAILY_MEAN | T_DAILY_MAX | T_DAILY_MIN | |
---|---|---|---|
LST_DATE | |||
1 | -2.678241 | 19.8 | -26.0 |
2 | -0.519192 | 24.9 | -24.7 |
3 | 3.184793 | 26.8 | -17.4 |
4 | 8.438095 | 30.6 | -11.3 |
5 | 14.548848 | 33.4 | -3.1 |
6 | 18.952857 | 34.5 | 1.5 |
7 | 22.465899 | 36.2 | 8.2 |
8 | 21.838571 | 36.5 | 6.0 |
9 | 17.460287 | 32.7 | -1.6 |
10 | 11.753023 | 29.9 | -5.9 |
11 | 4.547368 | 24.4 | -15.9 |
12 | -0.217512 | 17.9 | -21.8 |
monthly_T_climatology.plot(marker='o')
<Axes: xlabel='LST_DATE'>
If we want a climatology at a finer scale, we can group by day of year.
daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'T_DAILY_MEAN': 'mean',
'T_DAILY_MAX': 'max',
'T_DAILY_MIN': 'min'})
daily_T_climatology.plot(marker='.')
<Axes: xlabel='LST_DATE'>
Exercise 5: Calculate and plot the seasonal cycle of solar radiation SOLARAD_DAILY at the monthly scale.¶
4. Analyze variations or anomalies¶
A common mode of time series analysis is to remove the seasonality or cyclic patterns from a signal to focus only on the variations or anomaly values. This can be accomplished with a groupby transformation.
# Standardize each value relative to its calendar month:
# subtract the monthly mean and divide by the monthly standard deviation.
def standardize(x):
    return (x - x.mean())/x.std()
anomaly = df.groupby(df.index.month).transform(standardize)
anomaly.plot(y='T_DAILY_MEAN')
<Axes: xlabel='LST_DATE'>
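As a quick sanity check, the standardized anomalies should have a mean of essentially zero (up to floating-point error) and a standard deviation of one within each month:
# Monthly means of the anomalies should all be ~0 after standardization
anomaly.groupby(anomaly.index.month)['T_DAILY_MEAN'].mean()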
5. Trend Analysis¶
We learned about linear regression last week. We can similarly apply linear regression to time series data to fit a linear trend.
import statsmodels.api as sm
To focus on the trend component, we can first resample the data to the annual scale, so that the other components of the time series (seasonality and variations) do not affect the trend estimate. Note that we introduce only the most basic trend analysis here; see Venier et al. (2012) for more advanced trend analysis models used in environmental science research.
df_year = df.resample('YE').mean()
df_year.index
DatetimeIndex(['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31', '2022-12-31'], dtype='datetime64[ns]', name='LST_DATE', freq='YE-DEC')
statsmodels does not recognize datetime objects, so we need to convert the dates into a numerical variable, such as the year.
# Define the independent (X) and dependent (y) variables
X = df_year.index.year  # Independent variable: the year as an integer
y = df_year['SOIL_MOISTURE_5_DAILY'] # Dependent variable
# Add a constant (intercept) to X
X_with_const = sm.add_constant(X)
# Note that the data have missing values. To skip them, we need to set missing='drop'.
model = sm.OLS(y, X_with_const, missing = 'drop').fit()
pred_y = model.predict(X_with_const)
model.summary()
/Users/xjin/miniforge3/envs/esa_env/lib/python3.9/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given. warn("omni_normtest is not valid with less than 8 observations; %i "
Dep. Variable: | SOIL_MOISTURE_5_DAILY | R-squared: | 0.379 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.255 |
Method: | Least Squares | F-statistic: | 3.056 |
Date: | Sun, 06 Oct 2024 | Prob (F-statistic): | 0.141 |
Time: | 15:32:06 | Log-Likelihood: | 16.947 |
No. Observations: | 7 | AIC: | -29.89 |
Df Residuals: | 5 | BIC: | -30.00 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -16.7584 | 9.704 | -1.727 | 0.145 | -41.703 | 8.186 |
x1 | 0.0084 | 0.005 | 1.748 | 0.141 | -0.004 | 0.021 |
Omnibus: | nan | Durbin-Watson: | 2.243 |
---|---|---|---|
Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.522 |
Skew: | 0.306 | Prob(JB): | 0.770 |
Kurtosis: | 1.811 | Cond. No. | 2.04e+06 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.04e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
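Rather than reading the summary table, you can also pull the fitted coefficients and p-values directly from the results object (params and pvalues are standard statsmodels result attributes):
# First element is the intercept, second is the slope (trend per year)
print(model.params)
print(model.pvalues)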
plt.plot(df_year.index,pred_y)
plt.plot(df_year.index,y)
[<matplotlib.lines.Line2D at 0x3366977f0>]
Exercise 6: In the above example, what is the long-term trend of SOIL_MOISTURE_5_DAILY? Is the trend statistically significant (p value < 0.05)?¶
6. Assignment: Conduct time series analysis for a dataset you've found.¶
If the dataset you used for correlation analysis happens to have a temporal dimension, it should be a good dataset to use here. That way, you can continue building this assignment into your final paper.
If you have problems finding a good dataset, you could use the ozone measurements we used last week (Millbrook_NY_daily_ozone_2022.csv).
6.1 Find the column that has date or time information. Read in the file using pandas, setting parse_dates to the column(s) containing the date and time.¶
6.2 Make a time series plot of one of the measurement columns.¶
6.3 Create two new columns year and month to represent the year and month of the data.¶
6.4 Calculate and plot the monthly climatology of the measurements.¶
6.5 Plot the standardized measurements with the seasonal cycle removed.¶
6.6 Calculate the long-term trend of the data. Report the trend and statistical significance (p value) of the trend.¶