Week 7: Time series analysis¶
Learning Goals:
- Understand datetime objects in Python.
- Analyze the four components of time series: seasonality, cycle, trend and variations.
First, import the Python packages we need.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
1. Working with Datetime objects¶
Like the data we work with in this exercise, many environmental datasets include time-stamped records. The standard datetime library is the primary way of manipulating dates and times in Python, but several third-party packages provide additional support.
A few worth exploring are dateutil, an extension of the datetime library that is useful for parsing timestamps, and pytz, which provides a smooth way of handling time zones.
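For instance, here is a minimal sketch of what these two packages offer (assuming both are installed; neither is required for the rest of this exercise):
from dateutil import parser
import pytz
# dateutil parses timestamps in many common formats automatically
dt = parser.parse("April 5, 2017 2:30 PM")
print(dt)  # 2017-04-05 14:30:00
# pytz attaches a time zone to a naive datetime
eastern = pytz.timezone("US/Eastern")
print(eastern.localize(dt))  # 2017-04-05 14:30:00-04:00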
Though we will not review datetime objects in depth here, it is useful to understand the basics of handling them in Python, as they are a central component of time series data.
Here we will introduce a few pandas functions built on the datetime library to handle datetime objects.
1.1 Setting date range¶
The pd.date_range() function allows you to build a DatetimeIndex with a fixed frequency. This can be done by specifying a start date and an end date as follows:
# Define a range of dates. By default, the frequency is daily.
pd.date_range('4/1/2017','4/30/2017')
DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04', '2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08', '2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12', '2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16', '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20', '2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24', '2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28', '2017-04-29', '2017-04-30'], dtype='datetime64[ns]', freq='D')
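Instead of an end date, you can also specify how many periods to generate (the periods argument is standard pandas); for example:
# Equivalent to the range above: 30 daily timestamps starting 4/1/2017
pd.date_range('4/1/2017', periods=30, freq='D')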
# Specify start and end, monthly (month-end) frequency
pd.date_range('4/1/2017','4/30/2018', freq = 'ME')
DatetimeIndex(['2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30'], dtype='datetime64[ns]', freq='ME')
# Specify start and end, 5min frequency
# datetime64[ns] values are stored as 64-bit integers counting nanoseconds since 1970-01-01T00:00:00 (the Unix epoch)
pd.date_range('4/1/2017','4/30/2018', freq = '5min')
DatetimeIndex(['2017-04-01 00:00:00', '2017-04-01 00:05:00', '2017-04-01 00:10:00', '2017-04-01 00:15:00', '2017-04-01 00:20:00', '2017-04-01 00:25:00', '2017-04-01 00:30:00', '2017-04-01 00:35:00', '2017-04-01 00:40:00', '2017-04-01 00:45:00', ... '2018-04-29 23:15:00', '2018-04-29 23:20:00', '2018-04-29 23:25:00', '2018-04-29 23:30:00', '2018-04-29 23:35:00', '2018-04-29 23:40:00', '2018-04-29 23:45:00', '2018-04-29 23:50:00', '2018-04-29 23:55:00', '2018-04-30 00:00:00'], dtype='datetime64[ns]', length=113473, freq='5min')
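To see this integer representation directly, you can inspect the underlying value of a single timestamp; a quick sketch:
# A pandas Timestamp stores nanoseconds since the epoch as a 64-bit integer
ts = pd.Timestamp('2017-04-01')
print(ts.value)                # 1491004800000000000 nanoseconds since 1970-01-01
print(ts.value / 1e9 / 86400)  # 17257.0 days since the epoch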
1.2 Dealing with existing timestamps¶
Ideally, the tabular data we use already contain date and time information. In pandas, we can convert such a column to the datetime type.
Here we're going to use daily weather data from the Millbrook, NY site. We used this dataset last week.
df = pd.read_csv('Millbrook_NY_daily_weather.csv')
df.head()
LST_DATE | WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | ... | SOIL_MOISTURE_10_DAILY | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | ... | 0.233 | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN |
1 | 2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | ... | 0.227 | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN |
2 | 2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | ... | 0.223 | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN |
3 | 2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | ... | 0.220 | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN |
4 | 2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | ... | 0.213 | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN |
5 rows × 29 columns
df.columns
Index(['LST_DATE', 'WBANNO', 'CRX_VN', 'LONGITUDE', 'LATITUDE', 'T_DAILY_MAX', 'T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_AVG', 'P_DAILY_CALC', 'SOLARAD_DAILY', 'SUR_TEMP_DAILY_TYPE', 'SUR_TEMP_DAILY_MAX', 'SUR_TEMP_DAILY_MIN', 'SUR_TEMP_DAILY_AVG', 'RH_DAILY_MAX', 'RH_DAILY_MIN', 'RH_DAILY_AVG', 'SOIL_MOISTURE_5_DAILY', 'SOIL_MOISTURE_10_DAILY', 'SOIL_MOISTURE_20_DAILY', 'SOIL_MOISTURE_50_DAILY', 'SOIL_MOISTURE_100_DAILY', 'SOIL_TEMP_5_DAILY', 'SOIL_TEMP_10_DAILY', 'SOIL_TEMP_20_DAILY', 'SOIL_TEMP_50_DAILY', 'SOIL_TEMP_100_DAILY', 'Unnamed: 28'], dtype='object')
Which of the columns is likely to store the date and time information?
df['LST_DATE']
0 2016-01-01 1 2016-01-02 2 2016-01-03 3 2016-01-04 4 2016-01-05 ... 2552 2022-12-27 2553 2022-12-28 2554 2022-12-29 2555 2022-12-30 2556 2022-12-31 Name: LST_DATE, Length: 2557, dtype: object
While the values certainly resemble dates, they are stored in pandas with the "object" dtype, which means pandas does not recognize them as datetimes and cannot take advantage of any date-specific functionality.
Using the pd.to_datetime() function, we can convert this column to datetime objects:
pd.to_datetime(df['LST_DATE'])
0 2016-01-01 1 2016-01-02 2 2016-01-03 3 2016-01-04 4 2016-01-05 ... 2552 2022-12-27 2553 2022-12-28 2554 2022-12-29 2555 2022-12-30 2556 2022-12-31 Name: LST_DATE, Length: 2557, dtype: datetime64[ns]
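If your dates were stored in a less standard format, pd.to_datetime accepts a format argument telling it how to parse the strings; a hypothetical sketch (our LST_DATE values do not need this):
# Hypothetical strings in month/day/year format
pd.to_datetime(pd.Series(['01/31/2016', '02/29/2016']), format='%m/%d/%Y')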
# Convert the LST_DATE column to datetime objects.
df['LST_DATE'] = pd.to_datetime(df['LST_DATE'])
# Alternatively, we can parse LST_DATE as datetime objects by setting parse_dates when reading in the csv data:
df = pd.read_csv('Millbrook_NY_daily_weather.csv', parse_dates = ['LST_DATE'])
# Set the Date column as index:
df = df.set_index('LST_DATE')
# Now it's more intuitive to interpret the data:
df['T_DAILY_MEAN']
LST_DATE 2016-01-01 1.5 2016-01-02 -0.4 2016-01-03 1.6 2016-01-04 -6.9 2016-01-05 -10.3 ... 2022-12-27 -4.4 2022-12-28 0.7 2022-12-29 4.4 2022-12-30 10.7 2022-12-31 7.9 Name: T_DAILY_MEAN, Length: 2557, dtype: float64
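A DatetimeIndex also enables partial string indexing, so we can select records by year or month without spelling out exact dates; a small sketch:
# Select all daily records from June 2017
df['T_DAILY_MEAN'].loc['2017-06']
# Or select a whole year
df['T_DAILY_MEAN'].loc['2017']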
1.3 Plotting time series in Pandas¶
After setting the date as the index, it's very simple to make time series plots in pandas. You don't even need to set the x-axis label: pandas automatically chooses the date tick frequency that best displays the date information.
df['T_DAILY_MEAN'].plot()
<Axes: xlabel='LST_DATE'>
Since we have multiple years of data, Pandas will automatically choose to show the year only.
Exercise 1: Make a time series plot of daily maximum and minimum temperature. You should see two lines.¶
1.4 Access datetime attributes¶
Now that we have a DatetimeIndex, we can access specific attributes of the datetime objects like the year, day, hour, etc. To do this, we access the desired attribute using dot notation: df.index.attribute. For a full list of attributes, see the pd.DatetimeIndex documentation. For example:
# Get the month of each record
df.index.month
Index([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 12, 12, 12, 12, 12, 12, 12, 12, 12, 12], dtype='int32', name='LST_DATE', length=2557)
# Get the year of each record, and assign this to a new column
df['year'] = df.index.year
df
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2016-01-01 | 64756 | 2.422 | -73.74 | 41.79 | 3.4 | -0.5 | 1.5 | 1.3 | 0.0 | 1.69 | ... | 0.204 | 0.155 | 0.147 | 4.2 | 4.4 | 5.1 | 6.0 | 7.6 | NaN | 2016 |
2016-01-02 | 64756 | 2.422 | -73.74 | 41.79 | 2.9 | -3.6 | -0.4 | -0.3 | 0.0 | 6.25 | ... | 0.199 | 0.152 | 0.144 | 2.8 | 3.1 | 4.2 | 5.7 | 7.4 | NaN | 2016 |
2016-01-03 | 64756 | 2.422 | -73.74 | 41.79 | 5.1 | -1.8 | 1.6 | 1.1 | 0.0 | 5.69 | ... | 0.196 | 0.151 | 0.141 | 2.6 | 2.8 | 3.8 | 5.2 | 7.2 | NaN | 2016 |
2016-01-04 | 64756 | 2.422 | -73.74 | 41.79 | 0.5 | -14.4 | -6.9 | -7.5 | 0.0 | 9.17 | ... | 0.194 | 0.148 | 0.139 | 1.7 | 2.1 | 3.4 | 4.9 | 6.9 | NaN | 2016 |
2016-01-05 | 64756 | 2.422 | -73.74 | 41.79 | -5.2 | -15.5 | -10.3 | -11.7 | 0.0 | 9.34 | ... | 0.191 | 0.148 | 0.138 | 0.4 | 0.9 | 2.4 | 4.3 | 6.6 | NaN | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2022-12-27 | 64756 | 2.622 | -73.74 | 41.79 | -0.8 | -8.0 | -4.4 | -3.8 | 0.0 | 4.00 | ... | NaN | 0.164 | 0.157 | -0.4 | -0.2 | 0.5 | 2.2 | 4.0 | NaN | 2022 |
2022-12-28 | 64756 | 2.622 | -73.74 | 41.79 | 7.4 | -6.1 | 0.7 | 1.3 | 0.0 | 7.73 | ... | NaN | 0.162 | 0.156 | -0.4 | -0.3 | 0.4 | 2.1 | 3.8 | NaN | 2022 |
2022-12-29 | 64756 | 2.622 | -73.74 | 41.79 | 10.7 | -1.8 | 4.4 | 5.0 | 0.0 | 6.66 | ... | NaN | 0.159 | 0.155 | -0.3 | -0.3 | 0.3 | 1.9 | 3.7 | NaN | 2022 |
2022-12-30 | 64756 | 2.622 | -73.74 | 41.79 | 16.6 | 4.9 | 10.7 | 10.3 | 0.0 | 5.39 | ... | NaN | 0.159 | 0.154 | -0.2 | -0.2 | 0.3 | 1.8 | 3.6 | NaN | 2022 |
2022-12-31 | 64756 | 2.622 | -73.74 | 41.79 | 13.2 | 2.7 | 7.9 | 10.2 | 5.0 | 1.25 | ... | NaN | 0.160 | 0.153 | -0.1 | -0.2 | 0.3 | 1.8 | 3.4 | NaN | 2022 |
2557 rows × 29 columns
# Get the unique year values
df.index.year.unique()
Index([2016, 2017, 2018, 2019, 2020, 2021, 2022], dtype='int32', name='LST_DATE')
# Sometimes you may want to know the day of year:
df.index.dayofyear
Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... 356, 357, 358, 359, 360, 361, 362, 363, 364, 365], dtype='int32', name='LST_DATE', length=2557)
# You can also directly get the day of week (0=Monday, ..., 6=Sunday), which would be difficult to compute on our own.
df.index.dayofweek
Index([4, 5, 6, 0, 1, 2, 3, 4, 5, 6, ... 3, 4, 5, 6, 0, 1, 2, 3, 4, 5], dtype='int32', name='LST_DATE', length=2557)
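These attributes are also handy for filtering records. For example, a small sketch selecting only the summer months (June through August):
# Build a boolean mask from the month attribute
summer = df[df.index.month.isin([6, 7, 8])]
summer['T_DAILY_MEAN'].plot()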
Exercise 2: Create a column in df that represents the month of the date.¶
2. Temporal resampling¶
Another common operation we apply to time series is to change the resolution of a dataset by resampling in time. Pandas exposes this through the resample function. The resample periods are specified using pandas offset alias syntax (e.g., 'D' for daily, 'ME' for month end, 'YE' for year end).
2.1 Resample function¶
Below we resample the dataframe by taking the mean over each month.
# First, we need to remove the column that does not have numerical values.
df = df.drop(columns = 'SUR_TEMP_DAILY_TYPE')
df.resample('ME').mean().plot(y='T_DAILY_MEAN', marker='o')
<Axes: xlabel='LST_DATE'>
Resampling can be applied to the entire dataframe.
df_mm = df.resample('ME').mean()
df_mm[['T_DAILY_MIN', 'T_DAILY_MEAN', 'T_DAILY_MAX']].plot()
<Axes: xlabel='LST_DATE'>
Just like with groupby, we can apply any aggregation function to our resample operation.
# Resample by maximum values.
df.resample('ME').max().plot(y='T_DAILY_MAX', marker='o')
<Axes: xlabel='LST_DATE'>
2.2 Rolling Operations¶
Rolling mean (or moving average) computes the average of data points in a sliding window over the series. It's typically used to smooth out short-term fluctuations and highlight longer-term trends.
# Set the window size to 30 observations. center=True means the window is centered at the given index. For example, on 10/15/2022, it will return the mean from 10/01 to 10/30.
df['T_DAILY_MEAN'].rolling(30, center=True).mean().plot()
<Axes: xlabel='LST_DATE'>
You may notice that there are some breaks in the time series. If there are missing values in the rolling window, pandas will return NaN for the rolling mean.
# We find 12 observations with missing values for T_DAILY_MEAN.
df.loc[(df.T_DAILY_MEAN.isnull())]
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
2017-10-04 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2017 |
2018-10-13 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | 4.6 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2018 |
2019-08-10 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-11 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-12 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-13 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-14 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-15 | 64756 | -9.000 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-08-16 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2019-09-25 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 |
2021-01-04 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
2021-11-16 | 64756 | 2.622 | -73.74 | 41.79 | NaN | NaN | NaN | NaN | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2021 |
12 rows × 28 columns
Instead of setting the number of observations, we can set the time span of each window. Each window will then have a variable size, based on the observations that fall within the time period.
# Note that days with missing values will be skipped.
df['T_DAILY_MEAN'].rolling('30D', center=True).mean().plot()
<Axes: xlabel='LST_DATE'>
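If you would rather tolerate a few missing days inside a window, rolling also accepts a min_periods argument: the mean is computed whenever at least that many valid observations fall in the window. A sketch with an arbitrary threshold of 25:
# Compute the 30-day rolling mean whenever at least 25 of the 30 values are valid
df['T_DAILY_MEAN'].rolling(30, center=True, min_periods=25).mean().plot()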
Exercise 3: Calculate and plot the 10-day rolling mean of daily maximum temperature T_DAILY_MAX.¶
2.3 Filling missing data¶
Sometimes you may want to fill in the missing values. There are several ways to fill missing data in Pandas.
# Let's first extract a subset of the data
df_sub = df.loc['2017-01-01':'2017-12-31']
# Note there are data breaks in February.
df_sub['SOIL_MOISTURE_10_DAILY'].plot()
<Axes: xlabel='LST_DATE'>
# Fill values forward using ffill function
df_sub.ffill()['SOIL_MOISTURE_10_DAILY'].plot(label = 'Forward')
# Fill values backward using bfill function
df_sub.bfill()['SOIL_MOISTURE_10_DAILY'].plot(label = 'Backward')
# Using interpolate function:
df_sub['SOIL_MOISTURE_10_DAILY'].interpolate('linear').plot(label = 'Linear Interpolation')
plt.legend()
<matplotlib.legend.Legend at 0x3318f06d0>
Exercise 4: Based on the above figure, explain the difference among ffill, bfill, and interpolate.¶
3. Analyze the seasonality and cyclic patterns of time series¶
Recall the four major components of time series: seasonality, cycle, trend, and variations. Many environmental datasets we deal with (e.g., temperature, precipitation, air quality) vary seasonally. A common way to analyze such data is to create a "climatology," which contains the average values in each month or day of the year. We can do this easily with groupby. Recall that df.index is a pandas DatetimeIndex object.
monthly_climatology = df.groupby(df.index.month).mean(numeric_only=True)
monthly_climatology
WBANNO | CRX_VN | LONGITUDE | LATITUDE | T_DAILY_MAX | T_DAILY_MIN | T_DAILY_MEAN | T_DAILY_AVG | P_DAILY_CALC | SOLARAD_DAILY | ... | SOIL_MOISTURE_20_DAILY | SOIL_MOISTURE_50_DAILY | SOIL_MOISTURE_100_DAILY | SOIL_TEMP_5_DAILY | SOIL_TEMP_10_DAILY | SOIL_TEMP_20_DAILY | SOIL_TEMP_50_DAILY | SOIL_TEMP_100_DAILY | Unnamed: 28 | year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LST_DATE | |||||||||||||||||||||
1 | 64756.0 | 2.564857 | -73.74 | 41.79 | 2.200000 | -7.550463 | -2.678241 | -2.442130 | 2.217130 | 5.655139 | ... | 0.230433 | 0.164727 | 0.166809 | 0.393056 | 0.472685 | 0.963889 | 1.983796 | 3.371759 | NaN | 2019.000000 |
2 | 64756.0 | 2.564424 | -73.74 | 41.79 | 4.866162 | -5.901010 | -0.519192 | -0.220202 | 3.449495 | 8.359444 | ... | 0.219145 | 0.163520 | 0.165425 | 0.557071 | 0.518687 | 0.707071 | 1.256566 | 2.156061 | NaN | 2018.989899 |
3 | 64756.0 | 2.564857 | -73.74 | 41.79 | 9.015668 | -2.634101 | 3.184793 | 3.370046 | 2.426728 | 12.813917 | ... | 0.224130 | 0.168618 | 0.166180 | 3.385714 | 3.270507 | 3.157604 | 3.182488 | 3.337788 | NaN | 2019.000000 |
4 | 64756.0 | 2.564857 | -73.74 | 41.79 | 14.930476 | 1.948095 | 8.438095 | 8.733810 | 3.217143 | 14.977381 | ... | 0.235376 | 0.165395 | 0.165433 | 9.685714 | 9.460476 | 8.950000 | 8.119524 | 7.199524 | NaN | 2019.000000 |
5 | 64756.0 | 2.564857 | -73.74 | 41.79 | 20.996774 | 8.094009 | 14.548848 | 14.772811 | 3.182949 | 17.912673 | ... | 0.221042 | 0.156848 | 0.161083 | 16.721198 | 16.471889 | 15.606019 | 14.184793 | 12.509217 | NaN | 2019.000000 |
6 | 64756.0 | 2.564857 | -73.74 | 41.79 | 25.874286 | 12.033810 | 18.952857 | 19.285714 | 2.290000 | 21.610095 | ... | 0.162114 | 0.136386 | 0.153190 | 22.399524 | 22.177619 | 21.112381 | 19.530476 | 17.661429 | NaN | 2019.000000 |
7 | 64756.0 | 2.564857 | -73.74 | 41.79 | 29.040092 | 15.894470 | 22.465899 | 22.322120 | 3.880184 | 20.864101 | ... | 0.113350 | 0.118535 | 0.139206 | 25.527189 | 25.406912 | 24.316204 | 22.814286 | 21.129630 | NaN | 2019.000000 |
8 | 64756.0 | 2.297069 | -73.74 | 41.79 | 28.139048 | 15.543810 | 21.838571 | 21.615714 | 3.969048 | 18.131429 | ... | 0.128214 | 0.122652 | 0.136627 | 24.912857 | 24.918095 | 24.244762 | 23.358571 | 22.286190 | NaN | 2019.000000 |
9 | 64756.0 | 2.564857 | -73.74 | 41.79 | 23.720096 | 11.202871 | 17.460287 | 17.344019 | 3.948804 | 14.033301 | ... | 0.151890 | 0.127550 | 0.141841 | 20.640191 | 20.736364 | 20.594258 | 20.666986 | 20.554067 | NaN | 2019.000000 |
10 | 64756.0 | 2.590664 | -73.74 | 41.79 | 17.773488 | 5.738140 | 11.753023 | 11.740465 | 3.637963 | 9.193674 | ... | 0.187633 | 0.141386 | 0.144750 | 14.663256 | 14.799070 | 15.074419 | 15.892558 | 16.645581 | NaN | 2019.000000 |
11 | 64756.0 | 2.593429 | -73.74 | 41.79 | 10.377512 | -1.277990 | 4.547368 | 4.720574 | 3.301905 | 6.567895 | ... | 0.237167 | 0.166742 | 0.161895 | 7.128708 | 7.302392 | 7.955288 | 9.385167 | 10.990909 | NaN | 2019.000000 |
12 | 64756.0 | 2.593429 | -73.74 | 41.79 | 4.488018 | -4.917972 | -0.217512 | 0.055760 | 3.121198 | 4.639954 | ... | 0.242934 | 0.171530 | 0.170230 | 2.254839 | 2.385714 | 2.946083 | 4.288479 | 5.912442 | NaN | 2019.000000 |
12 rows × 28 columns
Each row in this new dataframe represents the average values for that month (1=January, 2=February, etc.)
We can apply more customized aggregations, as with any groupby operation. Below we keep the mean of the mean, max of the max, and min of the min for the temperature measurements.
monthly_T_climatology = df.groupby(df.index.month).aggregate({'T_DAILY_MEAN': 'mean',
'T_DAILY_MAX': 'max',
'T_DAILY_MIN': 'min'})
monthly_T_climatology
T_DAILY_MEAN | T_DAILY_MAX | T_DAILY_MIN | |
---|---|---|---|
LST_DATE | |||
1 | -2.678241 | 19.8 | -26.0 |
2 | -0.519192 | 24.9 | -24.7 |
3 | 3.184793 | 26.8 | -17.4 |
4 | 8.438095 | 30.6 | -11.3 |
5 | 14.548848 | 33.4 | -3.1 |
6 | 18.952857 | 34.5 | 1.5 |
7 | 22.465899 | 36.2 | 8.2 |
8 | 21.838571 | 36.5 | 6.0 |
9 | 17.460287 | 32.7 | -1.6 |
10 | 11.753023 | 29.9 | -5.9 |
11 | 4.547368 | 24.4 | -15.9 |
12 | -0.217512 | 17.9 | -21.8 |
monthly_T_climatology.plot(marker='o')
<Axes: xlabel='LST_DATE'>
If we want a climatology at a finer scale, we can group by day of year.
daily_T_climatology = df.groupby(df.index.dayofyear).aggregate({'T_DAILY_MEAN': 'mean',
'T_DAILY_MAX': 'max',
'T_DAILY_MIN': 'min'})
daily_T_climatology.plot(marker='.')
<Axes: xlabel='LST_DATE'>
Exercise 5: Calculate and plot the seasonal cycle of solar radiation SOLARAD_DAILY at the monthly scale.¶
4. Analyze variations or anomalies¶
A common mode of time series analysis is to remove the seasonality or cyclic patterns from a signal to focus only on the variations or anomaly values. This can be accomplished with a groupby transformation.
# Standardize each value relative to its calendar month:
# subtract the monthly mean and divide by the monthly standard deviation.
def standardize(x):
    return (x - x.mean())/x.std()
anomaly = df.groupby(df.index.month).transform(standardize)
anomaly.plot(y='T_DAILY_MEAN')
<Axes: xlabel='LST_DATE'>
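As a quick sanity check, the standardized anomalies should have a mean of essentially zero (up to floating-point error) and a standard deviation of one within each month:
# Monthly means of the anomalies should all be ~0 after standardization
anomaly.groupby(anomaly.index.month)['T_DAILY_MEAN'].mean()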
5. Trend Analysis¶
We learned about linear regression last week. We can similarly apply linear regression to time series data to fit a linear trend.
import statsmodels.api as sm
To focus on the trend component, we can first resample the data to the annual scale, so that the other components of the time series (seasonality and variations) do not affect the trend estimate. Note that we introduce only the most basic trend analysis here; see Venier et al. (2012) for more advanced trend analysis models used in environmental science research.
df_year = df.resample('YE').mean()
df_year.index
DatetimeIndex(['2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31', '2022-12-31'], dtype='datetime64[ns]', name='LST_DATE', freq='YE-DEC')
statsmodels does not recognize datetime objects, so we need to convert the dates into a numerical variable, such as the year.
# Define the independent (X) and dependent (y) variables
X = df_year.index.year  # Independent variable: the year as an integer
y = df_year['SOIL_MOISTURE_5_DAILY'] # Dependent variable
# Add a constant (intercept) to X
X_with_const = sm.add_constant(X)
# Note that the data have missing values. To skip them, we need to set missing='drop'.
model = sm.OLS(y, X_with_const, missing = 'drop').fit()
pred_y = model.predict(X_with_const)
model.summary()
/Users/xjin/miniforge3/envs/esa_env/lib/python3.9/site-packages/statsmodels/stats/stattools.py:74: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given. warn("omni_normtest is not valid with less than 8 observations; %i "
Dep. Variable: | SOIL_MOISTURE_5_DAILY | R-squared: | 0.379 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.255 |
Method: | Least Squares | F-statistic: | 3.056 |
Date: | Sun, 06 Oct 2024 | Prob (F-statistic): | 0.141 |
Time: | 15:32:06 | Log-Likelihood: | 16.947 |
No. Observations: | 7 | AIC: | -29.89 |
Df Residuals: | 5 | BIC: | -30.00 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -16.7584 | 9.704 | -1.727 | 0.145 | -41.703 | 8.186 |
x1 | 0.0084 | 0.005 | 1.748 | 0.141 | -0.004 | 0.021 |
Omnibus: | nan | Durbin-Watson: | 2.243 |
---|---|---|---|
Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.522 |
Skew: | 0.306 | Prob(JB): | 0.770 |
Kurtosis: | 1.811 | Cond. No. | 2.04e+06 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.04e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
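Rather than reading the summary table, you can also pull the fitted coefficients and p-values directly from the results object (params and pvalues are standard statsmodels result attributes):
# First element is the intercept, second is the slope (trend per year)
print(model.params)
print(model.pvalues)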
plt.plot(df_year.index,pred_y)
plt.plot(df_year.index,y)
[<matplotlib.lines.Line2D at 0x3366977f0>]
Exercise 6: In the above example, what is the long-term trend of SOIL_MOISTURE_5_DAILY? Is the trend statistically significant (p value < 0.05)?¶
6. Assignment: Conduct time series analysis for a dataset you've found.¶
If the dataset you used for correlation analysis happens to have a temporal dimension, it should be a good dataset to use here. That way, you can continue building this assignment into your final paper.
If you have problems finding a good dataset, you could use the ozone measurements we used last week (Millbrook_NY_daily_ozone_2022.csv).
6.1 Find the column that has date or time information. Read in the file using pandas, setting parse_dates to the column(s) containing the date and time.¶
6.2 Make a time series plot of one of the measurement columns.¶
6.3 Create two new columns year and month to represent the year and month of the data.¶
6.4 Calculate and plot the monthly climatology of the measurements.¶
6.5 Plot the standardized measurements with the seasonal cycle removed.¶
6.6 Calculate the long-term trend of the data. Report the trend and statistical significance (p value) of the trend.¶