Prediction for time series data using RNN with GRU layers
Compared to densely connected (feed-forward) neural networks, recurrent neural networks (RNNs) are better
suited to modeling sequence data (such as text and time series) because they maintain an internal
loop over the sequence elements. This project discusses how to use an RNN with gated recurrent unit (GRU)
layers to model real-time PM2.5 pollution data. The goal is to develop some guidelines and insights
on how to apply deep learning methods to high resolution time series data.
Models are built using Keras.
Understand the data using Exploratory Data Analysis (EDA)
The data used in this project is the Beijing PM2.5 dataset, which contains hourly PM2.5 readings and other
information from 2010 to 2014 (5 years). The features are numerical and categorical, and include:
dew point, temperature, pressure, wind direction, wind speed, and cumulative hours of snow and rain.
Two time series plots of the dataset are shown below. The first plot shows PM2.5 over the 5 years and the
second plot shows annual PM2.5 (different years' data overlaid on top of each other).
Some of the observations are:
- Overall trend - PM2.5 seems to show a slowly decreasing trend over the years
- Seasonality - PM2.5 seems to be the highest in spring, lowest in summer/fall,
and medium in winter
The correlation matrix for the numerical features as well as the distribution of PM2.5 are shown below.
- Correlation matrix - Overall the features are not strongly correlated,
and the number of features is small, so no dimension reduction has been applied to the features
- Distribution plot - the target has a fat tail (which makes prediction more challenging as
there can be many outliers/noise outside the general pattern)
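The EDA above can be reproduced with a short pandas sketch along the following lines (the file name and column names are assumptions based on the public UCI Beijing PM2.5 dataset; adjust them to the actual file):

```python
# A minimal EDA sketch, assuming the UCI Beijing PM2.5 CSV layout
# (column names such as 'pm2.5', 'year', 'month', 'day', 'hour' are assumptions).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
df.index = pd.to_datetime(df[["year", "month", "day", "hour"]])

# Time series plot of daily-average PM2.5 over the 5 years
df["pm2.5"].resample("D").mean().plot(title="Daily average PM2.5")

# Correlation matrix of the numerical features and distribution of the target
num_cols = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]
print(df[num_cols].corr())
df["pm2.5"].hist(bins=100)
plt.show()
```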
Three Baseline Models & Predictions
Before building RNN models, three baseline (or alternative) models are built for prediction. This step
is important, as the final RNN model can only be accepted if it beats the performance of these cheaper and
simpler alternatives. For this problem, the following three types of models are used as baselines:
1. Seasonal ARIMA model - classical linear time series model for seasonal data
2. Gradient Boosting - ensemble tree methods (popular for structured data)
3. Densely Connected Network
1. Seasonal ARIMA model
The time series model uses only the target (PM2.5) and does not use the feature data (so this is
not a supervised learning problem). From the ETS decomposition plot, the PM2.5 data appears to be stationary
(this agrees with the ADF test result - a very small p-value) and has strong seasonal patterns.
The prediction result using a fitted ARIMA(2,0,3)(0,1,0) model is shown above. Note that the SARIMA model is
fitted on the daily average PM2.5 data, since seasonal ARIMA models are generally designed for shorter
seasonal periods (such as 12 for monthly data) due to estimation challenges.
Here, it can be seen that the SARIMA model actually does a decent job in capturing the overall pattern.
This shows that the PM2.5 data has a strong global ordering pattern.
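For reference, the SARIMA fit described above can be sketched with statsmodels roughly as follows (the series name `daily_pm25` and the seasonal period are assumptions for illustration; the non-seasonal order matches the text):

```python
# A minimal SARIMA sketch, assuming the daily-average PM2.5 series is stored in
# a pandas Series called daily_pm25; the seasonal period of 7 is illustrative.
import statsmodels.api as sm

sarima = sm.tsa.SARIMAX(
    daily_pm25,
    order=(2, 0, 3),              # non-seasonal (p, d, q) from the text
    seasonal_order=(0, 1, 0, 7),  # seasonal (P, D, Q, s); s is an assumption
)
result = sarima.fit(disp=False)

# Forecast with a 95% prediction interval (advantage #1 in the remarks below)
forecast = result.get_forecast(steps=30)
pred_mean = forecast.predicted_mean
pred_ci = forecast.conf_int(alpha=0.05)
```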
Remarks:
- Advantage #1: Compared to machine learning algorithms, time series models such as seasonal ARIMA
can also provide prediction intervals to quantify prediction uncertainty.
- Advantage #2: Fitting a time series model such as seasonal ARIMA is computationally efficient
(compared to deep learning algorithms), as the number of estimated parameters tends to be much
smaller than in neural networks.
- Disadvantage #1 : Time series models such as seasonal ARIMA usually cannot handle very long seasonal
periods (high resolution data) due to estimation and memory challenges. Therefore, in this case, it
is not feasible to make hourly predictions with this model. Some alternatives do exist, for example
using a Fourier series approach to model the seasonal component.
- Disadvantage #2 : Most time series models such as seasonal ARIMA (which is a linear model and
can only model stationary data) can in general only capture simple structures.
- Disadvantage for this problem : Time series models are not designed for supervised learning problems,
and therefore in this case none of the features are used (which discards a lot of information).
2. Gradient Boosting Machine (GBM)
Gradient boosting is a very popular machine learning algorithm for structured data. It uses an ensemble
of successive weak learners (trees), where each tree learns from the previous ones and gradually
improves the performance (in contrast to Random Forest, where an ensemble of independent trees is
built). In practice, GBM is much easier to tune than neural networks (fewer hyperparameters to
tune) and is more computationally efficient. In this case, only two parameters are tuned: the number of trees
and the learning rate.
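A minimal sketch of such a GBM baseline with scikit-learn could look like the following (the arrays `X_train`, `y_train`, and `X_test` are assumed to already be built from the hourly features and target):

```python
# A minimal GBM sketch; only the number of trees and the learning rate are tuned.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    "n_estimators": [100, 300, 500],     # number of trees
    "learning_rate": [0.01, 0.05, 0.1],  # shrinkage applied to each tree
}
search = GridSearchCV(
    GradientBoostingRegressor(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),      # respect temporal ordering when validating
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)             # X_train / y_train are assumed to exist
y_pred = search.best_estimator_.predict(X_test)
```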
The prediction result using a GBM is shown above. Note that the GBM makes hourly predictions (in the graph
the hourly predicted values are averaged to obtain the daily predictions). It can be seen that the
prediction accuracy is better than that of the seasonal ARIMA model (the MAE is much lower). The MAE is still high,
but given that the PM2.5 data is pretty noisy, the performance is acceptable (also, not a lot of effort
was spent on tuning the hyperparameters due to time constraints).
Remarks:
- Advantage #1: For GBM (and other tree-based methods), data pre-processing (such as scaling features
or encoding categorical features) is usually not needed, although feature engineering can sometimes
still improve performance. In this case, the features are normalized anyway (as they will later be used
for other models).
- Advantage #2: GBM (and also other tree-based methods) can handle missing values (no imputation is
needed). However, sometimes it is still a good idea to impute missing values before building the
model.
- Advantage #3: Compared to deep learning algorithms, GBM (and also other tree-based methods)
are more interpretable by using tools such as variable importance, partial dependence plots,
LIME (although not as natural as statistical models such as GLMs).
- Disadvantage for this problem : GBMs use all the features in this supervised learning problem, but
have no capability to exploit the global ordering pattern (they do not recognize that the data is a sequence).
3. Densely Connected Networks
Before actually building an RNN, a simple DNN model (with a single hidden layer) is built to see how the
model performs when the sequence structure of the data is ignored.
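A minimal Keras sketch of this baseline might look like the following (the lookback length, sampling step, and feature count are placeholders consistent with the setup described in the remarks below):

```python
# A minimal densely connected baseline: flatten the lookback window and
# ignore the temporal ordering. Shapes are assumptions for illustration.
from tensorflow import keras
from tensorflow.keras import layers

lookback_steps = (90 * 24) // 6   # ~3-month lookback sampled every 6 hours (assumed)
num_features = 8                  # number of input features (assumed)

model = keras.Sequential([
    layers.Flatten(input_shape=(lookback_steps, num_features)),
    layers.Dense(32, activation="relu"),  # single small hidden layer
    layers.Dense(1),                      # predicted PM2.5
])
model.compile(optimizer="rmsprop", loss="mae")
```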
One thing to note is that the model shows significant overfitting (the validation errors stay flat and
well above the training errors). This possibly indicates that the hypothesis class is not rich enough
to fit the data (possibly because the model is too simple and shallow). The prediction performance is
better than the seasonal ARIMA model but not as good as the Gradient Boosting machine.
Remarks:
- Because the dataset is large and dense, data generator functions (for the training, validation, and
test sets) are used to improve computational efficiency. Here, a lookback window of 3 months is used
(as suggested by the seasonal patterns) and observations are sampled every 6 steps (6 hours); a sketch
of such a generator is shown after this list. Here are some of the advantages of using generator functions:
- Generator functions allow declaring a function that behaves like an iterator, which can be used in a
for loop
- Generators use lazy evaluation (data is produced on the fly rather than waiting for all data
to be ready), which results in lower memory usage
- Generators require simpler and more compact code than hand-written iterators
- A small batch size of 128 samples is used for fitting the model. In general,
samples seen per epoch = batch size * steps per epoch. A small batch size can result in slower computation
but often improves accuracy.
- Large batches: very few epochs are needed (each step moves steadily toward the local minimum), so
convergence can be very fast. However, training can sometimes get stuck in local optima, causing much lower
accuracy.
- Small batches: more epochs are needed (updates oscillate toward the local minimum) and convergence can be slow.
However, small batches can help jump out of local optima and get closer to the global optimum, and therefore
usually give higher accuracy.
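A sketch of such a generator (patterned on the lookback/step/batch-size setup described above; the target is assumed to sit in column 0 of the normalized data array) could look like this:

```python
# A minimal data-generator sketch: yields (samples, targets) batches with a
# given lookback window, sampling step, and prediction delay.
import numpy as np

def generator(data, lookback, delay, min_index, max_index,
              batch_size=128, step=6):
    i = min_index + lookback
    while True:
        if i + batch_size >= max_index:
            i = min_index + lookback                 # wrap around at the end
        rows = np.arange(i, min(i + batch_size, max_index))
        i += len(rows)
        samples = np.zeros((len(rows), lookback // step, data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(row - lookback, row, step)
            samples[j] = data[indices]
            targets[j] = data[row + delay][0]        # target assumed in column 0
        yield samples, targets

# Example usage (index boundaries are placeholders; keep max_index <= len(data) - delay):
# train_gen = generator(data, lookback=90 * 24, delay=24, min_index=0,
#                       max_index=30000, batch_size=128, step=6)
```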
Simple RNN
GRU vs LSTM
GRUs are still relatively new compared to the well-known LSTM (Long Short-Term Memory) for RNNs. They have
a similar mechanism to LSTM but fewer gates, and are therefore more computationally efficient
(especially for long sequences, where training RNNs can be very expensive). It is believed that in many
applications GRUs have comparable performance to LSTM (although in theory the gain in computational
efficiency likely comes with less representational power). For this task, because training time is a major
constraint, GRU layers are used instead of LSTM.
The first step toward the final RNN model is to start with a simple RNN model (a single small GRU
layer - same size as the DNN model; a minimal sketch is shown after the list below). Compared to the DNN model,
the prediction accuracy improves significantly but is still not as good as the Gradient Boosting machine. Two things can be concluded:
- From the training vs validation error plot, it is clear that there is significant overfitting -
this means regularization needs to be applied to the model.
- Compared to the training error, the validation errors seem to have much bigger fluctuations. This
is pretty common for validation errors; possible solutions include tuning the learning rate
or choosing a different optimizer.
- Another observation is that the test error is much higher than the validation error. This usually
implies that the test data may have a different distribution (and thus pattern) compared to the
validation data. In this case, we know the PM2.5 data has an overall trend so it is possible that
the training, validation, and test data could have different distributions.
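A minimal sketch of the simple GRU model (the generator objects and step counts are assumed to come from the setup above):

```python
# A minimal single-layer GRU sketch; the layer size matches the dense baseline.
from tensorflow import keras
from tensorflow.keras import layers

num_features = 8  # number of input features (assumed)

model = keras.Sequential([
    layers.GRU(32, input_shape=(None, num_features)),  # single small GRU layer
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mae")

# train_gen / val_gen / val_steps are assumed to come from the generator setup
history = model.fit(train_gen, steps_per_epoch=500, epochs=20,
                    validation_data=val_gen, validation_steps=val_steps)
```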
Simple RNN - Regularized
In general, overfitting can be reduced by the following methods:
- Get more training data (this is the most effective way but is often not possible)
- Make the model simpler (for example, reduce the size and number of hidden layers)
- Weight (parameter) regularization (for example, L1 or L2 regularization, similar to linear models)
- Adding dropout - randomly set a fraction of output features to zero (for an RNN, apply the same
dropout mask at every time step, but possibly with different dropout rates for the input units and
recurrent units)

Here, the dropout method is used (with a rate of 0.2) for both the input units and the recurrent units.
It can be seen that the validation scores are now more consistent with the training scores (although
they still fluctuate quite a bit). Also, it is not surprising to see that the prediction performance of the
regularized version is very similar to the unregularized version - regularization in general cannot improve
the accuracy of the model.
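In Keras, this amounts to setting the `dropout` (input units) and `recurrent_dropout` (recurrent units) arguments on the GRU layer, roughly as follows (a minimal sketch under the same assumed shapes as before):

```python
# A minimal regularized GRU sketch: dropout on input units and recurrent units.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.GRU(32,
               dropout=0.2,                          # dropout on the input units
               recurrent_dropout=0.2,                # dropout on the recurrent units
               input_shape=(None, num_features)),    # num_features assumed as before
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mae")
```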
Now that the RNN is no longer overfitting but seems to have reached a performance bottleneck, the next
step is to increase the capacity of the model.
More Complex RNN
To increase the capacity of the model, a more complex RNN is used (two stacked layers, each of
larger size). Once again, dropout regularization is used to reduce overfitting.
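A sketch of the stacked model (the layer sizes are illustrative; the first GRU layer must return full sequences so the second can consume them):

```python
# A minimal stacked-GRU sketch with dropout on both layers.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.GRU(64, dropout=0.2, recurrent_dropout=0.2,
               return_sequences=True,                # pass the full sequence to the next GRU
               input_shape=(None, num_features)),    # num_features assumed as before
    layers.GRU(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mae")
```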

By increasing the capacity of the model, the prediction accuracy has improved significantly (and
the model no longer seems to be overfitting significantly).
From the plot, it can be seen that the model is quite successful in capturing most of the patterns (even
many of the complicated ones).
Remarks:
- It should be noted that the stacked RNN model is much more expensive to train than the simpler RNN models,
as the number of trainable parameters has increased dramatically.
- Further steps could include: increasing the capacity of the model again (more layers and larger
layer sizes), tuning the learning rate, trying other optimizers, increasing the number of
steps per epoch, etc.
Combining CNN and RNN together
One of the newer approaches for processing long sequences is to combine a 1D ConvNet with an RNN, using the
1D CNN as a preprocessing step before the RNN. The idea is to convert the long input sequence into a
shorter sequence of higher-level features, which are then used as input to the RNN. The main
advantage of this approach is that it provides a computationally efficient alternative to RNNs, as 1D CNNs are
much cheaper.
Here, the model stacks 1D convolution layers and 1D max pooling layers before a two-layer RNN (same size
as the previous model).
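A sketch of the combined architecture (the filter counts, kernel sizes, and pooling sizes are assumptions for illustration):

```python
# A minimal 1D CNN + GRU sketch: convolution and max pooling shorten the
# sequence before the recurrent layers process it.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv1D(32, 5, activation="relu",
                  input_shape=(None, num_features)),  # num_features assumed as before
    layers.MaxPooling1D(3),                           # downsample the sequence
    layers.Conv1D(32, 5, activation="relu"),
    layers.MaxPooling1D(3),
    layers.GRU(64, dropout=0.2, recurrent_dropout=0.2,
               return_sequences=True),
    layers.GRU(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1),
])
model.compile(optimizer="rmsprop", loss="mae")
```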

As seen from the results, the 1D CNN - RNN model does not perform as well as the two-layer RNN (possibly
because some information is lost when the CNN downsamples the sequence into high-level features), but it
still performs better than the simpler RNNs and the baseline models. It should be noted that the training
time for this model is much shorter than for the similarly sized RNN model - which makes it a good,
computationally efficient alternative to complex RNNs on long sequences (at the cost of some prediction accuracy).
Remarks:
- One of the reasons the 1D CNN - RNN model does not perform as well here as in some other applications
(such as sentiment analysis) is that the PM2.5 values have strong, long-period global ordering
patterns. Because 1D CNNs make no effort to capture temporal order and instead look for local
patterns regardless of where they occur in the sequence, the temporal information is likely to be
processed less successfully than with an RNN alone.
Summary
The goal of this exercise is to understand various types of prediction models for high resolution time
series data. Models used include seasonal ARIMA, Gradient Boosting Machine, Densely Connected Network,
Recurrent Neural Networks, as well as 1D Convolutional Neural Network combined
with Recurrent Neural Networks.
Although for this particular problem RNNs seem to be the best choice, the alternative models do have their
own advantages, such as computational efficiency and uncertainty estimation. Therefore, choosing the best
model in many applications usually depends not only on prediction accuracy, but also on
other considerations.
Last updated on Dec 1, 2019