Prediction for time series data using RNN with GRU layers


Compared to densely connected (feed-forward) neural networks, recurrent neural networks (RNNs) are better at modeling sequence data (such as text and time series) because they maintain an internal loop over the sequence elements. This project discusses how to use an RNN with gated recurrent unit (GRU) layers to model real-time PM2.5 pollution data. The goal is to develop guidelines and insights on how to apply deep learning methods to high-resolution time series data. Models are built using Keras.


Understand the data using Exploratory Data Analysis (EDA)

The data used in this project is the Beijing PM2.5 dataset, which contains hourly PM2.5 readings and related measurements from 2010 to 2014 (5 years). The features are both numerical and categorical and include dew point, temperature, pressure, wind direction, wind speed, and cumulative hours of snow and rain. Two time series plots of the dataset are shown below: the first shows PM2.5 over the full 5 years, and the second overlays the annual PM2.5 curves (different years plotted on top of each other).
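
A minimal loading and plotting sketch is shown below. The file name and column names assumed here follow the UCI "Beijing PM2.5" CSV and are assumptions, not necessarily the exact preprocessing used in this project.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the hourly data.  Column names (year/month/day/hour, pm2.5, DEWP,
# TEMP, PRES, cbwd, Iws, Is, Ir) follow the UCI CSV; adjust if your copy differs.
df = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
df.index = pd.to_datetime(df[["year", "month", "day", "hour"]])
df = df.drop(columns=["No", "year", "month", "day", "hour"])

# Simple fill for the missing PM2.5 values at the start of the series.
df["pm2.5"] = df["pm2.5"].ffill().bfill()

# Hourly series over the full 5 years, plus a daily average reused later
# by the SARIMA baseline.
df["pm2.5"].plot(figsize=(12, 4), title="Hourly PM2.5, 2010-2014")
plt.show()
daily = df["pm2.5"].resample("D").mean()
```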

(Figures: hourly PM2.5 over the full 5 years; annual PM2.5 curves overlaid by year)

Some of the observations are:

The correlation matrix for the numerical features and the distribution of PM2.5 are shown below.

(Figures: correlation matrix of the numerical features; distribution of PM2.5)
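
A short sketch of how these summaries can be computed, continuing from the loading sketch above (the column names again follow the UCI CSV and are assumptions):

```python
import matplotlib.pyplot as plt

# Correlation of the numerical features with PM2.5, plus the target distribution.
numeric_cols = ["pm2.5", "DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"]
print(df[numeric_cols].corr()["pm2.5"].sort_values(ascending=False))

df["pm2.5"].hist(bins=50)
plt.title("Distribution of hourly PM2.5")
plt.xlabel("PM2.5")
plt.show()
```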


Three Baseline Models & Predictions

Before building RNN models, three baseline (alternative) models are built for prediction. This step is important because the final RNN model can only be accepted if it beats the performance of these cheaper and simpler alternatives. For this problem, the following three types of models are used as baselines:

    1. Seasonal ARIMA model - classical linear time series model for seasonal data
    2. Gradient Boosting - ensemble tree methods (popular for structured data)
    3. Densely Connected Network


1. Seasonal ARIMA model

(Figure: ETS decomposition of the daily PM2.5 series)

The time series model uses only the target (PM2.5) and not the feature data (it is therefore not a supervised learning problem). From the ETS decomposition plot, the PM2.5 data appears to be stationary (which agrees with the ADF test result - a very small p-value) and shows strong seasonal patterns.
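
A minimal sketch of the decomposition and stationarity check using statsmodels, reusing the daily average series from the EDA sketch. The weekly seasonal period used here is an assumption for illustration.

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# ETS-style decomposition of the daily average PM2.5 ("daily" from the EDA sketch).
result = seasonal_decompose(daily, model="additive", period=7)
result.plot()
plt.show()

# ADF test: a very small p-value rejects the unit-root null, i.e. stationary.
adf_stat, p_value, *_ = adfuller(daily.dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
```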

(Figure: SARIMA predictions vs. observed daily PM2.5)

The prediction result from a fitted ARIMA(2,0,3)(0,1,0) model is shown above. Note that the SARIMA model is fitted on the daily average PM2.5 data: ARIMA models are generally designed for short seasonal periods (e.g. 12 for monthly data), and long periods cause estimation challenges.
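
A sketch of this fit with statsmodels is given below. The seasonal period is not stated above, so s = 7 (weekly) is an assumption, as is the 60-day hold-out window.

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit ARIMA(2,0,3)(0,1,0)_s on the daily average series ("daily" from the EDA sketch).
train, test = daily[:-60], daily[-60:]
sarima = SARIMAX(train, order=(2, 0, 3), seasonal_order=(0, 1, 0, 7))
fit = sarima.fit(disp=False)

# Forecast the hold-out window and report the mean absolute error.
forecast = fit.forecast(steps=len(test))
mae = abs(forecast.values - test.values).mean()
print(f"SARIMA hold-out MAE: {mae:.2f}")
```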

Here, it can be seen that the SARIMA model does a decent job of capturing the overall pattern, which shows that the PM2.5 data has a strong global ordering pattern.

Remarks:


2. Gradient Boosting Machine (GBM)

Gradient boosting is a very popular machine learning algorithm for structured data. It builds an ensemble of weak learners (trees) sequentially, where each tree learns from the errors of the previous ones and gradually improves performance (in contrast to Random Forest, which builds an ensemble of independent trees). In practice, a GBM is much easier to tune than a neural network (fewer hyperparameters) and is more computationally efficient. In this case, only two parameters are tuned: the number of trees and the learning rate. A minimal sketch of this setup is shown below.
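
The sketch below uses scikit-learn's GradientBoostingRegressor on the hourly data from the EDA sketch. The one-hot encoding of wind direction (cbwd) and the 80/20 chronological split are illustrative choices, not necessarily the exact setup used in this project.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Hourly supervised setup: predict PM2.5 from the other features.
X = pd.get_dummies(df.drop(columns=["pm2.5"]), columns=["cbwd"])
y = df["pm2.5"]

split = int(len(df) * 0.8)                      # no shuffling for time series
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Only the two hyperparameters mentioned above are tuned.
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05)
gbm.fit(X_train, y_train)
print("Hourly MAE:", mean_absolute_error(y_test, gbm.predict(X_test)))
```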

(Figure: GBM predictions vs. observed daily PM2.5)

The prediction result from the GBM is shown above. Note that the GBM makes hourly predictions (in the graph, the hourly predicted values are averaged to obtain daily predictions). The prediction accuracy is better than the seasonal ARIMA model (the MAE is much lower). The MAE is still high, but given how noisy PM2.5 is, the performance is acceptable (also, little effort was spent tuning the hyperparameters due to time constraints).

Remarks:


3. Densely Connected Networks

Before actually building an RNN, a simple densely connected model (with a single hidden layer) is built to see how well the data can be predicted when its sequence structure is ignored. A minimal sketch is shown below.
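
This sketch frames each sample as a flattened window of the previous hours of features, with the PM2.5 at the end of the window as the label. The 24-hour lookback, normalization, and layer width are illustrative assumptions; X and y come from the GBM sketch above.

```python
import numpy as np
from tensorflow import keras

lookback = 24
values = X.to_numpy(dtype="float32")
values = (values - values.mean(axis=0)) / (values.std(axis=0) + 1e-8)
target = y.to_numpy(dtype="float32")
n_features = values.shape[1]

# Flattened windows: the dense network sees no ordering within the window.
samples = np.stack([values[i:i + lookback].ravel()
                    for i in range(len(values) - lookback)])
labels = target[lookback:]

dense_model = keras.Sequential([
    keras.Input(shape=(lookback * n_features,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
dense_model.compile(optimizer="adam", loss="mae")
dense_model.fit(samples, labels, epochs=20, batch_size=128,
                validation_split=0.2)
```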

(Figures: training vs. validation loss for the DNN; DNN predictions vs. observed PM2.5)

One thing to note is that the model shows significant overfitting (the validation errors stay flat, well above the training errors). This may also indicate that the hypothesis class is not rich enough to fit the data well (it is too simple and shallow). The prediction performance is better than the seasonal ARIMA model but not as good as the Gradient Boosting machine.

Remarks:


Simple RNN

GRU vs LSTM

GRUs are still relatively new compared to the well-known LSTM (Long Short-Term Memory) layers for RNNs. They use a similar gating mechanism to LSTM but have fewer gates and are therefore more computationally efficient (especially for long sequences, where training RNNs can be very expensive). In many applications GRUs are believed to perform comparably to LSTMs (although in theory the gain in computational efficiency is likely to come at the cost of some representational power). Because time constraints are a big issue for this task, GRU layers are used instead of LSTM.


The first step toward the final RNN model is a simple RNN with a single small GRU layer (the same size as the DNN model); a sketch is shown after the plots below. Compared to the DNN model, the prediction accuracy improves significantly but is still not as good as the Gradient Boosting machine. Two things can be concluded:

(Figures: training vs. validation loss for the simple GRU model; GRU predictions vs. observed PM2.5)
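
A minimal Keras sketch of the single-GRU model, reusing the windowed data from the dense-model sketch (values, labels, lookback, n_features); the windows are kept in sequence form so the GRU can exploit the ordering.

```python
import numpy as np
from tensorflow import keras

# Sequence-shaped samples: (num_samples, lookback, n_features).
seq_samples = np.stack([values[i:i + lookback]
                        for i in range(len(values) - lookback)])

gru_model = keras.Sequential([
    keras.Input(shape=(lookback, n_features)),
    keras.layers.GRU(32),
    keras.layers.Dense(1),
])
gru_model.compile(optimizer="adam", loss="mae")
gru_model.fit(seq_samples, labels, epochs=20, batch_size=128,
              validation_split=0.2)
```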

Simple RNN - Regularized

In general, overfitting can be reduced by the following methods:

(Figures: training vs. validation loss for the regularized GRU model; predictions vs. observed PM2.5)

Here, dropout (with a rate of 0.2) is applied to both the input units and the recurrent units, as sketched below. The validation scores are now more consistent with the training scores (although they still fluctuate considerably). It is also not surprising that the prediction performance of the regularized version is very similar to the unregularized version: regularization in general does not improve the accuracy of a model, it only reduces overfitting.
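
A sketch of the regularized version (continuing from the simple GRU sketch): `dropout` applies to the input connections and `recurrent_dropout` to the recurrent connections.

```python
from tensorflow import keras

reg_model = keras.Sequential([
    keras.Input(shape=(lookback, n_features)),
    keras.layers.GRU(32, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.Dense(1),
])
reg_model.compile(optimizer="adam", loss="mae")
reg_model.fit(seq_samples, labels, epochs=20, batch_size=128,
              validation_split=0.2)
```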

Now that the RNN is no longer overfitting but seems to have hit a performance bottleneck, the next step is to increase the capacity of the model.

More Complex RNN

To increase the capacity of the model, a more complex RNN is used (two GRU layers, each with more units). Once again, dropout regularization is used to reduce overfitting. A sketch is shown below.
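
In this sketch the first GRU layer must return full sequences so the second layer receives 3-D input; the unit counts (64) are illustrative assumptions, and seq_samples/labels are reused from the earlier sketches.

```python
from tensorflow import keras

deep_model = keras.Sequential([
    keras.Input(shape=(lookback, n_features)),
    keras.layers.GRU(64, dropout=0.2, recurrent_dropout=0.2,
                     return_sequences=True),
    keras.layers.GRU(64, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.Dense(1),
])
deep_model.compile(optimizer="adam", loss="mae")
deep_model.fit(seq_samples, labels, epochs=30, batch_size=128,
               validation_split=0.2)
```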

(Figures: training vs. validation loss for the two-layer GRU model; predictions vs. observed PM2.5)

By increasing the capacity of the model, the prediction accuracy improves significantly (and the model does not appear to overfit significantly). From the plot, the model is quite successful in capturing most of the patterns, even many of the complicated ones.

Remarks:


Combining CNN and RNN together

One newer approach to processing long sequences is to combine a 1D ConvNet with an RNN, using the 1D CNN as a preprocessing step before the RNN. The idea is to convert long input sequences into shorter sequences of higher-level features and then feed those features into the RNN. The main advantage of this approach is that it provides a computationally efficient alternative to a pure RNN, as 1D CNNs are much cheaper to run.

Here, the model stacks 1D convolution layers and 1D max-pooling layers before a two-layer RNN (the same size as the previous model), as sketched below.
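
A sketch of this architecture: the Conv1D and MaxPooling1D layers downsample the input sequence before the same two-layer GRU stack. The filter counts and kernel sizes are assumptions for illustration, and the computational benefit grows with lookback windows longer than the 24 hours used in the earlier sketches.

```python
from tensorflow import keras

cnn_rnn = keras.Sequential([
    keras.Input(shape=(lookback, n_features)),
    keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    keras.layers.MaxPooling1D(pool_size=2),
    keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    keras.layers.MaxPooling1D(pool_size=2),
    keras.layers.GRU(64, dropout=0.2, return_sequences=True),
    keras.layers.GRU(64, dropout=0.2),
    keras.layers.Dense(1),
])
cnn_rnn.compile(optimizer="adam", loss="mae")
cnn_rnn.fit(seq_samples, labels, epochs=30, batch_size=128,
            validation_split=0.2)
```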

(Figures: training vs. validation loss for the CNN-RNN model; predictions vs. observed PM2.5)

As seen from the results, the 1D CNN-RNN does not perform as well as the two-layer RNN (possibly the cost of working from the CNN's preprocessed, higher-level features), but it still outperforms the simpler RNNs and the baseline models. It should be noted that the training time for this model is much shorter than for the same-sized RNN, which makes it a good computationally efficient alternative to complex RNNs on long sequences (at the cost of some prediction accuracy).

Remarks:


Summary

The goal of this exercise is to understand various types of prediction models for high-resolution time series data. The models used include a seasonal ARIMA, a Gradient Boosting Machine, a Densely Connected Network, Recurrent Neural Networks, and a 1D Convolutional Neural Network combined with an RNN.

Although RNNs seem to be the best choice for this particular problem, the alternative models have their own advantages, such as computational efficiency and uncertainty estimates. Therefore, choosing the best model in many applications usually depends not only on prediction accuracy but also on these other considerations.


Last updated on Dec 1, 2019