Temporal Difference Learning Bean

The temporal difference learning model is a reinforcement learning neural network model used in time-series forecasting problems. It is a feed-forward model that uses an arbitrary-length sequence of real-valued input vectors followed by a single target output vector to train the network. In contrast to back propagation, which is a supervised training algorithm, the temporal difference model receives no explicit target output pattern at each time step. Instead, it uses the difference between successive outputs as the error measure. An error-minimization method called gradient descent is used to reduce the output error by adjusting the network weights.
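To make that distinction concrete, the following minimal Java sketch contrasts the two error terms for a single prediction; the names are illustrative only and are not part of the bean's interface.

    class TdErrorSketch {
        // Back propagation: the error is (target - output), and the target
        // must be supplied explicitly for every training pattern.
        static double supervisedError(double target, double output) {
            return target - output;
        }

        // Temporal difference: the error is the difference between two
        // successive predictions, so no explicit target is needed until
        // the end of the sequence.
        static double temporalDifferenceError(double nextPrediction,
                                              double currentPrediction) {
            return nextPrediction - currentPrediction;
        }
    }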

The training algorithm, like back propagation, is computationally expensive. The training data set consists of a sequence of real-valued input vectors, an output or target vector, and a flag field which identifies the training pattern's position in a training sequence. Here is an example:

     input    out  flag
    0 0 1 0 0   0     1
    0 1 0 0 0   0     0
    1 0 0 0 0   0     0
    0 1 0 0 0   0     0
    1 0 0 0 0   0     0
    0 0 0 0 0   1     2

In this example, a flag value of 1 indicates the start of a training sequence. The next pattern's flag value of 0 indicates a training sequence is in process. The temporal difference algorithm presents the input pattern (0 1 0 0 0) to the network and produces an output, or prediction. This output value is used as the target output for the previous pattern. The network state is actually changed back to the previous (0 0 1 0 0) state and the weights are adjusted using the difference between the successive outputs as the error. This procedure continues until a pattern with a flag value of 2 is reached. Since this signifies the end of a training sequence, the input value is ignored, but the output value is taken as the target value for the error calculations on the previous pattern.
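The sketch below shows roughly how such a sequence could be walked. It assumes a hypothetical HypotheticalTdNet interface; neither the interface nor its method names are part of the bean's API.

    // Illustrative interface only; these method names are not part of the bean.
    interface HypotheticalTdNet {
        double forward(double[] input);               // propagate one input vector
        void adjustForPreviousPattern(double error);  // restore previous state, apply error
    }

    class SequenceTrainingSketch {
        // Walks one training sequence. Flag values: 1 = start of sequence,
        // 0 = sequence in process, 2 = end of sequence.
        static void trainSequence(double[][] inputs, double[] targets,
                                  int[] flags, HypotheticalTdNet net) {
            double previousPrediction = 0.0;
            for (int t = 0; t < inputs.length; t++) {
                if (flags[t] == 2) {
                    // End of sequence: the output field supplies the final target;
                    // the input vector on this record is ignored.
                    net.adjustForPreviousPattern(targets[t] - previousPrediction);
                    return;
                }
                double prediction = net.forward(inputs[t]);
                if (flags[t] == 0) {
                    // The new prediction is the target for the previous pattern.
                    net.adjustForPreviousPattern(prediction - previousPrediction);
                }
                previousPrediction = prediction;
            }
        }
    }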

Although the desired output pattern is not used unless the flag field is equal to 2, each record in the training data file must consist of an input vector, an output vector, and the flag value. A training file can contain many sets of training sequences. An epoch consists of the complete set of training sequences.

When used in Adaptive Critic mode, the direct reinforcement values are placed in the position of the output vector. Adaptive Critic mode is in use when the Gamma parameter is greater than zero.
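In this mode, the target for the previous prediction is formed from the reinforcement value and the discounted next output (see the Gamma parameter under Key Parameters). A minimal sketch, with illustrative names only:

    class AdaptiveCriticSketch {
        // Error for the previous prediction in Adaptive Critic mode: the target
        // is the reinforcement value plus the discounted next network output.
        static double adaptiveCriticError(double reinforcement, double gamma,
                                          double nextOutput, double previousOutput) {
            double target = reinforcement + gamma * nextOutput;
            return target - previousOutput;
        }
    }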

Implementation

The Temporal Difference model is implemented as described by Sutton in his article "Learning to Predict by the Methods of Temporal Differences." It is based on the back propagation model. The activation function is the logistic function.
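For reference, here is a sketch of the logistic activation as it is commonly written, including the temperature parameter and the symmetric variant described under Key Parameters; the bean's internal implementation may differ in detail.

    class ActivationSketch {
        // Standard logistic activation, output in (0, 1); temp controls the slope.
        static double logistic(double netInput, double temp) {
            return 1.0 / (1.0 + Math.exp(-netInput / temp));
        }

        // Symmetric variant, output in (-0.5, +0.5).
        static double symmetricLogistic(double netInput, double temp) {
            return logistic(netInput, temp) - 0.5;
        }
    }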

Each unit receives inputs from all units in the preceding layer, along with a bias or threshold weight. A learn rate multiplier can be used to modify the learn rate applied to selected layers of weights. You can update weights either after each pattern is presented or only after a complete epoch is presented.
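One way to picture the per-layer learn rate is as the base learn rate scaled by the multiplier once for each layer of weights counted back from the output layer. This scaling rule is an assumption, sketched here for illustration only.

    class LearnRateSketch {
        // Hypothetical per-layer learn rate: the base rate scaled by the
        // multiplier once for each layer of weights behind the output layer
        // (layersFromOutput = 0 for the output layer's own weights).
        static double effectiveLearnRate(double learnRate, double multiplier,
                                         int layersFromOutput) {
            return learnRate * Math.pow(multiplier, layersFromOutput);
        }
    }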

A tolerance parameter sets a threshold on the acceptable error between the actual output activation and the desired value. This tolerance should usually be set to 0.0; because the differences between successive predictions are typically small, a larger tolerance can prevent any learning from occurring.

Error information consists of root-mean-square (RMS) errors on the last pattern, the worst pattern in the epoch, and the average of all patterns. Also provided are the number of bad outputs (|output - desired| > tolerance) for the last pattern and the number of bad patterns (patterns with at least one bad output).
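For illustration, these statistics for a single pattern might be computed along the following lines (the names are not part of the bean's API):

    class ErrorStatsSketch {
        // Root-mean-square error over one pattern's outputs.
        static double rmsError(double[] output, double[] desired) {
            double sumSquares = 0.0;
            for (int i = 0; i < output.length; i++) {
                double diff = output[i] - desired[i];
                sumSquares += diff * diff;
            }
            return Math.sqrt(sumSquares / output.length);
        }

        // Number of outputs whose absolute error exceeds the tolerance.
        static int badOutputs(double[] output, double[] desired, double tolerance) {
            int count = 0;
            for (int i = 0; i < output.length; i++) {
                if (Math.abs(output[i] - desired[i]) > tolerance) {
                    count++;
                }
            }
            return count;
        }
    }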

Architecture

The number of inputs and outputs is determined by your problem definition and your knowledge representation decisions. An additional parameter is the number of hidden units. It is common practice to keep this number small in order to improve the network's ability to generalize. However, if you select too small a number, the network may not converge on the training set, or may take a very long time to do so. Setting this parameter is part of the art of specifying and using back propagation networks in applications.

Usually only one hidden layer is used. In some cases, you may be able to do without any hidden layers, which results in greatly improved training time. If the function you are trying to learn is very complex, you may have to use two or three hidden layers, with a corresponding degradation in training speed.

Architecture parameters

When creating a temporal difference network, you must specify these parameters:

Number of inputs
Sets the number of units allocated for the input layer. This must be an integer value greater than or equal to 1.
Number of hidden units in layer 1
Sets the number of units allocated for the first hidden layer (if any). This must be an integer value greater than or equal to 0.
Number of hidden units in layer 2
Sets the number of units allocated for the second hidden layer (if any). This must be an integer value greater than or equal to 0.
Number of hidden units in layer 3
Sets the number of units allocated for the third hidden layer (if any). This must be an integer value greater than or equal to 0.
Number of outputs
Sets the number of units allocated for the output layer. This must be an integer value greater than or equal to 1.

Key Parameters

Learn rate
Controls how much the weights are changed during a weight update. The larger the value, the more the weights are changed.
Learn rate multiplier
Controls the learn rate for each layer in the network. A value of 1 means that all units in all layers will use the learn rate value. A value less than 1 means that preceding layers of units will use a learn rate value less than the output layer. A value greater than 1 means that preceding layers of units will use a higher learn rate value than the output layer.
Activation function
Selects whether the activation function is standard, ranging from 0 to 1 (OFF) or symmetric, ranging from -0.5 to +0.5 (ON).
Activation temp
Controls the slope of the activation function. A value of 1 gives the standard logistic function. A value of 10 gives an almost linear function. A value of 0.1 yields an almost binary or step function.
Lambda
This is the key parameter in temporal difference networks. It controls how the errors between successive predictions are passed back in time. It is an exponential weighting factor that controls the temporal credit assignment, which is the basis of reinforcement learning (see the sketch following this list). A typical value for lambda is 0.5.
Last RMS error
Is updated after each step and shows the root-mean-square (RMS) error for a single training pattern.
Ave RMS error
Is updated after each epoch and shows the average RMS error over all of the patterns.
Error tolerance
Sets the value for the acceptable difference between the desired output value and the actual output value. This parameter should usually be set to 0.0 or to a very small value (0.001), because the difference between successive predictions is usually small.
Compute Sensitivity
Controls whether the accumulated error index is computed during training. This information indicates the relative importance of the inputs to the error.
Epoch updates
Controls whether the network weights are updated after every pattern presentation (False) or only after a complete training epoch (True).
Num of bad outputs
Indicates the number of output units that are out of the specified error tolerance for the training pattern.
BadPatRatio
Indicates the number of patterns in the previous epoch which have errors above tolerance divided by the total number of patterns.
Max pattern error
Is updated after each epoch and shows the largest root-mean-square (RMS) error for a single training pattern. Use this if you are training a network to respond to every pattern in the training set with a certain degree of error.
Gamma
Controls the operating mode of the temporal difference network. If set to 0.0, the network is a regular temporal difference learning network that calculates its errors by taking the difference between successive predictions. If it is set to a value greater than 0.0, the network is an Adaptive Critic network. The target output is taken to be Reinforcement + (Gamma * NetOutput). The Reinforcement value is taken from the network input buffer: instead of being ignored, the target value in the training data is used as the reinforcement value.
Reinforcement
Controls the target value when the network is used in Adaptive Critic mode. The reinforcement value is actually taken from the network Input Buffer.
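The role of Lambda can be pictured as an exponentially decaying trace of past prediction gradients: each weight change is the learn rate times the TD error times the accumulated trace, following Sutton's TD(lambda) formulation. A minimal sketch with illustrative names (not the bean's API):

    class LambdaSketch {
        // Eligibility trace for one weight: decay the old trace by lambda and
        // add the gradient of the current prediction with respect to the weight.
        static double updateTrace(double trace, double lambda, double gradient) {
            return lambda * trace + gradient;
        }

        // Weight change for one step: learn rate times the TD error (the
        // difference between successive predictions) times the trace.
        static double weightDelta(double learnRate, double tdError, double trace) {
            return learnRate * tdError * trace;
        }
    }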

Training

There are several important parameters to set during training. The lambda setting is important since it controls the temporal credit assignment process. Larger values, those approaching 1.0, pass the error information farther back in the training sequence. Smaller values, those approaching 0.0, concentrate the weight changes on the most recent predictions.

The learning rate serves the same function as in back propagation. It specifies the step size when making weight updates. A value of 0.3 to 0.7 is common for the learning rate. As with back propagation, the best approach to setting these parameters is often trial and error.

The error tolerance setting controls the training process. If the data set contains binary targets (0,1), then the tolerance parameter is usually set to 0.0 or 0.01. This means that the output is considered "good" when it is within 0.01 of the desired output (that is, 0.99 for a 1, 0.01 for a 0). When every output is within the tolerance range of the desired output value, the network status is changed to LOCKED and weight updates are stopped.

You can also set the epoch update flag. If set to TRUE, the weights are changed only after every complete cycle through the training set. This is true gradient descent. When the epoch update flag is set to FALSE, weights are updated after each pattern is presented. For most problems, the network converges more quickly with the epoch update flag set to FALSE.

It is very important NOT to randomize the input patterns when training a temporal difference network. Since the whole goal of the training process is for the network to learn to predict based on the sequence of inputs, randomly presenting inputs will inhibit the training process.

Running

While training may be slow, the run-time performance of a temporal difference network is relatively fast. The input vector is propagated through the network by multiplying it by a weight matrix and then passing it through an activation function. This vector is then multiplied by the succeeding weight matrices in a similar manner, depending on the number of hidden layers in the network. The outputs of the last layer of units are returned to the application program in the output array.
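A rough sketch of that forward pass, chaining each layer's output into the next layer's input; the bean's internal representation may differ.

    class ForwardPassSketch {
        // One layer: weighted sum plus bias, then the logistic activation.
        static double[] layerForward(double[] input, double[][] weights,
                                     double[] bias, double temp) {
            double[] output = new double[weights.length];
            for (int j = 0; j < weights.length; j++) {
                double net = bias[j];
                for (int i = 0; i < input.length; i++) {
                    net += weights[j][i] * input[i];
                }
                output[j] = 1.0 / (1.0 + Math.exp(-net / temp));
            }
            return output;
        }

        // Full run: the output of each layer becomes the input to the next.
        static double[] run(double[] input, double[][][] layerWeights,
                            double[][] layerBias, double temp) {
            double[] activation = input;
            for (int layer = 0; layer < layerWeights.length; layer++) {
                activation = layerForward(activation, layerWeights[layer],
                                          layerBias[layer], temp);
            }
            return activation;
        }
    }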