Can Transformers learn to add like humans?

DataScienceUB
8 min read · Mar 22, 2022

by Pere Gilabert, PhD student at the University of Barcelona

Photo by Antoine Dautry on Unsplash

Contents

  1. Introduction
  2. Create the dataset
  3. Transformer model architecture
  4. Train and test
  5. Self-Attention visualization

Introduction

Transformers are here to stay. Many problems that until now were solved sequentially with recurrent networks (LSTM, GRU…) can now be solved better with Transformer-based architectures. This new type of model uses attention over sequences to build a longer-term memory than a recurrent network can.

In this post, we are going to adapt a classic recurrent-network problem that can be found on Keras’ blog. We will build a Transformer-based architecture that uses a self-attention module and encodes the position of every input element.

The problem we solve is the following:

Given a text sequence containing an addition or a subtraction operator, we aim to predict the result of that operation.

For example, given “123+456” we want the model to predict “=579”. In other words, we want the model to learn to add or subtract numbers, prefixing each predicted result with the equals character.

For the sake of simplicity, we assume that the operands in the operation do not exceed a certain length: each of the numbers is limited to a maximum of {NUM_LENGTH} digits.

Imports

We are going to use the Keras framework to build the model and TensorFlow to define some operations inside the attention layer. We will also use matplotlib to draw the learning curves.
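The original import cell is not reproduced here; a minimal set of imports that covers everything used in this post could look like the following (assuming TensorFlow 2.x, where Keras is bundled as tf.keras):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
```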

Create the dataset

To generate the data, we use the tf.keras.utils.Sequence class, which allows us to produce new examples at every training step. In addition, we define some helper functions to transform text sequences into a format that can be fed to the model.

We use one-hot encoding for the inputs and outputs: each digit 0–9 is represented with a 13-position vector with a 1 in the position indicated by the digit itself (e.g., the digit 2 is encoded as [0,0,1,0,0,0,0,0,0,0,0,0,0]). The symbols “+”, “-” and “=” are encoded as [0,0,0,0,0,0,0,0,0,0,1,0,0], [0,0,0,0,0,0,0,0,0,0,0,1,0] and [0,0,0,0,0,0,0,0,0,0,0,0,1], respectively.
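As a reference, a minimal encoding helper could look like the sketch below. The vocabulary order (digits first, then “+”, “-”, “=”) matches the description above; the function name encode is just an illustrative choice, not necessarily the one used in the original code.

```python
import numpy as np

# Vocabulary: the ten digits followed by the three symbols, 13 characters in total.
VOCAB = "0123456789+-="
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

def encode(text):
    """One-hot encode a text sequence into a matrix of shape (len(text), 13)."""
    x = np.zeros((len(text), len(VOCAB)), dtype=np.int32)
    for i, char in enumerate(text):
        x[i, CHAR_TO_IDX[char]] = 1
    return x
```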

Here you have a random example:

Encoded version of 03144+259:
[[1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0]]

Encoded version of =03403:
[[0 0 0 0 0 0 0 0 0 0 0 0 1]
[1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0]]

We define a set of examples for training and another set for validation.
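A possible generator is sketched below. It reuses the encode() helper above; the class name AdditionSequence and the exact zero-padding scheme are assumptions made for illustration, so the original post may format the examples slightly differently.

```python
import numpy as np
import tensorflow as tf

NUM_LENGTH = 4  # maximum number of digits per operand (assumption)

class AdditionSequence(tf.keras.utils.Sequence):
    """Generates freshly sampled, one-hot-encoded (question, answer) batches."""

    def __init__(self, batch_size=256, steps_per_epoch=100):
        self.batch_size = batch_size
        self.steps_per_epoch = steps_per_epoch

    def __len__(self):
        return self.steps_per_epoch

    def __getitem__(self, idx):
        questions, answers = [], []
        for _ in range(self.batch_size):
            a = np.random.randint(0, 10 ** NUM_LENGTH)
            b = np.random.randint(0, 10 ** NUM_LENGTH)
            op = np.random.choice(["+", "-"])
            result = a + b if op == "+" else a - b
            # Zero-pad operands and result so every example has a fixed length,
            # e.g. "6952-8937" with answer "=-1985".
            question = f"{a:0{NUM_LENGTH}d}{op}{b:0{NUM_LENGTH}d}"
            answer = f"={result:0{NUM_LENGTH + 1}d}"
            questions.append(encode(question))
            answers.append(encode(answer))
        return (np.stack(questions).astype("float32"),
                np.stack(answers).astype("float32"))

train_generator = AdditionSequence(batch_size=256, steps_per_epoch=100)
val_generator = AdditionSequence(batch_size=256, steps_per_epoch=10)
```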

Transformer model architecture

Now comes the interesting part. We are going to implement a set of layers by extending the tf.keras.layers.Layer class and then combine them to create the final model.

Transformer Block

Let’s start by implementing the Transformer Block illustrated in the following image.

Image from: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

The input and output of this block have exactly the same dimension. A set of embeddings enters the block and is modified according to the relationships among them in the attention layer. Finally, they pass through a Multi-Layer Perceptron (MLP).

Multi-Head Self-Attention

The first block is the attention layer. We implement self-attention with multiple heads, the number of which is controlled by the {num_heads} parameter. Since we need to rebuild the full vector at the end of the attention layer, we have to ensure that the input dimension is divisible by the number of heads.

The operations we apply are the following ones:

  1. Project the input vector (input embeddings) to three spaces of the same dimension. This projection will define the vectors Q, K, and V (Query, Key, and Values respectively). We define the dimension of the Q, K and V vectors as the integer division between the input dimension and the number of heads. In this way, when concatenating the output of this layer, we will have vectors of the same length as the input vectors.
  2. Compute the dot product between the Q and K vectors to obtain a matrix of scores.
  3. Scale the matrix by the square root of d, the dimension of the projected vectors from step 1, and apply the softmax function to obtain a matrix of attention weights. We will return this matrix together with the new vectors, as it contains very relevant information about the relationships between the embeddings; this will later allow us to visualize those relationships.
  4. Multiply this new weights matrix by the vector V to obtain the final result.

All this can be expressed with the equation:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V

Image from: peltarion.com
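A sketch of such a layer is shown below. It closely follows the self-attention layer of the well-known Keras Transformer example; the class name MultiHeadSelfAttention and the decision to return the attention weights alongside the output are assumptions consistent with the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MultiHeadSelfAttention(layers.Layer):
    """Self-attention with several heads; also returns the attention weights."""

    def __init__(self, embed_dim, num_heads=8):
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.projection_dim = embed_dim // num_heads  # dimension d of each head
        # Step 1: linear projections for queries, keys and values.
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        # Step 2: dot product between queries and keys.
        score = tf.matmul(query, key, transpose_b=True)
        # Step 3: scale by sqrt(d) and apply softmax.
        d = tf.cast(tf.shape(key)[-1], tf.float32)
        weights = tf.nn.softmax(score / tf.math.sqrt(d), axis=-1)
        # Step 4: weight the values with the attention matrix.
        return tf.matmul(weights, value), weights

    def separate_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        query = self.separate_heads(self.query_dense(inputs), batch_size)
        key = self.separate_heads(self.key_dense(inputs), batch_size)
        value = self.separate_heads(self.value_dense(inputs), batch_size)
        attention, weights = self.attention(query, key, value)
        # Rebuild the (batch, seq_len, embed_dim) shape from the heads.
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        return self.combine_heads(concat), weights
```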

MLP

We now implement the last layer of this block: the MLP. It consists of two fully-connected layers with dropout and a GELU activation between them. This can be implemented as follows:
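Here is a sketch, under the assumption that a single dropout layer sits between the two dense layers (the exact dropout placement in the original code is not shown):

```python
import tensorflow as tf
from tensorflow.keras import layers

class MLP(layers.Layer):
    """Two fully-connected layers with a GELU activation and dropout in between."""

    def __init__(self, hidden_dim, out_dim, dropout_rate=0.1):
        super().__init__()
        self.dense1 = layers.Dense(hidden_dim, activation=tf.nn.gelu)
        self.dropout = layers.Dropout(dropout_rate)
        self.dense2 = layers.Dense(out_dim)

    def call(self, inputs, training=False):
        x = self.dense1(inputs)                    # first dense layer with GELU
        x = self.dropout(x, training=training)     # dropout between the two layers
        return self.dense2(x)                      # project back to the block's dimension
```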

With the implementation of these two layers, we can now build the Transformer Block.

We need two normalization layers (LayerNormalization) and residual connections to avoid vanishing gradients.
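A sketch of the block is shown below, following the pre-norm layout of the figure above (normalization before the attention and MLP sub-blocks) and reusing the two layers implemented earlier. Returning the attention weights as a second output is an assumption that will make the visualization at the end easier.

```python
from tensorflow.keras import layers

class TransformerBlock(layers.Layer):
    """Transformer encoder block: attention and MLP, each behind a residual connection."""

    def __init__(self, embed_dim, num_heads, mlp_dim, dropout_rate=0.1):
        super().__init__()
        self.attention = MultiHeadSelfAttention(embed_dim, num_heads)
        self.mlp = MLP(mlp_dim, embed_dim, dropout_rate)
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs, training=False):
        # Attention sub-block with a residual connection.
        attn_output, weights = self.attention(self.norm1(inputs))
        x = inputs + attn_output
        # MLP sub-block with a residual connection.
        return x + self.mlp(self.norm2(x), training=training), weights
```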

Finally, before building the final model, we define a layer to encode the position of each element of the sequence. Unlike a recurrent network, the Transformer has no notion of the order of the elements, so the position needs to be encoded somehow. The original Transformer used fixed functions based on the trigonometric sine and cosine. However, newer Transformer variants use an embedding learned during training, letting the model encode the position itself.

This layer assigns to each position of the input sequence its own embedding, which is added to the embedding of the character. This way, two identical characters will have different final embeddings depending on where they appear in the input text sequence.

Encoding of the character ‘1’ in position 0 and position 1. Image by author.
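A sketch of such a layer follows. The name PositionEncoder is the one the model code refers to later; using a learned layers.Embedding indexed by position is one straightforward way to implement it.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PositionEncoder(layers.Layer):
    """Adds a learned positional embedding to every element of the sequence."""

    def __init__(self, seq_len, embed_dim):
        super().__init__()
        self.seq_len = seq_len
        self.position_embedding = layers.Embedding(input_dim=seq_len, output_dim=embed_dim)

    def call(self, inputs):
        # One learned embedding per position, broadcast over the batch and
        # added to the character embeddings.
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return inputs + self.position_embedding(positions)
```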

The model

Let’s define some important parameters:
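The exact values are not reproduced here, so the ones below are plausible assumptions consistent with the rest of the post (4-digit numbers, 8 attention heads and 4 Transformer blocks, as mentioned in the visualization section):

```python
NUM_LENGTH = 4                    # maximum number of digits per operand
VOCAB_SIZE = 13                   # ten digits plus "+", "-" and "="
INPUT_LEN = 2 * NUM_LENGTH + 1    # e.g. "6952-8937" -> 9 characters
OUTPUT_LEN = NUM_LENGTH + 2       # e.g. "=-1985" or "=19998" -> 6 characters

EMBED_DIM = 64                    # embedding dimension (assumed; must be divisible by NUM_HEADS)
NUM_HEADS = 8                     # attention heads, as used in the visualization
NUM_LAYERS = 4                    # number of TransformerBlocks, as mentioned later
MLP_DIM = 128                     # hidden size of the MLP inside each block (assumed)
```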

Now, we are going to put together all the layers we have defined to create the final model.

  1. First, we project the one-hot-encoded vectors to a space of the dimension we want.
  2. Next, we add positional information to each of the vectors. To do this, we use the implemented PositionEncoder layer.
  3. We add a few TransformerBlock layers that contain the attention module.
  4. We combine the result to obtain the desired size.

In this example, as we are using 4-digit numbers, the input size is 9 positions (2·4+1), where the +1 accounts for the operation symbol. The expected output has 6 positions, since the sum of two 4-digit numbers can have up to 5 digits and we also need room for the equals symbol. A sketch of this assembly is shown below.
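The sketch puts together the layers defined above. The output head (flatten, dense, reshape and a softmax per output character) is one plausible way to "combine the result to obtain the desired size", not necessarily the exact head used in the original post.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model():
    inputs = layers.Input(shape=(INPUT_LEN, VOCAB_SIZE))
    # 1. Project the one-hot vectors to the embedding dimension.
    x = layers.Dense(EMBED_DIM)(inputs)
    # 2. Add the learned positional information.
    x = PositionEncoder(INPUT_LEN, EMBED_DIM)(x)
    # 3. Stack a few Transformer blocks (the attention weights are ignored here).
    for _ in range(NUM_LAYERS):
        x, _ = TransformerBlock(EMBED_DIM, NUM_HEADS, MLP_DIM)(x)
    # 4. Combine the result into the desired size: one softmax per output character.
    x = layers.Flatten()(x)
    x = layers.Dense(OUTPUT_LEN * VOCAB_SIZE)(x)
    x = layers.Reshape((OUTPUT_LEN, VOCAB_SIZE))(x)
    outputs = layers.Softmax(axis=-1)(x)
    return tf.keras.Model(inputs, outputs)

model = build_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```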


Train and test

Now we can train the model. We train it for a maximum of 300 epochs, stopping early if the validation loss does not improve for 20 epochs. Each epoch consists of 100 batches of 256 examples each.
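A sketch of this training setup, reusing the generators and model defined above; restore_best_weights is an assumption, the description only requires monitoring the validation loss with a patience of 20 epochs.

```python
import matplotlib.pyplot as plt
import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True
)

history = model.fit(
    train_generator,                 # 100 batches of 256 examples per epoch
    validation_data=val_generator,
    epochs=300,
    callbacks=[early_stopping],
)

# Learning curves.
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.legend()
plt.show()
```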

PREDICTION ACCURACY (%):
Train: 99.719, Test: 99.609

Very well! We have obtained a near-perfect result on the training set, which also transfers to the validation set.

Here you have some examples:

Ground-truth: 6952-8937 =-1985
Prediction: 6952-8937 =-1985

Ground-truth: 7137-1240 =05897
Prediction: 7137-1240 =05897

Ground-truth: 2033-2351 =-0318
Prediction: 2033-2351 =-0318

Self-Attention visualization

Finally, let’s visualize the weight matrices in the attention layer so that we can understand the relationships learnt by the model.

First, we visualize the average attention of all 8 heads of a particular layer, in this case the first one.
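A rough sketch of how these maps can be obtained is shown below. It assumes the chain-like model built above and the encode() helper, and simply re-runs one example layer by layer, collecting the weights returned by each TransformerBlock.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers

def attention_maps(model, text):
    """Run one encoded example through the network and collect the attention weights."""
    x = tf.constant(encode(text)[np.newaxis, ...], dtype=tf.float32)
    weights_per_layer = []
    for layer in model.layers:
        if isinstance(layer, layers.InputLayer):
            continue
        if isinstance(layer, TransformerBlock):
            x, w = layer(x)                     # w: (1, num_heads, seq_len, seq_len)
            weights_per_layer.append(w.numpy()[0])
        else:
            x = layer(x)
    return weights_per_layer

text = "1234+4567"
maps = attention_maps(model, text)

# Average the 8 heads of the first TransformerBlock and plot the result.
plt.imshow(maps[0].mean(axis=0))
plt.xticks(range(len(text)), list(text))
plt.yticks(range(len(text)), list(text))
plt.colorbar()
plt.show()
```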


The result we get is very interesting! The model learns to ‘detect’ where the operation character is located (the bright vertical bar) and understands that it is very important information for the output. We also observe how it learns to relate the digits that have to be combined with each other. For example, in the first weights of the figure we can clearly see a strong relationship between 4 and 1, 5 and 2, 6 and 3, and 7 and 4, which are the digits that need to be combined to compute the result.

We can go even further and show all matrices of all layers and all available heads. As we have 4 layers and 8 heads, this is a total of 32 figures.

This is the output for the input “1234+4567”:

Fun fact: look at the first image of the second row. Do you know what this yellow diagonal represents? It is what we used to call the ‘carry’ in school: when adding two digits, if the result is greater than 9, it is carried over to the next column. This model is learning to add like a human!
