Can Transformers learn to add like humans?

Photo by Antoine Dautry on Unsplash

Contents

  1. Introduction
  2. Create the dataset
  3. Transformer model architecture
  4. Train and test
  5. Self-Attention visualization

Introduction

Imports
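The import cell itself isn't reproduced in this post; a minimal set that covers the rest of the walk-through (NumPy for the data, TensorFlow/Keras for the model, Matplotlib for the figures) would look roughly like this:

```python
# Minimal imports for the walk-through (assumed; the original cell is not shown).
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
```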

Create the dataset

Encoded version of 03144+259:
[[1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0]]

Encoded version of =03403:
[[0 0 0 0 0 0 0 0 0 0 0 0 1]
[1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0]
[1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0]]
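The dataset code isn't shown above, but the two printed matrices follow a clear scheme: every character of the question (e.g. 03144+259) and of the answer (e.g. =03403) is one-hot encoded over a 13-symbol vocabulary, the digits 0–9 plus +, - and =. A minimal sketch that reproduces this encoding (the helper names are mine, not the notebook's):

```python
import numpy as np

VOCAB = "0123456789+-="          # 13 symbols: digits, the two operators and '='
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

def one_hot_encode(text: str) -> np.ndarray:
    """Return a (len(text), 13) one-hot matrix, one row per character."""
    encoded = np.zeros((len(text), len(VOCAB)), dtype=int)
    for row, char in enumerate(text):
        encoded[row, CHAR_TO_IDX[char]] = 1
    return encoded

print(one_hot_encode("03144+259"))   # 9 x 13, matches the first matrix above
print(one_hot_encode("=03403"))      # 6 x 13, matches the second matrix above
```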

Transformer model architecture

Transformer Block

Image from: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Multi-Head Self-Attention

  1. Project the input vectors (the input embeddings) into three spaces of the same dimension. This projection defines the vectors Q, K, and V (Query, Key, and Value respectively). We set the dimension of the Q, K, and V vectors to the integer division of the input dimension by the number of heads, so that concatenating the outputs of all heads gives vectors of the same length as the input vectors.
  2. Compute the dot product between the Q and K vectors to obtain a matrix of attention weights.
  3. Scale this matrix by the square root of the dimension d defined in step 1 and apply the softmax function to obtain a new matrix of weights. We return this matrix together with the new vectors, since it captures the relationships between the embeddings and will let us visualize them later.
  4. Multiply this new weight matrix by the vectors V to obtain the final result. (A sketch of these four steps follows the figure below.)
Image from: peltarion.com
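Below is a minimal Keras sketch of these four steps (the layer and variable names are mine; the notebook's exact implementation may differ). It returns the attention weights together with the output so they can be visualized later:

```python
import tensorflow as tf
from tensorflow.keras import layers

class MultiHeadSelfAttention(layers.Layer):
    """Sketch of the four steps above; returns (output, attention_weights)."""

    def __init__(self, embed_dim: int, num_heads: int, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads        # step 1: per-head dimension d
        self.wq = layers.Dense(embed_dim)             # Q projection
        self.wk = layers.Dense(embed_dim)             # K projection
        self.wv = layers.Dense(embed_dim)             # V projection
        self.out_proj = layers.Dense(embed_dim)       # merge the heads back

    def _split_heads(self, x, batch_size):
        # (batch, seq, embed_dim) -> (batch, heads, seq, head_dim)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.head_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        q = self._split_heads(self.wq(inputs), batch_size)           # step 1
        k = self._split_heads(self.wk(inputs), batch_size)
        v = self._split_heads(self.wv(inputs), batch_size)

        scores = tf.matmul(q, k, transpose_b=True)                    # step 2
        scores /= tf.math.sqrt(tf.cast(self.head_dim, tf.float32))    # step 3
        weights = tf.nn.softmax(scores, axis=-1)

        attended = tf.matmul(weights, v)                               # step 4
        attended = tf.transpose(attended, perm=[0, 2, 1, 3])
        attended = tf.reshape(attended, (batch_size, -1, self.embed_dim))
        return self.out_proj(attended), weights
```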

MLP
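The MLP sub-layer isn't spelled out in the post; in a standard Transformer block it is a small two-layer feed-forward network applied after the attention output, with residual connections and layer normalization around both sub-layers. A hedged sketch that wires the attention layer from above into a full block (the hidden size, activation and dropout rate are assumptions):

```python
# Uses keras, layers and the MultiHeadSelfAttention sketched above.
class TransformerBlock(layers.Layer):
    """Attention + MLP sub-layers with residual connections and layer norm."""

    def __init__(self, embed_dim: int, num_heads: int, mlp_dim: int,
                 dropout: float = 0.1, **kwargs):
        super().__init__(**kwargs)
        self.attention = MultiHeadSelfAttention(embed_dim, num_heads)
        self.mlp = keras.Sequential([
            layers.Dense(mlp_dim, activation="relu"),   # expand
            layers.Dense(embed_dim),                     # project back
        ])
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop = layers.Dropout(dropout)

    def call(self, inputs, training=False):
        attended, weights = self.attention(inputs)
        x = self.norm1(inputs + self.drop(attended, training=training))
        x = self.norm2(x + self.drop(self.mlp(x), training=training))
        return x, weights
```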

Encoding of the character ‘1’ in position 0 and position 1. Image by author.
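The figure above shows why position matters: without extra information, the character ‘1’ would be represented identically at position 0 and at position 1. The post doesn't reproduce the PositionEncoder code; one common way to realize it, sketched here purely as an assumption, is a fixed sinusoidal signal added to each projected vector:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class PositionEncoder(layers.Layer):
    """Adds a fixed sinusoidal position signal to every embedding (assumed scheme)."""

    def __init__(self, max_len: int, embed_dim: int, **kwargs):
        super().__init__(**kwargs)
        positions = np.arange(max_len)[:, None]                       # (max_len, 1)
        dims = np.arange(embed_dim)[None, :]                          # (1, embed_dim)
        angles = positions / np.power(10000.0, (2 * (dims // 2)) / embed_dim)
        encoding = np.zeros((max_len, embed_dim))
        encoding[:, 0::2] = np.sin(angles[:, 0::2])
        encoding[:, 1::2] = np.cos(angles[:, 1::2])
        self.encoding = tf.constant(encoding[None, ...], dtype=tf.float32)

    def call(self, inputs):
        seq_len = tf.shape(inputs)[1]
        return inputs + self.encoding[:, :seq_len, :]
```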

The model

  1. First, we project the one-hot-encoded vectors into a space of the dimension we want.
  2. Next, we add positional information to each vector using the PositionEncoder layer implemented above.
  3. We stack a few TransformerBlock layers, which contain the attention module.
  4. Finally, we combine the result and project it to the desired output size. (A sketch of the full model follows below.)
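A sketch of the full model built from these four steps, using the layers sketched above (the sequence lengths and layer sizes are assumptions consistent with the encoded examples: 9-character questions, 6-character answers, a 13-symbol vocabulary):

```python
# Uses keras, layers, PositionEncoder and TransformerBlock from the sketches above.
def build_model(question_len=9, answer_len=6, vocab_size=13,
                embed_dim=64, num_heads=4, mlp_dim=128, num_blocks=2):
    inputs = keras.Input(shape=(question_len, vocab_size))
    x = layers.Dense(embed_dim)(inputs)                   # 1. project one-hot vectors
    x = PositionEncoder(question_len, embed_dim)(x)       # 2. add position information
    for _ in range(num_blocks):                           # 3. stack Transformer blocks
        x, _ = TransformerBlock(embed_dim, num_heads, mlp_dim)(x)
    x = layers.Flatten()(x)                               # 4. combine to the output size
    x = layers.Dense(answer_len * vocab_size)(x)
    outputs = layers.Reshape((answer_len, vocab_size))(x)
    outputs = layers.Softmax(axis=-1)(outputs)
    return keras.Model(inputs, outputs)

model = build_model()
model.summary()
```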

Train and test
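The training cell isn't reproduced here; the figures below would come from a standard compile/fit/evaluate loop along these lines (the optimizer, number of epochs and batch size are assumptions):

```python
# x_train/y_train and x_test/y_test are the one-hot encoded questions and answers
# built as in the dataset section above.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=50, batch_size=128)

train_acc = model.evaluate(x_train, y_train, verbose=0)[1]
test_acc = model.evaluate(x_test, y_test, verbose=0)[1]
print(f"PREDICTION ACCURACY (%):\nTrain: {100 * train_acc:.3f}, Test: {100 * test_acc:.3f}")
```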

PREDICTION ACCURACY (%):
Train: 99.719, Test: 99.609
Ground-truth: 6952-8937 =-1985
Prediction: 6952-8937 =-1985

Ground-truth: 7137-1240 =05897
Prediction: 7137-1240 =05897

Ground-truth: 2033-2351 =-0318
Prediction: 2033-2351 =-0318
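To print comparisons like the ones above, the one-hot (or softmax) rows have to be mapped back to characters. A small helper along these lines (the names are mine) reproduces that output:

```python
# Uses np, VOCAB, model and the test arrays from the sketches above.
def decode(one_hot_matrix) -> str:
    """Map a (seq_len, 13) matrix of one-hot or softmax rows back to text."""
    return "".join(VOCAB[i] for i in np.argmax(one_hot_matrix, axis=-1))

preds = model.predict(x_test[:3])
for question, target, pred in zip(x_test[:3], y_test[:3], preds):
    print(f"Ground-truth: {decode(question)} {decode(target)}")
    print(f"Prediction: {decode(question)} {decode(pred)}\n")
```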

Self-Attention visualization

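The original post shows the attention maps as images. Since the attention layer sketched earlier returns its weight matrix, a heatmap for one example can be drawn roughly like this (assuming the layer order of build_model above and averaging over heads):

```python
import matplotlib.pyplot as plt

example = x_test[:1].astype("float32")                 # one encoded question, (1, 9, 13)
question = decode(example[0])

# Re-run the first block by hand to get its attention weights
# (assumes build_model's layer order: Input, Dense, PositionEncoder, block, ...).
embedded = model.layers[1](example)                    # Dense projection
embedded = model.layers[2](embedded)                   # PositionEncoder
_, weights = model.layers[3](embedded)                 # first TransformerBlock

avg_weights = tf.reduce_mean(weights, axis=1)[0]       # average over heads -> (9, 9)

plt.imshow(avg_weights, cmap="viridis")
plt.xticks(range(len(question)), list(question))
plt.yticks(range(len(question)), list(question))
plt.colorbar()
plt.title("Self-attention weights (first block, head average)")
plt.show()
```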

DataScienceUB

Data Science and Machine Learning Lab at the Universitat de Barcelona. https://datascience.ub.edu/research