What Is A Transformer Model And How Does It Work?

August 7, 2023

In the language industry, transformer models are driving innovation forward. 

With the availability of cloud storage and big data, machine learning has become far more accurate at language generation and translation. It has turbocharged language technology across IT, healthcare, e-commerce, and automotive GPS systems.

It can be unnerving how quickly machines have become better at interpreting and analyzing human words and sentiments. One model behind this progress is the transformer, which has revolutionized the tech industry.

The transformer model is a type of artificial neural network that translates and generates text, automating the delivery of critical information and data.

Transformers process all the tokens in a text sequence together, unlike earlier neural networks such as recurrent neural networks (RNNs), gated RNNs, and long short-term memory networks (LSTMs), which read one token at a time. This ability is derived from an underlying “attention mechanism” that prompts the model to attend to the important parts of the input statement and leverage that data to generate a response.

Transformer models have outpaced older architectures in machine learning and have become prominent in solving language translation problems. The original transformer architecture has formed the basis of AI text generators and language models like ChatGPT, GPT-2, bidirectional encoder representations from transformers (BERT), and MegaMolBART.

A transformer can be monolingual or multilingual, depending on the data it is trained on. It analyzes text by tracking the position of every word in the sequence. All the words in the sequence are processed at once, and relationships are established between words to determine the output sentence. For this reason, transformers are highly parallelizable and train efficiently on hardware like GPUs.

Types of transformer models

The architecture of a transformer depends on the task you train it for, the size of the training dataset, and the vector dimensions of word sequences. Mathematical attributes of input and pre-trained data need to be factored in before adopting a specific architecture for your business use case.

  • Encoder-only architecture uses a stack of encoder layers that read all the input tokens at once and produce a contextual representation of each one. Examples are BERT and RoBERTa.
  • An encoder-decoder model uses both stacks: the encoder reads the input sequence, and the decoder generates the output sequence, such as a translation. Examples are T5 and the original transformer.
  • Decoder-only architecture sees the input fed as a prompt to the model without recurrence. The output depends on which next words the model selects based on the previous words. Examples are OpenAI’s GPT and GPT-2.
  • Bidirectional and Auto-Regressive Transformer, or BART, is an encoder-decoder model designed to pre-train sequence-to-sequence models. It pairs a bidirectional encoder with an autoregressive decoder and is trained to reconstruct the original text from a corrupted version.

How do transformers work?

Mainly used for language translation and text summarization, transformers can scan words and sentences with a clever eye. Artificial neural networks first made their name by solving critical problems in computer vision, like object detection. Transformers brought the same intelligence to language translation and generation.

The main functional layer of a transformer is an attention mechanism. When you enter an input, the model attends to the most important parts of the input and studies them in context. A transformer can look back across a long input sequence to reach its first part or first word and produce contextual output.

The entire mechanism is spread across two major components: the encoder and the decoder. Some models, like BERT, use only a pre-trained encoder.

A full-stacked transformer architecture, as described in the original paper, contains six encoder layers and six decoder layers.

Each sublayer of this transformer architecture is designated to treat data in a specific way for accurate results. Let’s break down these sub-layers in detail.

What is an encoder?

The original transformer has six encoder layers and six decoder layers. The job of an encoder is to convert a text sequence into continuous numeric vectors and judge which words have the most influence over one another.

The encoder layer of a transformer network converts the textual input into numerical tokens. These tokens form a state vector that helps the model understand the input better. First, the tokens undergo input embedding.

1. Input Embedding

The input embedding or the word embedding layer breaks the input sequence into tokens and assigns a continuous vector representation to every token.

For example, if you are trying to translate “How are you” into German, each word of this sentence will be assigned a vector of numbers. You can think of this layer as a “VLOOKUP” table of learned information.
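
The lookup-table idea can be sketched in a few lines of Python. This is only an illustration: the three-word vocabulary, the 4-dimensional vectors, and the function name embed are assumptions made for the example, and the table holds random numbers where a trained model would hold learned values.

```python
import random

# Hypothetical toy vocabulary; real tokenizers hold tens of thousands of entries.
vocab = {"how": 0, "are": 1, "you": 2}
d_model = 4  # assumed tiny embedding size, chosen for readability

random.seed(0)
# One row of d_model numbers per token; random here, learned in a trained model.
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def embed(sentence):
    """Map each lowercase word to its row in the embedding table."""
    token_ids = [vocab[word] for word in sentence.lower().split()]
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed("How are you")
print(len(vectors), len(vectors[0]))  # 3 tokens, 4 numbers each
```

Each word always maps to the same row, which is what makes the layer behave like a lookup table.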

2. Positional encoding

Next comes positional encoding. As transformer models have no recurrence, unlike recurrent neural networks, the model needs information about each word’s location within the input sequence.

Researchers at Google came up with a clever way to use sine and cosine functions to create positional encodings. Sine is used for the even dimensions of each position’s vector, and cosine is used for the odd dimensions.

Below is the formula that gives us positional information of every word at every time step in a sentence.

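Positional encoding formula:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

PE → Positional encoding

pos → Position of the word in the sequence (the time step)

i → Index of the dimension pair within the encoding vector

d_model → Total vector dimension of each token embedding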

These positional encodings are added to the input embeddings so the neural network knows where each word sits in the sequence. The combined vectors are passed on to the “attention” layer of the neural network.
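
The sine and cosine formula can be tried out with a short Python sketch. The function name positional_encoding and the tiny sizes are assumptions for illustration. The code follows the formulation in the original transformer paper, where even dimensions of each position’s vector use sine and odd dimensions use cosine.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the same exponent 2i / d_model.
            exponent = (2 * (dim // 2)) / d_model
            angle = pos / (10000 ** exponent)
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # position 0 gives sin(0) = 0.0 and cos(0) = 1.0
```

Every position gets a distinct pattern of values, so the model can tell the first word from the third even though all words are processed at once.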

3. Multi-headed attention and self-attention

The multi-headed attention mechanism is one of the two most important sublayers of a transformer neural network. It employs a “self-attention” technique to understand and register the pattern of the words and their influence on each other.


Again taking the earlier example, “How are you” becomes “Wie geht es dir” in German. For a model to associate “how” with “wie,” “are” with “geht,” and “you” with “dir,” it needs to assign a proper weight to each English word and find its German counterpart, even though the German sentence contains an extra word, “es.” Models also need to understand that sequences styled in this way are questions and that there is a difference in tone. This sentence is casual, whereas “Wie geht es Ihnen” would have been more respectful.

The input sequence is broken down into query, key, and value and projected onto the attention layer.

The concept of query, key, and value in multi-head attention

Word vectors are linearly projected into the next layer, the multi-head attention. Each head in this mechanism projects every word vector into three vectors: query, key, and value. This is the computational core of attention, where all the important operations are performed on the text sequence.
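
As a rough sketch of that projection step, the snippet below multiplies one toy word vector by three small weight matrices to get its query, key, and value. The matrices w_query, w_key, and w_value and the helper project are hypothetical. A trained model learns these weights, with one set per attention head.

```python
def project(vector, weights):
    """Multiply a token vector by a weight matrix (rows = input dimensions)."""
    return [
        sum(vector[i] * weights[i][j] for i in range(len(vector)))
        for j in range(len(weights[0]))
    ]

# Assumed 2x2 weight matrices, picked by hand so the results are easy to follow.
w_query = [[1.0, 0.0], [0.0, 1.0]]
w_key = [[0.0, 1.0], [1.0, 0.0]]
w_value = [[2.0, 0.0], [0.0, 2.0]]

token_vector = [0.5, -1.0]  # toy embedding for one word
query = project(token_vector, w_query)
key = project(token_vector, w_key)
value = project(token_vector, w_value)
print(query, key, value)
```

The same word vector yields three different views of the word, each used for a different role in the attention calculation.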

Did you know? The total vector dimension of the BERT base model is 768. The original transformer converts input into vector embeddings of dimension 512.

Query and key undergo a dot product matrix multiplication to produce a score matrix. The score matrix contains the weights distributed to each word as per its influence on the other words in the input.

However, large dot products can push the softmax function into regions where gradients become extremely small. To stabilize training, the score matrix is divided by the square root of the dimension of the queries and keys.

After the softmax step, the weighted attention matrix is multiplied with the “value” matrix to produce an output sequence. The output values capture the placement of subjects and verbs, the flow of logic, and output arrangements.

4. Softmax layer

The softmax layer receives the attention scores and compresses them into values between 0 and 1 that sum to 1. This gives the machine learning model a more focused representation of how much each word matters to every other word in the input text sequence.

In the softmax layer, the higher scores are elevated, and the lower scores get depressed. The resulting attention weights are multiplied with the value matrix [V] to produce an output vector for each word. Words with large weights are retained in the output. Words whose weights tend towards zero are drowned out.
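
Putting the score, scaling, softmax, and value steps together gives what the original paper calls scaled dot-product attention. The following is a minimal pure-Python sketch under simplifying assumptions: one attention head, no learned projections, and hand-picked toy vectors. The function names softmax and attention are chosen for this example.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights between 0 and 1 that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Score the query against every key, then scale by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted sum of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
    return outputs

# Toy vectors for three tokens (assumed values, not from a trained model).
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(q, k, v)
print(result)
```

Because the weights for each word sum to 1, every output vector is a blend of the value vectors, tilted toward the words that scored highest.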

5. Residual and layer normalization

The output vectors produced by the attention heads are concatenated and linearly projected to create one single resultant matrix of abstract representations that define the text in the best way. The residual connection adds the sublayer’s input back to its output and passes the sum on to the normalization layer. The normalization layer stabilizes the gradients, enabling faster training and better prediction power.

The residual connection ensures that information from earlier layers is not lost as the network gets deeper, so predictive power is bolstered and the text is understood in its entirety.

Tip: The output of each sublayer is LayerNorm(x + Sublayer(x)), where x is the sublayer’s input and Sublayer(x) is the function implemented by the sublayer itself, such as attention or the feedforward network.
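
The residual-plus-normalization step can be sketched as follows. This is a simplified illustration: it normalizes a single token vector, leaves out the learned scale and shift parameters that real layer normalization usually includes, and uses a made-up function called double as a stand-in for an attention or feedforward sublayer.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and roughly unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for one token vector."""
    sublayer_output = sublayer(x)
    residual_sum = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(residual_sum)

def double(x):
    """Hypothetical stand-in for an attention or feedforward sublayer."""
    return [2.0 * value for value in x]

normalized = add_and_norm([1.0, 2.0, 3.0, 4.0], double)
print(normalized)
```

The sublayer’s input is added back before normalizing, so the original signal still reaches the next layer even if the sublayer contributes little.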

6. Feedforward neural network

The feedforward layer receives the output vectors from the attention sublayer. It contains a series of neurons that take in these vectors and then process and transform them, one position at a time. It applies a linear transformation, a ReLU activation function, and a second linear transformation. ReLU also helps reduce the “vanishing gradients” problem during training.

This gives the output a richer representation and increases the network’s predictive power. Once the output matrix is created, the encoder layer passes the information to the decoder layer.
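
A position-wise feedforward network of this kind can be sketched as two linear layers with a ReLU in between, the form used in the original transformer paper. The weights below are assumed toy values chosen so the arithmetic is easy to check by hand, and the helper names are illustrative.

```python
def relu(vector):
    """Replace negative values with zero."""
    return [max(0.0, value) for value in vector]

def linear(vector, weights, bias):
    """Compute vector x weights + bias, with one weights row per input dimension."""
    return [
        sum(vector[i] * weights[i][j] for i in range(len(vector))) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(x, w1, b1, w2, b2):
    """Expand the vector, apply ReLU, then project back to the original size."""
    hidden = relu(linear(x, w1, b1))
    return linear(hidden, w2, b2)

# Assumed toy weights: expand 2 -> 3 dimensions, then project 3 -> 2.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, -0.5]

output = feed_forward([1.0, 1.0], w1, b1, w2, b2)
print(output)  # [1.5, 0.5]
```

In this example the third hidden value is negative, so ReLU zeroes it and it contributes nothing to the output.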

Did you know? The concept of attention was first introduced in recurrent neural network and long short-term memory (LSTM) translation models to help the decoder focus on the most relevant input words. Even though they were able to produce accurate words, they couldn’t process a sequence in parallel, regardless of the computational resources available.

Benefits of encoders

Some companies already utilize a stacked version of the transformer’s encoder to solve their language problems. Trained on huge language datasets, encoders work phenomenally well in language translation, question answering, and fill-in-the-blanks.

Besides language translation, encoders work well in text summarization. Companies like AstraZeneca apply transformer architectures to molecular AI, for example to study proteins and their amino acid sequences. It is used to study how enzymes such as trypsin, pepsin, and amylase affect the human immune system.

Other benefits include:

  • Masked language modeling: Encoders can derive context from the words on both sides of a gap in a sentence to identify missing words. Gated RNNs and LSTMs have a shorter reference window, which prevents them from looking far back and learning the importance of certain words. Encoders instead use self-attention across the whole sequence to understand words and produce output.
  • Bidirectional: Not only does the encoder derive meaning from the current word, it also attends to all the other words and their contextual meaning in relation to the current word. This gives encoders an edge over standard RNNs and LSTMs, which read a sequence one word at a time in a single direction.
  • Sequence classification: Encoders can process sequence transduction, sequence-to-sequence, word-to-sequence, and sequence-to-word problems. They map the input sequence to a numerical representation to classify the output.
  • Sentiment analysis: Encoders are great for sentiment analysis, as they can encode the emotion from the input text and classify it as positive, negative or neutral. 

As the encoder processes and computes its share of input, all the learned information is then passed to the decoder for further analysis.

What is a decoder?

The decoder architecture contains the same number of sublayer operations as the encoder, with a slight difference in the attention mechanism. Decoders are autoregressive, which means they only look at previous word tokens and previous output to generate the next word.

Let's look at the steps a decoder goes through.

  • Positional embeddings: The decoder takes the previous output tokens and converts them into abstract numeric representations with positional information. However, this time, it only converts words up to time step t-1, with t being the current word.
  • Masked multi-head attention 1: To further prevent decoders from processing future tokens, the input undergoes the first layer of masked attention. In this layer, attention scores for the decoder are calculated and combined with a mask matrix that contains 0 for current and previous tokens and negative infinity for future tokens.
  • Softmax layer: After masking, the output gets passed on to the softmax layer, which scales the scores into weights between 0 and 1. All the parts of the matrix that belong to future words are zeroed out. The mask matrix is structured in such a way that negative infinities are applied only to future tokens, which are nullified by the softmax layer.
  • Multi-head attention 2: In the second attention layer, the values and keys of the encoder output are compared with the decoder’s queries to get the best output path. This layer is not masked, because the decoder may look at the whole input sentence.
  • Feedforward neural network: After the attention layers, a feedforward network with a residual connection further transforms each token’s representation.
  • Linear classifier: The last linear layer, followed by a softmax, scores every word in the vocabulary and predicts the output word by word.

While generating text, the decoder produces one token at a time, and every new token requires another pass through the decoder. This adds GPU consumption and memory stress compared with the encoder, which processes the whole input in a single pass.
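
The masking step is easier to see with numbers. A common way to implement it, sketched here under simplifying assumptions, is to add a mask holding 0 for allowed positions and negative infinity for future positions, then apply softmax. The function names and the raw scores below are made up for illustration.

```python
import math

def causal_mask(size):
    """Build a size x size mask: 0.0 for allowed positions, -inf for future ones."""
    return [
        [0.0 if col <= row else -math.inf for col in range(size)]
        for row in range(size)
    ]

def masked_softmax(scores, mask_row):
    """Add the mask to the scores, then apply softmax."""
    masked = [s + m for s, m in zip(scores, mask_row)]
    highest = max(masked)
    exps = [math.exp(s - highest) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Assumed raw attention scores for the second token of a three-token output.
weights = masked_softmax([0.2, 0.9, 1.5], mask[1])
print(weights)  # the third weight is 0.0 because that token lies in the future
```

Even though the future token had the highest raw score, its weight ends up at exactly zero, so it cannot influence the prediction.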

Benefits of decoders

Unlike encoders, decoders do not traverse the left and right parts of sentences while analyzing the output sequence. Decoders handle the encoder output and the previous decoder output and then weigh the attention parameters to generate the final output. For all the future words in the sentence, the decoder adds a mask so that their attention weight reduces to zero.

  • Unidirectional: Decoders look backward, to the left of the word at the current time step. They are unidirectional and don’t have anything to do with future words. For example, while changing “How are you” into “I am fine,” the decoder uses masked self-attention to cancel out the words that come later, so the word “am” would only have access to itself and the word before it, “I.”
  • Excellent text generation and translation: Decoders can create text sequences from a query or a sentence. Generative pre-trained transformers like OpenAI’s GPT-2 and EleutherAI’s GPT-Neo are based on decoder mechanisms that use input text to predict the next most probable word.
  • Causal language modeling: Decoders can tokenize plain textual datasets and predict the next or missing words. They derive context from the already existing tokens on the left and use that probability distribution to hypothesize the next sensible word in a sentence.
  • Natural language generation (NLG): Decoder mechanisms are used in NLG models to build dialogue-based narratives on an input dataset. Microsoft’s Turing-NLG is an example of a decoder transformer. It is being used to develop dialogue-based conversational abilities in humanoids like Sophia.

Although decoders are used for building AI text generators and sequence transduction, their unidirectional way of interpreting words means they cannot use the context to the right of a word, which can cost accuracy on tasks that need the whole sentence.

What is causal language modeling?

Causal language modeling is an AI technique that predicts the next token in a sequence. The model attends only to the tokens on the left, while the tokens on the right stay masked. This technique is mainly used in natural language generation or language translation.
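
Next-token prediction can be illustrated with a toy generation loop. The probability table below is entirely hypothetical, and it is simplified so that each prediction depends only on the last token. A real decoder computes these probabilities with attention over all the tokens to its left.

```python
# Hypothetical next-token probabilities, hand-written for this example.
next_token_probs = {
    "<start>": {"i": 0.7, "you": 0.3},
    "i": {"am": 0.8, "was": 0.2},
    "am": {"fine": 0.6, "here": 0.4},
    "fine": {"<end>": 1.0},
}

def generate(max_tokens=10):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        candidates = next_token_probs[tokens[-1]]
        best = max(candidates, key=candidates.get)
        if best == "<end>":
            break
        tokens.append(best)
    return tokens[1:]

print(generate())  # ['i', 'am', 'fine']
```

Each new word is chosen using only what has already been generated, which is the defining property of an autoregressive model.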


What is a self-attention mechanism?

A self-attention mechanism is a technique that lets a neural network relate each token in a sequence to every other token in the same sequence. It draws global dependencies between the input and the output of a transformer model.

For example, consider these two sentences:

“No need to bear the brunt of your failures.”

“I think I saw a polar bear rolling in the snow.”

A simple neural network like an RNN or LSTM may struggle to differentiate between the two uses of “bear” and might translate them in the same way. It takes proper attention to understand how the word “bear” relates to the rest of the sentence. For instance, the words “brunt” and “failures” can help a model understand the contextual meaning of the word “bear” in the first sentence. The phenomenon of a model “attending to” certain words in the input to build correlations is called “self-attention.”

This concept was brought to life in the 2017 paper Attention Is All You Need, written by Ashish Vaswani and seven co-authors from Google and the University of Toronto. The introduction of attention made sequence transduction simpler and faster.

A classic example sentence from earlier attention research on English-to-French translation is:

The agreement on the European economic area was signed in August 1992. 

In French, word order differs from English and cannot be copied word for word. The attention mechanism allows the text model to look at every word in the input while delivering its output counterparts. Self-attention helps the model preserve the meaning and structure of the input sentence in the output.

While converting the above sentence, the text model looks at “economic” and “European” to pick out the correct French word, “européenne.” Also, the model understands that the word “européenne” needs to be feminine to agree with “la zone.”

RNNs vs. LSTMs vs. Transformers

The gaps and inconsistencies in RNNs and LSTMs led to the invention of transformer neural networks. With transformers, you can relate any two words in a sequence directly, no matter how far apart they are, and process the whole sequence in parallel.

Recurrent neural networks, or RNNs, work on a word-by-word basis. The neural network serves as a queue where each word of the input is processed in turn, and a hidden state carries information from one word to the next before it reaches the decoder.

The model works successfully on shorter sentences, but it fails drastically when the sentence becomes too long or information-heavy, because information from early words fades.

Long short-term memory (LSTM) models tried to eliminate the problem with RNNs by implementing a cell state. The cell state retains information from the input and tries to map it in the decoding layer of the model. Gates perform small multiplications on the cell state to eliminate irrelevant values, which gives the model a longer memory window.

Transformers use a stacked encoder-decoder architecture to form the best representation of the input. They enable the decoder to recall which numeric representations were used in the input through query, key, and value. Further, the attention mechanism draws inferences from previous words to logically place words in the final sentence.

Future of transformers

In the future, transformers are expected to be trained with hundreds of billions or trillions of parameters to make language generation more accurate. They’ll use concepts like sparsity and mixture of experts to run larger models more efficiently, and researchers hope that better training will reduce the hallucination rate. Future transformers will work on an even more refined form of attention technique.

Some transformers like BLOOM and GPT-4 are already being used globally. You can find them in intelligence bureaus, forensics, and healthcare. Advanced transformers are trained on a slew of data and industrial-scale computational resources. Slowly and gradually, transformers will change how every major industry functions and build resources intrinsic to human survival.

A transformer also parallelizes well, which means you can run the operations on an entire input sequence in parallel across more data and GPUs.

Transformer models: Frequently asked questions (FAQs)

What is dependency?

Long-term or short-term dependencies describe how well a neural network remembers an earlier part of the input and can recall it when processing a later part. Neural networks like transformers build global dependencies between all positions in the data, so any word can be related to any other. A transformer relies entirely on an attention mechanism to draw dependencies from an input dataset through numbers.

What is a time step?

A time step is a single position in a sequence that a model processes, such as one word in a sentence. Numbering the time steps lets the model assign a specific position to each word of the text sequence.

What is an autoregressive model?

Autoregressive or unidirectional models forecast future variables based on previous variables only. This works when there’s a correlation between the preceding step and the succeeding step in a time series. They don’t take anything into consideration except the left-side values in a sentence and their calculated outputs to predict the next word.

What is the best transformer model?

Some of the best-known transformer models are BERT, GPT-3, DistilBERT, ClinicalBERT, RoBERTa, T5 (text-to-text transfer transformer), Google MUM, and MegaMolBART by AstraZeneca.

Which transformer is the largest size?

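Megatron-LM, an 8.3 billion parameter large language model, was the biggest transformer when Nvidia introduced it in 2019. It was trained on 512 GPUs (Nvidia’s Tesla V100) using 8-way model parallelism. Larger models, such as GPT-3 with 175 billion parameters, have since surpassed it.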

Where are transformer models used?

Transformer models are used for critical tasks like developing antidotes, discovering drugs, building language intermediates, powering multilingual AI chatbots, and processing audio.

“Attention” is the need of the hour

Neural network algorithms are cutting through the traffic of traditional ways of computing data. With the advent of CNNs and transformer models, the needle has been moving toward AI to a considerable extent. It’s only a matter of time before transformers will utilize renewable energy sources to generate bouts of relevant data bringing amazing outcomes for all of us.

The era of artificial intelligence has officially begun. Learn how generative AI is helping professionals make smarter decisions and save more money.

Shreya Mattoo is a Content Marketing Specialist at G2. She completed her Bachelor's in Computer Applications and is now pursuing Master's in Strategy and Leadership from Deakin University. She also holds an Advance Diploma in Business Analytics from NSDC. Her expertise lies in developing content around Augmented Reality, Virtual Reality, Artificial intelligence, Machine Learning, Peer Review Code, and Development Software. She wants to spread awareness for self-assist technologies in the tech community. When not working, she is either jamming out to rock music, reading crime fiction, or channeling her inner chef in the kitchen. https://www.linkedin.com/in/shreya-mattoo-a20674170/