
Attention Is All You Need and Why It Changed the Way Machines Learn

Before 2017, models like RNNs and LSTMs processed text one word at a time, which meant:

  • Slow training
  • Poor long-term memory
  • No parallel processing

Everything changed with Google's paper “Attention Is All You Need”. Instead of reading sequentially, the model looks at all words at once and focuses on what matters most using attention. This idea led to the Transformer, the foundation of modern models like GPT and BERT. In this article, we’ll quickly walk through the paper and see how Transformers work.


Transformer

The Transformer model is mainly used to understand and generate human language. Older models such as RNNs and LSTMs process sentences word by word. The Transformer instead looks at the full sentence at once using attention. This helps the model understand context better and also makes training faster.

Transformer Model Architecture

Transformer has two parts:

  • Encoder -> understands the input sentence
  • Decoder -> generates the output sentence, one new word at a time

Positional Encoding

Transformers do not naturally understand the order of words. To solve this problem, positional encoding is added to the word embeddings. This gives the model information about the position of each word in the sentence, so it knows which word comes first and which comes next.


Why are sine and cosine used?

  • Unique for every position
  • Generalizes to long sentences
  • Easy for model to learn patterns
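Those three properties can be seen directly in code. Below is a minimal NumPy sketch of the sinusoidal encoding (the dimensions are made up for the example; this is an illustration, not the paper's implementation):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]                 # word positions, shape (seq_len, 1)
    i = np.arange(d_model)[None, :]                   # embedding dimensions, shape (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): one encoding vector added to each word embedding
```

Each position gets a unique vector, and because sine and cosine are smooth and periodic, the pattern extends naturally to sentences longer than any seen during training.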

Encoder

The encoder is the part of the Transformer that understands the input. It reads the entire sentence at the same time and learns how words are related to each other. Each word can look at all other words in the sentence and decide which ones are important.

Each encoder layer contains:

  • Self-Attention
  • Feed Forward Neural Network
  • Residual Connection + Layer Normalization

Self-Attention

Self-attention allows words within the same sentence to interact with each other. The "self" part means that the queries (Q), keys (K), and values (V) all come from the same sentence.


  • Words understand context
  • Long-range dependencies are easy
  • No memory loss like RNNs
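In code, self-attention boils down to comparing every word's query against every word's key, then mixing the values by those similarities. Here is a minimal NumPy sketch, with random matrices standing in for the learned projection weights:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K and V all come from the same input X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how strongly each word matches every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V                                # each word becomes a weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 "words", embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Every output row depends on every input row at once, which is why long-range dependencies are easy and nothing is "forgotten" the way it is in an RNN.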

Multi-Head Attention

Multiple attention heads are used at the same time. Each head focuses on different aspects of the sentence, such as grammar, meaning, or word distance.
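A toy way to see this is to split the embedding into slices and run attention independently in each slice. Real Transformers use separate learned Q/K/V projections per head, which are omitted here to keep the sketch short:

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Toy multi-head attention: split the embedding into slices, attend in each,
    then concatenate the results. (Learned per-head projections omitted.)"""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        Xh = X[:, h * d_head:(h + 1) * d_head]    # this head's slice of the embedding
        scores = Xh @ Xh.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        heads.append(w @ Xh)                      # each head attends independently
    return np.concatenate(heads, axis=-1)         # concatenated head outputs

X = np.random.default_rng(1).normal(size=(4, 8))
out = multi_head_attention(X, num_heads=2)
print(out.shape)  # (4, 8)
```

Because each head works on its own subspace, one head is free to track grammar while another tracks meaning or word distance.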



Decoder

The decoder is responsible for generating the output. It produces the sentence one word at a time. When predicting a word, the decoder is not allowed to see future words; this is enforced with masked attention.

Each decoder layer has:

  • Masked Self-Attention
  • Encoder-Decoder Attention
  • Feed Forward Network
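The masking step can be shown directly: scores for future positions are set to negative infinity before the softmax, so they receive exactly zero weight. A small NumPy sketch (with made-up uniform scores):

```python
import numpy as np

def causal_mask(scores):
    """Masked self-attention: set scores for future positions to -inf so the
    softmax assigns them zero weight."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    return np.where(future, -np.inf, scores)

scores = np.zeros((4, 4))                # equal raw scores for a 4-word sentence
masked = causal_mask(scores)
w = np.exp(masked)                       # exp(-inf) = 0
w /= w.sum(axis=-1, keepdims=True)
print(np.round(w, 2))                    # row i spreads weight only over words 0..i
```

Word 1 can only look at itself, word 2 at words 1 and 2, and so on, which is what lets the decoder generate text left to right without cheating.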



Softmax

Softmax is used to convert the model’s output scores into probabilities and choose the most likely next word.


  • It converts scores into probabilities
  • It chooses the most likely next word
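Both steps fit in a few lines. The vocabulary and scores below are made up purely for illustration:

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical scores over a tiny 3-word vocabulary.
vocab = ["cat", "dog", "safely"]
scores = np.array([1.0, 2.0, 4.0])
probs = softmax(scores)
print(vocab[int(np.argmax(probs))])      # the highest-scoring word wins: "safely"
```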

Training

The Transformer model is trained on large datasets using cross-entropy loss and backpropagation. Since the architecture supports parallel processing, training is much faster compared to older models like RNNs and LSTMs.

  • Uses Cross-Entropy Loss
  • Trained on large datasets
  • Parallel processing makes it very fast
  • Backpropagation updates attention weights
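Cross-entropy loss itself is simple to state: it is the negative log of the probability the model assigned to the correct next word. A minimal sketch, with an invented probability distribution:

```python
import numpy as np

def cross_entropy(probs, target_index):
    """Cross-entropy loss: negative log-probability the model gave to the
    correct next word. Lower is better."""
    return -np.log(probs[target_index])

probs = np.array([0.1, 0.2, 0.7])        # model's predicted distribution over 3 words
print(cross_entropy(probs, 2))           # correct word got 0.7: small loss
print(cross_entropy(probs, 0))           # correct word got 0.1: large loss
```

Backpropagation pushes the weights in the direction that lowers this loss, which gradually raises the probability of the correct word.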

Example (Translator)

Input: செயற்கை நுண்ணறிவை பாதுகாப்பாக பயன்படுத்தவும் (Tamil)

Steps:

  1. Encoder understands word relationships
  2. Decoder attends to relevant encoder outputs
  3. Output: Use artificial intelligence safely

Why are Transformers Important?

Transformers handle long sentences well, train quickly, and scale well. They are used in modern AI systems such as ChatGPT, Google Translate, and BERT.


Final Thought

The Transformer model showed that attention alone is enough to understand language. For students learning AI, Machine Learning, or NLP, understanding this model is very important because it forms the foundation of most modern language models today.


References

[1] Vaswani, A., et al. (2017). Attention Is All You Need. https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

[2] Jay Alammar. The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/

[3] GeeksforGeeks. Transformer Attention Mechanism. https://www.geeksforgeeks.org/transformer-attention-mechanism/

