If you work in AI — or even adjacent to it — this is one of those papers you should read at least once. Published in 2017 by Vaswani et al. at Google, “Attention Is All You Need” introduced the transformer architecture. It’s the foundation of basically every major language model since: BERT, GPT, PaLM, LLaMA, and everything that followed. The paper was originally about machine translation, but the architecture turned out to be one of the most general-purpose tools in the history of deep learning.
I wanted to write up my notes because even though transformers are everywhere now, it’s worth going back to the source and understanding why this particular design won out.
The Problem with RNNs
Before transformers, the dominant approach for sequence tasks (translation, summarization, language modeling) was recurrent neural networks — RNNs and their variants like LSTMs and GRUs.
The fundamental issue with RNNs is that they process sequences one token at a time, in order. To understand the 50th word in a sentence, you first have to process words 1 through 49. This creates two major problems:
- You can’t parallelize training. Each step depends on the previous step’s output, so you’re stuck processing sequentially. GPUs are massively parallel processors — RNNs barely take advantage of them.
- Long-range dependencies are hard. Information from early in a sequence has to survive being passed through dozens or hundreds of steps. In practice, it degrades. LSTMs helped with this through gating mechanisms, but they didn’t fully solve it.
The result: RNNs were slow to train and struggled with long sequences. As people wanted to work with longer documents and bigger datasets, this was becoming a real bottleneck.
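The sequential dependence is visible in the computation itself. A toy RNN step loop (random untrained weights; the names here are mine, not from the paper):

```python
import numpy as np

# Why RNN training can't parallelize over time: each hidden state
# depends on the previous one, so this loop is inherently sequential.
rng = np.random.default_rng(0)
d = 8                                    # hidden/embedding size
Wx, Wh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
tokens = rng.normal(size=(50, d))        # a 50-step input sequence

h = np.zeros(d)
for x in tokens:                         # step t needs h from step t-1
    h = np.tanh(x @ Wx + h @ Wh)
print(h.shape)                           # (8,) after 50 sequential steps
```

Fifty matrix multiplies that must happen one after another, no matter how many GPU cores sit idle. Self-attention replaces this loop with operations over the whole sequence at once.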
The Core Idea: Attention Is All You Need
The paper’s central claim is right there in the title — you can get rid of recurrence entirely and rely solely on attention mechanisms. This was a bold move. Attention had been used alongside RNNs (Bahdanau attention, for example), but nobody had tried building an entire architecture out of just attention.
The key insight: you don’t need to process tokens in order to understand the relationships between them. Instead, let every token look at every other token in the sequence, all at once. The model learns which tokens are relevant to which, regardless of how far apart they are in the sequence.
This is self-attention — the mechanism at the heart of the transformer.
How It Works (Intuitively)
The best analogy I’ve seen for self-attention is information retrieval. Imagine you’re in a library:
- A query is the question you’re asking: “What’s relevant to understanding this word?”
- Keys are labels on each book (token) describing what information it contains.
- Values are the actual contents of the book — the information you get back.
For each token in a sequence, the model generates a query, a key, and a value. The query is compared against all keys to figure out which tokens are most relevant, and then the corresponding values are combined (weighted by relevance) to produce the output.
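That lookup is the paper's scaled dot-product attention. A minimal NumPy sketch, with random untrained matrices standing in for the learned query/key/value projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V: each output row is a
    relevance-weighted mix of the value rows."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # query-key similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # row-wise softmax
    return w @ V                                   # weighted sum of values

# 4 tokens with 8-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per token
```

The `sqrt(d)` scaling keeps the dot products from growing with dimension and saturating the softmax, a small detail the paper calls out explicitly.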
The clever part is multi-head attention. Instead of asking one question, the model asks several different questions in parallel — each “head” can focus on a different type of relationship. One head might track syntactic dependencies, another might track semantic similarity, another might focus on positional proximity. The outputs are concatenated and combined.
This is what lets transformers be so expressive. They’re not just finding one type of pattern — they’re finding many patterns simultaneously.
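A sketch of multi-head attention in the same toy setup: the model dimension is split across heads, each head runs its own attention over its slice, and the results are concatenated and mixed by an output projection. `Wq`, `Wk`, `Wv`, and `Wo` are random stand-ins for learned weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Each head attends over its own slice of the model dimension,
    so different heads can capture different relationships."""
    d_head = X.shape[-1] // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)    # this head's slice
        w = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo     # concat, then mix

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                       # 5 tokens, d_model=16
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=4).shape)  # (5, 16)
```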
Positional Encoding
Here’s the trade-off of dropping recurrence: the model has no inherent notion of order. The sentences “the cat sat on the mat” and “mat the on sat cat the” would look identical to pure self-attention, which just sees a bag of tokens.
The fix is positional encodings. The paper uses sinusoidal functions at different frequencies to generate a unique position signal for each spot in the sequence. These encodings are added to the token embeddings before they enter the model, giving it a sense of “this token is at position 5, that token is at position 12.”
It’s a simple solution, and it works. Later work explored learned positional embeddings and relative position encodings (like RoPE, which is now standard in most LLMs), but the core idea remains the same: you need to inject position information explicitly because the architecture doesn’t have it built in.
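The sinusoidal scheme is only a few lines. Each even dimension gets a sine, each odd dimension a cosine, at frequencies that fall off geometrically across the embedding:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """The paper's fixed positional encodings: sin/cos waves at
    geometrically spaced frequencies, giving every position a
    unique pattern across the embedding dimensions."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
# These get added element-wise to the token embeddings before layer 1.
print(pe.shape)  # (50, 16)
```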
Encoder-Decoder Structure
The original transformer has two halves:
- The encoder reads the full input sequence and produces a rich representation of it. It’s a stack of layers, each containing multi-head self-attention followed by a feedforward network.
- The decoder generates the output sequence one token at a time, attending both to the encoder’s output (cross-attention) and to the tokens it has already generated (masked self-attention — it can’t peek at future tokens).
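The masking in the decoder is simple to sketch: before the softmax, every future position gets a score of negative infinity, so it receives exactly zero weight. With uniform scores, the effect is easy to see (illustrative NumPy, not the full decoder):

```python
import numpy as np

def causal_softmax(scores):
    """Softmax with a causal mask: position i can attend only to
    positions 0..i. The -inf entries become zero after exp()."""
    seq = scores.shape[0]
    mask = np.triu(np.full((seq, seq), -np.inf), k=1)  # block the future
    s = scores + mask
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

w = causal_softmax(np.zeros((4, 4)))   # uniform scores, then masked
print(np.round(w, 2))
# Row i spreads weight only over positions 0..i:
# row 0 is [1, 0, 0, 0], row 3 is [0.25, 0.25, 0.25, 0.25]
```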
This was designed for translation: the encoder reads the source language, the decoder writes the target language. But what’s remarkable is how well the individual halves work on their own. BERT is essentially just the encoder. GPT is essentially just the decoder. The modular nature of the design turned out to be one of its greatest strengths.
Why It Changed Everything
The transformer didn’t just improve on RNNs — it unlocked an entirely different scaling paradigm:
- Parallelization. Self-attention processes all tokens simultaneously. Training became dramatically faster, and it actually uses GPU hardware the way it was designed to be used.
- Scaling. It turned out that making transformers bigger (more layers, more heads, more parameters) reliably made them better. This wasn’t obvious: scaling up RNNs didn’t reliably help. But transformers exhibited clean scaling laws, which gave researchers a clear path forward: get more data, get more compute, get better models.
- Generality. The architecture wasn’t specific to translation, or even to language. Transformers now power text generation (GPT), image recognition (ViT), protein structure prediction (AlphaFold), code generation, music generation, and more.
The paper’s eight authors probably didn’t predict that their translation architecture would become the backbone of systems that write code, generate images, fold proteins, and have open-ended conversations. But that’s what happened.
What I Found Interesting
A few things stand out to me re-reading this paper:
The simplicity of the core mechanism. Self-attention is conceptually straightforward — it’s a weighted combination of values based on query-key similarity. The power comes from stacking it, running it in parallel heads, and training it at scale. There’s no exotic math here. The architecture is almost suspiciously simple for how much it can do.
The title is a thesis statement. “Attention Is All You Need” isn’t just a catchy name — it’s a direct argument. The authors are saying: you can throw away recurrence, you can throw away convolutions, and attention alone is sufficient. That was a strong claim in 2017, and it turned out to be largely correct.
It’s a translation paper that launched the AI era. The paper benchmarks on WMT translation tasks. That’s it. There’s no language modeling, no chatbot, no code generation. But the architecture was so general that it became the universal substrate for AI. The gap between “good BLEU scores on English-to-German” and “powers every major AI system” is one of the most surprising capability jumps in computing history.
Open Questions
Even with transformers dominating, there are real open problems:
- Quadratic attention cost. Self-attention scales quadratically with sequence length: every token attends to every other token. This makes very long contexts expensive. There’s been a lot of work on efficient attention (sparse attention, linear attention, and IO-aware kernels like FlashAttention, which keeps attention exact but optimizes memory access), but the quadratic scaling itself remains a fundamental constraint.
- Is attention actually all you need? State-space models like Mamba offer a different approach — they can handle very long sequences more efficiently and don’t have the quadratic bottleneck. Hybrid architectures mixing attention with SSM layers are showing promise. The “all you need” in the title might turn out to be an overstatement in the long run.
- Context length vs. true understanding. Transformers can now handle very long contexts, but there’s an ongoing debate about whether attending to 100K+ tokens actually means the model uses all that information effectively. Retrieval-augmented approaches might complement or partially replace brute-force long context.
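To make the first point concrete: self-attention materializes an n-by-n score matrix, so memory for the scores alone grows like this (back-of-envelope arithmetic, fp32, a single head and layer):

```python
# Attention scores form an n x n matrix; fp32 entries are 4 bytes each.
for n in (1_000, 10_000, 100_000):
    entries = n * n
    print(f"n={n:>7,}: {entries:>15,} scores, {entries * 4 / 1e9:6.1f} GB")
```

Multiply by heads and layers and the cost of naively attending over book-length context becomes clear, which is exactly what the efficient-attention work is trying to tame.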
The transformer might not be the final architecture. But it’s the one that made everything we’re seeing today possible, and understanding it is table stakes for anyone working in this space.