Attention Mechanism

Attention Pooling

Definition

Dataset: \(m\) tuples of keys and values

\[\mathcal{D} \stackrel{\textrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)\}\]

Attention Pooling: a query \(\mathbf{q}\) that operates on (\(\mathbf{k}\), \(\mathbf{v}\)) pairs

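\[\textrm{Attention}(\mathbf{q}, \mathcal{D}) \stackrel{\textrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,\]
where \(\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}\) (\(i = 1, \ldots, m\)) are scalar attention weights.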
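
The weighted sum above can be sketched in a few lines of plain Python. This is only a minimal illustration: the choice of a softmax over dot products for \(\alpha\) is an assumption (the definition leaves the scoring function open), and the names `softmax`, `dot`, and `attention_pooling` are hypothetical helpers, not part of any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))

def attention_pooling(query, keys, values):
    """Return (output, weights) for one query over m (key, value) pairs.

    Assumption: alpha(q, k_i) = softmax_i(q . k_i). Other scoring
    functions fit the same definition.
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    # Weighted sum of the value vectors, one output coordinate at a time.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy dataset D with m = 3 pairs; values are one-dimensional for readability.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# A zero query scores every key equally, so the output is the plain average.
output, weights = attention_pooling([0.0, 0.0], keys, values)
```

Because the weights here are non-negative and sum to 1, the output is a convex combination of the values. A query aligned with one key shifts the weight, and therefore the output, toward that key's value.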

Attention Pooling by Similarity

../../../../_images/am_img1.svg

Kernel Regression

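Non-parametric Model: Captures complex, non-linear relationships without assuming a fixed functional form.
Focus on Similarity: Each prediction weights training points by their similarity to the query, so distant outliers have less influence than in a global parametric fit.
Kernel Function: Kernel shape and bandwidth control how the estimate adapts to local data structure.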
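
As a rough illustration of the points above, the sketch below implements one common form of kernel regression, the Nadaraya-Watson estimator with a Gaussian kernel. The kernel choice, the `bandwidth` parameter name, and the toy data are assumptions made for the example and do not come from the slides. The estimator does not guard against all weights underflowing to zero.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Unnormalized Gaussian kernel; larger bandwidth means a wider window."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def nadaraya_watson(x_query, x_train, y_train, bandwidth=1.0):
    """Predict y at x_query as a kernel-weighted average of y_train.

    The training inputs act as keys, the targets as values, and
    x_query as the query, mirroring attention pooling.
    """
    weights = [gaussian_kernel(x_query - x, bandwidth) for x in x_train]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Toy 1-D data (hypothetical); the last target is deliberately extreme.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.0, 1.0, 4.0, 9.0, 100.0]

# Narrow bandwidth: the prediction at x = 1 is dominated by the point at x = 1.
narrow = nadaraya_watson(1.0, x_train, y_train, bandwidth=0.3)

# Wide bandwidth: all points get similar weight, so the extreme target leaks in.
wide = nadaraya_watson(1.0, x_train, y_train, bandwidth=5.0)
```

With the narrow bandwidth the distant extreme point barely affects the prediction, while the wide bandwidth behaves more like a global average. This is the sense in which the bandwidth controls adaptation to local structure.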

Attention Scoring Function

../../../../_images/am_img2.svg

Sequence-to-sequence Model

../../../../_images/am_img3.svg

The Bahdanau Attention Mechanism

../../../../_images/am_img4.svg

Multi-Head Attention

../../../../_images/am_img5.svg

CNNs, RNNs, and Self-Attention

../../../../_images/am_img6.svg

The Transformer Architecture

../../../../_images/am_img7.svg

The Vision Transformer Architecture

../../../../_images/am_img8.svg

Large-Scale Pretraining with Transformers

Encoder-Only

Pretraining BERT

Fine-Tuning BERT

Encoder–Decoder

Pretraining T5

Fine-Tuning T5

Decoder-Only

GPT-2

GPT-3