Attention Mechanism

Attention Pooling

Definition
  1. Dataset: \(m\) tuples of keys and values

    \[\mathcal{D} \stackrel{\textrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)\}\]
  2. Attention Pooling: a query \(\mathbf{q}\) operates on the (\(\mathbf{k}\), \(\mathbf{v}\)) pairs of \(\mathcal{D}\)

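    \[\textrm{Attention}(\mathbf{q}, \mathcal{D}) \stackrel{\textrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,\]

    where \(\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}\) (\(i = 1, \ldots, m\)) are scalar attention weights.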
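
The definition can be illustrated with a minimal NumPy sketch. The definition leaves \(\alpha\) unspecified, so the sketch assumes one common choice: a softmax over scaled dot products between the query and each key. The function names `softmax` and `attention_pooling` and the toy keys and values are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()

def attention_pooling(q, keys, values):
    """Compute sum_i alpha(q, k_i) * v_i and return it with the weights.

    Assumption (not fixed by the definition): alpha is a softmax over
    the scaled dot products q . k_i / sqrt(d).
    """
    d = q.shape[-1]
    scores = keys @ q / np.sqrt(d)   # one score per key, shape (m,)
    alpha = softmax(scores)          # nonnegative weights that sum to 1
    return alpha @ values, alpha     # weighted sum of values, shape (d_v,)

# Toy dataset D with m = 3 key-value pairs (illustrative numbers only).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
q = np.array([1.0, 1.0])

output, alpha = attention_pooling(q, keys, values)
```

Because the weights are nonnegative and sum to 1 under this choice, the output is a convex combination of the values, pulled toward the value whose key best matches the query (here the third pair).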

Attention Pooling by Similarity

../../../../_images/am_img1.svg

Attention Scoring Function

../../../../_images/am_img2.svg

Sequence-to-sequence Model

../../../../_images/am_img3.svg

The Bahdanau Attention Mechanism

../../../../_images/am_img4.svg

Multi-Head Attention

../../../../_images/am_img5.svg

CNNs, RNNs, and Self-Attention

../../../../_images/am_img6.svg

The Transformer Architecture

../../../../_images/am_img7.svg

The Vision Transformer Architecture

../../../../_images/am_img8.svg

Large-Scale Pretraining with Transformers

  1. Encoder-Only

    Pretraining BERT

    ../../../../_images/am_img9.svg

    Fine-Tuning BERT

    ../../../../_images/am_img10.svg
  2. Encoder–Decoder

    Pretraining T5

    ../../../../_images/am_img11.svg

    Fine-Tuning T5

    ../../../../_images/am_img12.svg
  3. Decoder-Only

    GPT-2

    ../../../../_images/am_img13.svg

    GPT-3

    ../../../../_images/am_img14.svg