Attention Mechanism
Attention Pooling
- Definition
Dataset: \(m\) tuples of keys and values
\[\mathcal{D} \stackrel{\textrm{def}}{=} \{(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)\}\]
Attention Pooling: a query \(\mathbf{q}\) that operates on the (\(\mathbf{k}\), \(\mathbf{v}\)) pairs
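\[\textrm{Attention}(\mathbf{q}, \mathcal{D}) \stackrel{\textrm{def}}{=} \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i,\]
where \(\alpha(\mathbf{q}, \mathbf{k}_i) \in \mathbb{R}\) (\(i = 1, \ldots, m\)) are scalar attention weights.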
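The weighted sum above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the definition leaves \(\alpha\) unspecified, so the sketch assumes one common choice, a softmax over the dot products \(\mathbf{q} \cdot \mathbf{k}_i\). The names `softmax` and `attention_pooling` and the toy keys and values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pooling(query, keys, values):
    """Compute sum_i alpha(q, k_i) * v_i and return it with the weights.

    Assumption (not fixed by the definition): alpha is a softmax
    over the dot products q . k_i.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)
    ]
    return pooled, weights


# Toy database D with m = 3 (key, value) pairs (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# An all-zero query scores every key equally.
query = [0.0, 0.0]
pooled, weights = attention_pooling(query, keys, values)
print(weights, pooled)
```

With the all-zero query every score is equal, so the weights are uniform and the result is the plain average of the values (about 20). A query that aligns more strongly with one key shifts weight toward that key's value.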
Attention Pooling by Similarity
Attention Scoring Function
Sequence-to-sequence Model
The Bahdanau Attention Mechanism
Multi-Head Attention
CNNs, RNNs, and Self-Attention
The Transformer Architecture
The Vision Transformer Architecture
Large-Scale Pretraining with Transformers
Encoder-Only
Pretraining BERT
Fine-Tuning BERT
Encoder–Decoder
Pretraining T5
Fine-Tuning T5
Decoder-Only
GPT-2
GPT-3