Transformer
Word Embedding - nn.Embedding
- Embedding Dim
The dimensionality of the vector representation for each word/token. This determines how much information can be encoded in that single vector.
- Hidden Dim
The dimensionality of the hidden states inside the transformer’s attention layers. It determines the model’s capacity to represent relationships between words in the sequence.
- Padding Idx
During text processing, sequences often need to be padded to a uniform length. The padding_idx identifies which index in your word/token vocabulary is the “padding” symbol. The nn.Embedding layer initializes the vector at padding_idx to zeros and does not update it during training, so every padding token maps to a zero vector (all values are 0). Padded positions therefore contribute nothing through the embedding itself; to keep attention from looking at them, a padding mask (e.g., key_padding_mask in nn.MultiheadAttention) is still needed.
When the embedding dim is smaller than the hidden dim, the embeddings are projected into the higher-dimensional space before computation within the attention mechanisms. In the standard Transformer the two are equal (d_model), so no projection is needed.
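The padding behaviour described above can be checked directly. The following is a minimal sketch; the vocabulary size, embedding dim, and the choice of index 0 as the padding symbol are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes chosen only for illustration.
vocab_size, embedding_dim, pad_idx = 10, 4, 0
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

# Two sequences padded with index 0 to a common length of 4.
tokens = torch.tensor([[5, 2, 7, 0],
                       [3, 0, 0, 0]])
vectors = embedding(tokens)                  # shape: (2, 4, 4)

# Every position that holds the padding index maps to an all-zero vector.
pad_rows = vectors[tokens == pad_idx]
print(pad_rows)

# The padding row receives a zero gradient, so it stays zero during training.
vectors.sum().backward()
print(embedding.weight.grad[pad_idx])
```

Both printed tensors contain only zeros, while the rows for real tokens are random and trainable.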
Positional Encoding - Custom nn.Module
- Model dim
The dimensionality of the word embeddings (d_model). The positional encoding must have the same dimensionality so that it can be added to the embeddings element-wise.
- Dropout
Transformer models have no inherent notion of word order; positional encodings are the sole source of sequential information. This creates a risk that the model relies too heavily on exact positional representations, which can hinder generalization. Applying dropout to the sum of the word embeddings and the positional encodings regularizes against this.
Positional Embedding - Word Embedding + Positional Encoding
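import torch.nn as nn
# PositionalEncoding is the custom nn.Module described above;
# vocab_size, d_model and input_tokens are assumed to be defined.
embedding = nn.Embedding(vocab_size, d_model)
pos_encoder = PositionalEncoding(d_model)
x = embedding(input_tokens)   # token ids -> vectors of size d_model
x = pos_encoder(x)            # add positional encoding, then apply dropout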
input -> embedding layer -> pos_encoding layer -> dropout layer -> output
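The notes do not say which encoding the custom PositionalEncoding module uses. The sketch below assumes the fixed sinusoidal encoding, a batch-first input of shape (batch, seq_len, d_model), and an even d_model; the max_len argument and the sizes in the usage lines are hypothetical.

```python
import math
import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding followed by dropout (illustrative sketch)."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)            # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )                                                        # (d_model / 2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)             # even feature indices
        pe[:, 1::2] = torch.cos(position * div_term)             # odd feature indices
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))              # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)


# Usage: input -> embedding layer -> pos_encoding layer -> dropout layer -> output
vocab_size, d_model = 100, 16                                    # hypothetical sizes
embedding = nn.Embedding(vocab_size, d_model)
pos_encoder = PositionalEncoding(d_model, dropout=0.1)
pos_encoder.eval()                                               # disable dropout for a deterministic check
input_tokens = torch.randint(0, vocab_size, (2, 7))
x = pos_encoder(embedding(input_tokens))
print(x.shape)                                                   # torch.Size([2, 7, 16])
```

Registering the encoding as a buffer rather than a parameter keeps it fixed during training, which matches the idea of a non-learned encoding; a learned positional embedding would be a different design.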
Self-Attention
The model attends to different positions within the same input sequence. The query, key, and value vectors for each word/token are all derived from that one sequence: from the word/token embeddings in the first layer, and from the previous layer’s output in later layers. Passing the same tensor as the query, key, and value to nn.MultiheadAttention makes it perform self-attention with multiple heads.
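To make the query/key/value idea concrete, here is a minimal single-head sketch of scaled dot-product self-attention. The sizes and the names w_q, w_k, w_v are assumptions for illustration; nn.MultiheadAttention does the equivalent work internally for several heads at once.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, d_model = 5, 8                          # hypothetical sizes
x = torch.randn(1, seq_len, d_model)             # (batch, seq_len, d_model)

# Separate learned projections produce Q, K and V from the same input x.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)
q, k, v = w_q(x), w_k(x), w_v(x)

# Similarity of every position with every other position, scaled by sqrt(d_k).
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
output = weights @ v                                    # (1, seq_len, d_model)

print(weights.shape, output.shape)
```

Row i of weights says how much position i draws from every position of the same sequence, which is what "self" refers to.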
Multi-Head Attention - nn.MultiheadAttention
- num_heads
Number of parallel attention heads. Note that embed_dim is split across num_heads, so each head has dimension embed_dim // num_heads and embed_dim must be divisible by num_heads.
Each head operates on smaller, projected representations of the input. These multiple parallel heads enable the model to focus on different aspects or ‘subspaces’ of the input simultaneously, enriching the representations it learns.
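A short sketch of multi-head self-attention with nn.MultiheadAttention follows. The sizes are hypothetical, and batch_first=True is an assumption about the tensor layout rather than something the notes specify.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_dim, num_heads = 16, 4                     # hypothetical sizes
head_dim = embed_dim // num_heads                # each head works on 4 dimensions
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 6, embed_dim)                 # (batch, seq_len, embed_dim)

# Self-attention: the same tensor is passed as query, key and value.
attn_output, attn_weights = mha(x, x, x)

print(attn_output.shape)    # torch.Size([2, 6, 16])
print(attn_weights.shape)   # torch.Size([2, 6, 6]), averaged over heads by default
```

The output keeps the input shape because the per-head results are concatenated and projected back to embed_dim.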
Masked Multi-Head Attention (Decoder) - Causal Attention
- Autoregressive Generation
The decoder generates text (or other sequences) one token at a time. During generation, it must avoid attending to future tokens it hasn’t yet predicted – this would “leak” information and break the logic of the model.
- Causal Generation
Masked multi-head attention enforces a causal language modeling structure, ensuring the decoder’s prediction at each step relies only on previous tokens.
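One common way to enforce this in PyTorch is a boolean upper-triangular mask passed as attn_mask, where True marks positions that may not be attended to. The sketch below uses hypothetical sizes and assumes a batch-first layout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, embed_dim, num_heads = 4, 8, 2          # hypothetical sizes

# True marks positions that may NOT be attended to (everything above the diagonal),
# so position i can only see positions 0..i.
causal_mask = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
print(causal_mask)

mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(1, seq_len, embed_dim)
_, weights = mha(x, x, x, attn_mask=causal_mask)

# Attention weights on future positions are exactly zero.
print(weights[0])
```

The first row of the weights puts all of its mass on position 0, since the first token has nothing earlier to attend to.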
Encoder-Decoder Attention (Decoder) - Cross-Attention
- Encoder Outputs
The final output of the encoder contains rich contextual representations for each token in the input sequence. These outputs serve as the keys (K) and values (V) for the decoder’s second multi-head attention layer (the cross-attention).
- Decoder Self-Attention Outputs
The output of the decoder’s first (masked) multi-head attention layer. It carries information about what the decoder has generated so far and serves as the query (Q) for the decoder’s second multi-head attention layer.
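The roles of Q, K, and V in cross-attention can be sketched as below. The tensor names and sizes are hypothetical stand-ins for the real encoder output and decoder state; a batch-first layout is assumed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_dim, num_heads = 16, 4                     # hypothetical sizes
cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

encoder_output = torch.randn(2, 9, embed_dim)    # (batch, source_len, embed_dim)
decoder_hidden = torch.randn(2, 5, embed_dim)    # (batch, target_len, embed_dim)

# Queries come from the decoder; keys and values come from the encoder.
out, weights = cross_attn(query=decoder_hidden, key=encoder_output, value=encoder_output)

print(out.shape)       # torch.Size([2, 5, 16]): one vector per target position
print(weights.shape)   # torch.Size([2, 5, 9]): target positions x source positions
```

The source and target lengths may differ; only the embedding dimension has to match.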
Residual Connection
The output of a sub-layer (for example, multi-head attention) is added back to that sub-layer’s original input before proceeding to the next layer. This helps stabilize training and lets gradients flow more easily through deep stacks of layers.
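A minimal sketch of the idea, using a hypothetical helper named residual and arbitrary sizes. The sub-layer must return a tensor of the same shape as its input so the two can be added.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def residual(sublayer, x):
    """Return x + sublayer(x); the sublayer must preserve the shape of x."""
    return x + sublayer(x)


embed_dim, num_heads = 8, 2                      # hypothetical sizes
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(1, 3, embed_dim)

# Attention output added back to the attention input.
y = residual(lambda t: mha(t, t, t)[0], x)

# If a sublayer contributes nothing, the input passes through unchanged.
passthrough = residual(torch.zeros_like, x)
print(torch.equal(passthrough, x))               # True
```

The pass-through case shows why the connection helps: the identity path always carries the input (and its gradient) forward, whatever the sub-layer does.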
Layer Normalization
- Focus on a Single Sample
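Layer Normalization operates independently on each sample within a batch; in a transformer, this means each token’s feature vector is normalized on its own, without using statistics from other tokens or other sentences.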
- Element-Wise Standardization
Here an “element” is a feature (one dimension of the vector). By normalizing across the features of each data point, LN keeps any single feature from dominating the subsequent calculations in the layer, which makes training more stable.
- Batch Independence
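Since each sample is normalized independently, varying batch sizes or input sequence lengths do not affect Layer Normalization.
The computation has three steps: calculate the mean over the features, calculate the variance over the features, and normalize with them (a learned scale and shift are then applied).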
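The three steps can be written out by hand and compared with nn.LayerNorm. This is a sketch with hypothetical sizes; eps = 1e-5 matches the nn.LayerNorm default, and a freshly created nn.LayerNorm has scale 1 and shift 0, so the two results should agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8                                      # hypothetical size
x = torch.randn(2, 5, d_model)                   # (batch, seq_len, d_model)

# Statistics are computed over the last (feature) dimension, separately per token.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
eps = 1e-5
manual = (x - mean) / torch.sqrt(var + eps)

layer_norm = nn.LayerNorm(d_model)               # learnable scale starts at 1, shift at 0
builtin = layer_norm(x)

print(torch.allclose(manual, builtin, atol=1e-5))
```

Because the mean and variance have shape (2, 5, 1), changing the batch size or sequence length changes how many tokens are normalized but not how any single token is normalized.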
Positionwise Feed-Forward Network
- Position-Wise
Processes each token’s embedding vector independently.
- Non-Linear Transformation
While self-attention excels at capturing relationships within a sequence, its output at each position is only a weighted average (a linear combination) of value vectors. The FFN adds a further non-linearity to the model, allowing it to learn more complex transformations of the representations produced by self-attention.
- Feature Processing
The FFN acts independently upon each token’s representation from the preceding self-attention layer. This means it can refine features within each position of the sequence.
- Feed-Forward Network
The FFN in a Transformer operates independently on the representation of each token in the sequence. You can think of the same small feed-forward network, with shared weights, being applied to every token separately.
- Sequential Linear Transformations
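The core of the FFN comprises two linear (fully-connected) layers with a non-linear activation (such as ReLU) between them. The first layer expands each token’s vector to a larger hidden dimension and the second projects it back to the model dimension; the data for each token passes through these transformations sequentially.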
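The description above can be sketched as a small module. The class name, the ReLU activation, the dropout placement, and the sizes are assumptions for illustration. The check at the end confirms the position-wise property: running the whole sequence through the network gives the same result as running each token through it separately.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    """Two linear layers with a non-linearity in between (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),    # expand to the hidden dimension
            nn.ReLU(),                   # non-linear activation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),    # project back to the model dimension
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.Linear acts on the last dimension, so every position is processed
        # with the same weights and without looking at other positions.
        return self.net(x)


torch.manual_seed(0)
ffn = PositionwiseFeedForward(d_model=8, d_ff=32).eval()   # eval() disables dropout
x = torch.randn(2, 5, 8)                                   # (batch, seq_len, d_model)

full = ffn(x)
per_token = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)
print(torch.allclose(full, per_token, atol=1e-5))
```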
Targets Shifted Right
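During training, the decoder needs to know, at each position, the words that came before the one it must predict. Shifting the target sequence right by one position and inserting a “start” token at the front produces the decoder input, while the unshifted target sequence serves as the labels to predict.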
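A minimal sketch of the shift, assuming hypothetical special-token ids (BOS = 1, EOS = 2, PAD = 0) and a helper named shift_right; real vocabularies will use their own ids.

```python
import torch

BOS, EOS, PAD = 1, 2, 0                          # hypothetical special-token ids

# Target sequences ending in EOS, padded to the same length.
target = torch.tensor([[11, 12, 13, EOS],
                       [21, 22, EOS, PAD]])


def shift_right(target: torch.Tensor, bos_id: int) -> torch.Tensor:
    """Prepend the start token and drop the last position of each sequence."""
    bos = torch.full((target.size(0), 1), bos_id, dtype=target.dtype)
    return torch.cat([bos, target[:, :-1]], dim=1)


decoder_input = shift_right(target, BOS)
print(decoder_input.tolist())   # [[1, 11, 12, 13], [1, 21, 22, 2]]
print(target.tolist())          # labels: [[11, 12, 13, 2], [21, 22, 2, 0]]
```

At position t the decoder sees decoder_input[:, : t + 1] and is trained to predict target[:, t], so combined with the causal mask it never sees the token it is asked to predict.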