Transformer architecture fundamentals. Covers self-attention mechanism, multi-head attention, feed-forward networks, layer normalization, and residual connections. Essential concepts for understanding LLMs.
# Transformer Architecture
## Overview
The Transformer architecture is the foundation of modern LLMs. Understanding its components helps with fine-tuning decisions, model selection, and debugging performance issues.
## Quick Reference
| Component | Purpose |
|-----------|---------|
| Self-Attention | Learn relationships between tokens |
| Multi-Head Attention | Multiple attention perspectives |
| Feed-Forward Network | Transform representations |
| Layer Normalization | Stabilize training |
| Residual Connections | Enable deep networks |
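How these pieces fit together: the sketch below is a minimal, illustrative encoder-style block, with each sub-layer wrapped in a residual connection followed by layer normalization. The sizes (`embed_dim=64`, `num_heads=4`, `ffn_dim=256`), the GELU activation, and the post-norm ordering are assumptions chosen for demonstration, not values prescribed by this skill.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder-style block: self-attention + feed-forward network,
    each with a residual connection and layer normalization (post-norm).
    All sizes below are illustrative assumptions."""
    def __init__(self, embed_dim=64, num_heads=4, ffn_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Residual connection around multi-head self-attention
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Residual connection around the feed-forward network
        x = self.norm2(x + self.ffn(x))
        return x

block = TransformerBlock()
x = torch.randn(1, 6, 64)    # (batch, seq_length, embed_dim)
print(block(x).shape)        # torch.Size([1, 6, 64])
```

The sections below walk through the individual components, starting with self-attention.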
## Self-Attention Mechanism
### Concept
Self-attention allows each token to attend to all other tokens in a sequence, learning contextual relationships.
```
"The cat sat on the mat"
↓
Each word attends to every other word
↓
Contextual representations
```
### Implementation
```python
import torch
import torch.nn.functional as F

# Example tokens
tokens = ["The", "cat", "sat", "on", "the", "mat"]
seq_length = len(tokens)
embed_dim = 8

# Random embeddings (in practice, learned)
embeddings = torch.randn(seq_length, embed_dim)

# Query, Key, Value weight matrices
W_q = torch.randn(embed_dim, embed_dim)
W_k = torch.randn(embed_dim, embed_dim)
W_v = torch.randn(embed_dim, embed_dim)

# Compute Q, K, V
Q = embeddings @ W_q  # Queries: what am I looking for?
K = embeddings @ W_k  # Keys: what do I contain?
V = embeddings @ W_v  # Values: what information do I provide?

# Attention scores
scores = Q @ K.T / (embed_dim ** 0.5)  # Scale by sqrt(d_k)

# Softmax for attention weights
attention_weights = F.softmax(scores, dim=-1)

# Weighted sum of values
output = attention_weights @ V

print(f"Input shape: {embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
```
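Because the weights above are random, the attention pattern is arbitrary, but printing the matrix row by row (continuing the snippet above) makes the structure visible: each row is one token's attention distribution over all tokens and sums to 1.

```python
# Continuing from the snippet above: one row per query token
for i, token in enumerate(tokens):
    row = attention_weights[i].tolist()
    pairs = ", ".join(f"{t}: {w:.2f}" for t, w in zip(tokens, row))
    print(f"{token:>4} -> {pairs}")
```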
### Attention Formula
```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
```
Where:
- Q = Query matrix
- K = Key matrix
- V = Value matrix
- d_k = Key dimension (for scaling)
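The formula translates directly into a few lines of code. The helper below is an illustrative sketch of the same computation; the comparison against PyTorch's fused `torch.nn.functional.scaled_dot_product_attention` assumes PyTorch 2.x.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    return F.softmax(scores, dim=-1) @ V

# Toy inputs: (batch, seq_length, d_k)
Q = torch.randn(1, 6, 8)
K = torch.randn(1, 6, 8)
V = torch.randn(1, 6, 8)

out = scaled_dot_product_attention(Q, K, V)

# PyTorch 2.x ships the same operation as a fused kernel
ref = F.scaled_dot_product_attention(Q, K, V)
print(torch.allclose(out, ref, atol=1e-6))  # True
```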
## Multi-Head Attention