
transformers


Transformer architecture fundamentals. Covers self-attention mechanism, multi-head attention, feed-forward networks, layer normalization, and residual connections. Essential concepts for understanding LLMs.


Marketplace: bazzite-ai-plugins (atrawog/bazzite-ai-plugins)

Plugin: bazzite-ai-jupyter (development)

Repository: atrawog/bazzite-ai-plugins (bazzite-ai-jupyter/skills/transformers/SKILL.md)

Last Verified: January 21, 2026

Install Skill

npx add-skill https://github.com/atrawog/bazzite-ai-plugins/blob/main/bazzite-ai-jupyter/skills/transformers/SKILL.md -a claude-code --skill transformers

Installation path (Claude): .claude/skills/transformers/

Instructions

# Transformer Architecture

## Overview

The Transformer architecture is the foundation of modern LLMs. Understanding its components helps with fine-tuning decisions, model selection, and debugging performance issues.

## Quick Reference

| Component | Purpose |
|-----------|---------|
| Self-Attention | Learn relationships between tokens |
| Multi-Head Attention | Multiple attention perspectives |
| Feed-Forward Network | Transform representations |
| Layer Normalization | Stabilize training |
| Residual Connections | Enable deep networks |
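
The components in the table compose into a single encoder block: self-attention and the feed-forward network are each wrapped in a residual connection and paired with layer normalization. The sketch below is illustrative only (a pre-norm layout is assumed, and the class name `TransformerBlock` and the dimensions are placeholders, not part of this skill):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Illustrative pre-norm encoder block combining the components above."""

    def __init__(self, embed_dim: int = 8, num_heads: int = 2, ffn_dim: int = 32):
        super().__init__()
        # Multi-head attention: several attention perspectives in parallel
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Feed-forward network: position-wise transformation of each token
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, embed_dim),
        )
        # Layer normalization: stabilizes activations entering each sub-layer
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection around self-attention
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out
        # Residual connection around the feed-forward network
        x = x + self.ffn(self.norm2(x))
        return x

block = TransformerBlock()
x = torch.randn(1, 6, 8)   # (batch, sequence length, embedding dim)
print(block(x).shape)      # torch.Size([1, 6, 8])
```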

## Self-Attention Mechanism

### Concept

Self-attention allows each token to attend to all other tokens in a sequence, learning contextual relationships.

```
"The cat sat on the mat"
       ↓
  Each word attends to every other word
       ↓
  Contextual representations
```

### Implementation

```python
import torch
import torch.nn.functional as F

# Example tokens
tokens = ["The", "cat", "sat", "on", "the", "mat"]
seq_length = len(tokens)
embed_dim = 8

# Random embeddings (in practice, learned)
embeddings = torch.randn(seq_length, embed_dim)

# Query, Key, Value weight matrices
W_q = torch.randn(embed_dim, embed_dim)
W_k = torch.randn(embed_dim, embed_dim)
W_v = torch.randn(embed_dim, embed_dim)

# Compute Q, K, V
Q = embeddings @ W_q  # Queries: what am I looking for?
K = embeddings @ W_k  # Keys: what do I contain?
V = embeddings @ W_v  # Values: what information do I provide?

# Attention scores
scores = Q @ K.T / (embed_dim ** 0.5)  # Scale by sqrt(d_k)

# Softmax for attention weights
attention_weights = F.softmax(scores, dim=-1)

# Weighted sum of values
output = attention_weights @ V

print(f"Input shape: {embeddings.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
```
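
If PyTorch 2.0 or newer is available (an assumption, not a requirement of this skill), the manual computation can be cross-checked against the built-in `F.scaled_dot_product_attention` kernel. A minimal sketch with fresh random Q, K, V:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_length, embed_dim = 6, 8

# Random Q, K, V (standing in for the projected embeddings above)
Q = torch.randn(seq_length, embed_dim)
K = torch.randn(seq_length, embed_dim)
V = torch.randn(seq_length, embed_dim)

# Manual scaled dot-product attention, as in the snippet above
manual = F.softmax(Q @ K.T / (embed_dim ** 0.5), dim=-1) @ V

# Built-in kernel; a batch dimension is added because it expects (..., seq, dim) inputs
builtin = F.scaled_dot_product_attention(
    Q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0)
).squeeze(0)

print(torch.allclose(manual, builtin, atol=1e-5))  # True: both compute softmax(QK^T / sqrt(d_k)) V
```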

### Attention Formula

```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
```

Where:

- Q = Query matrix
- K = Key matrix
- V = Value matrix
- d_k = Key dimension (for scaling)

## Multi-Head Attention
