
📦 Model Size & Parameters

  • DistilGPT2: ~82M parameters
  • GPT-2 medium: ~345M
  • GPT-2 large: ~762M
  • GPT-2 XL: ~1.5B
  • GPT-3: up to 175B

Parameters are the weights that store learned patterns. More parameters = more capacity for robust knowledge.
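
A quick way to check these counts yourself, assuming the Hugging Face `transformers` library (published figures sometimes exclude embeddings, so the printed totals may differ slightly from the numbers above):

```python
# Count the parameters of a couple of the checkpoints listed above.
from transformers import AutoModelForCausalLM

for name in ["distilgpt2", "gpt2-medium"]:  # "gpt2-large", "gpt2-xl" also work
    model = AutoModelForCausalLM.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```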


🧠 What Does “Encoding Knowledge” Mean?

Models don’t store facts like "Paris → France" as text.
Instead, training shapes weights so that given "Paris is the capital of", the model pushes “France” to the top of the probability list.
That mapping is knowledge encoded in weights.
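
A minimal sketch of this probing, assuming the Hugging Face `transformers` library and the `distilgpt2` checkpoint (the probabilities it prints are real model output, so they will differ from the illustrative numbers in the next section):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

inputs = tokenizer("Paris is the capital of", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)   # next-token distribution
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p.item():.3f}")
```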


🎲 Probability Distributions

At every step, the model produces logits → softmax → probabilities over all tokens:

  • France: 0.40
  • United States: 0.20
  • Spain: 0.05

This distribution assigns a probability to every token in the vocabulary; the three candidates shown don’t sum to 1 because the remaining probability mass is spread across all other tokens.


📉 Flatter vs. Sharper Distributions

  • Sharp distribution → one token dominates (confident, deterministic).
  • Flat distribution → probabilities spread out (less certain).

Smaller models often produce flatter distributions, hence more wandering.
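
One way to make “flat vs. sharp” concrete is Shannon entropy. A sketch with two made-up four-token distributions:

```python
import torch

def entropy(p):
    # Shannon entropy in bits; lower = more concentrated (sharper).
    return -(p * torch.log2(p)).sum().item()

sharp = torch.tensor([0.90, 0.05, 0.03, 0.02])
flat = torch.tensor([0.30, 0.25, 0.25, 0.20])

print(entropy(sharp))  # ~0.62 bits (confident)
print(entropy(flat))   # ~1.99 bits (uncertain)
```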


🎲 Sampling & Randomness

  • Greedy decoding: always pick the max → deterministic.
  • Sampling: draw from the distribution → variety each run.
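
A toy sketch of the difference, using a hypothetical four-token distribution:

```python
import torch

torch.manual_seed(0)  # fix the dice so this demo is reproducible
probs = torch.tensor([0.40, 0.20, 0.05, 0.35])  # hypothetical next-token distribution

# Greedy decoding: always the argmax, so the same pick every run.
print(torch.argmax(probs).item())  # 0, deterministically

# Sampling: draw from the distribution, so picks vary run to run
# (unless the seed is fixed, as above).
print(torch.multinomial(probs, num_samples=5, replacement=True))
```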

Temperature

Controls the sharpness of the distribution by dividing the logits by T before the softmax:

  • Low T (0.2): sharper, more deterministic.
  • High T (1.2): flatter, more exploratory.

👉 So yes: sampling = variety of outcomes.
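
A small sketch with hypothetical logits, showing how dividing by T reshapes the same distribution:

```python
import torch

logits = torch.tensor([2.0, 1.3, -0.7])  # hypothetical logits for three tokens

for T in (0.2, 1.0, 1.2):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")

# T=0.2 -> ~[0.971, 0.029, 0.000]  (sharp, near-deterministic)
# T=1.2 -> ~[0.601, 0.335, 0.063]  (flatter, more exploratory)
```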


📏 Where Do Numbers Like 0.4 Come From?

Softmax of the logits:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Training adjusts weights so correct continuations (“France”) earn higher logits.
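
For example, hypothetical logits $z = (2.0, 1.3, -0.7)$ give $p \approx (0.64, 0.32, 0.04)$; a logit gap of 0.7 roughly doubles a token’s probability, since $e^{0.7} \approx 2$.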


🧾 Key Terms

  • Determinism: same input → same output (if no randomness).
  • Low entropy: distribution concentrated on few tokens.
  • Ablation: turning off part of the network (e.g. attention head).
  • c_proj: projection layer that recombines head outputs.
  • Attention head: one of several parallel channels in self-attention, each of which can specialize in different patterns.
  • Sampling noise: randomness when drawing from probabilities.
  • Randomness in token choice: the dice roll at each step.

⚠️ Ablation is not used in normal text generation. It’s a research tool.


⚡ Examples from Experiments

  • Head ablation: Zero out a head → output changes.
  • Noise injection: Add Gaussian noise → completions differ.
  • Style bias: Boost comma token → more comma-heavy text.
  • Logit lens: Peek at early layer predictions.

These tweaks show which parts of the model matter most.
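
A rough sketch of the first experiment, assuming the Hugging Face `transformers` GPT-2 implementation (where `c_proj` receives the concatenated head outputs); the layer and head indices are arbitrary illustrations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

LAYER, HEAD = 0, 3                                     # arbitrary choices
head_dim = model.config.n_embd // model.config.n_head  # 768 / 12 = 64

def zero_head(module, inputs):
    # c_proj's input is the concatenation of all head outputs;
    # zeroing one head's slice ablates that head.
    hidden = inputs[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0
    return (hidden,)

hook = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(zero_head)

prompt = tokenizer("Paris is the capital of", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=10, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))

hook.remove()  # restore normal behavior
```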


🗺️ Why Small Models Wander

  • Fewer parameters → less robust encoding.
  • Flatter probability distributions → more uncertainty.
  • Sampling with temperature → amplifies variety.

Put together, small models + sampling = wandering outputs.
