Module 2: How Does Temperature Work?
The Experimenter — Find the phase transition
Duration: 60 min | Difficulty: Beginner | Prerequisite: Module 1
The Aha Moment
Temperature isn't a creativity dial — it's a probability redistribution. And it doesn't degrade gradually. There's a cliff.
Students discover that temperature controls a mathematical transformation of the probability distribution over the vocabulary. Small changes near the cliff produce dramatic quality shifts — a phase transition, not a linear scale.
Conceptual Background
What temperature actually does
After the model computes a score (logit) for every token in the vocabulary, these scores are passed through a softmax function to produce probabilities. Temperature modifies this softmax:
Standard softmax:
P(token_i) = exp(z_i) / sum_j exp(z_j)

Softmax with temperature T:

P(token_i) = exp(z_i / T) / sum_j exp(z_j / T)

The effect:
- T < 1 — Divides logits by a number less than 1, making large logits even larger relative to small ones. The distribution becomes sharper (more peaked). The top token dominates.
- T = 1 — No modification. The distribution is as the model computed it.
- T > 1 — Divides logits by a number greater than 1, compressing all logits toward zero. The distribution becomes flatter (more uniform); as T grows, all tokens approach equal probability.
- T → 0 — The distribution collapses to a single point. Only the highest-logit token has non-zero probability. This is greedy decoding.
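The two formulas can be checked directly in a few lines of Python. This is a minimal sketch, not any library's actual sampler; the logit values are invented for illustration:

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """P(token_i) = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # hypothetical scores for a 3-token vocabulary
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    # The top token's share grows as T shrinks and shrinks as T grows
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

Running this shows the same logits yielding a peaked distribution at T=0.5 and a much flatter one at T=2.0 — the whole mechanism is one division before the exponentials.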
Why this matters
Temperature is the most commonly adjusted parameter when using LLMs, but it's widely misunderstood. People describe it as "creativity vs accuracy" — but that's a simplification. What it actually controls is the entropy of the sampling distribution.
Low entropy = predictable, repetitive output. High entropy = diverse, surprising, but potentially nonsensical output. The right setting depends entirely on the task.
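"Entropy of the sampling distribution" can be made concrete by computing it for the same logits at several temperatures. A small sketch (the logits are invented for illustration):

```python
import math

def softmax(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: 0 = fully deterministic, log2(V) = uniform."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [5.0, 3.0, 1.0, 0.5]  # hypothetical logits for 4 tokens
for T in (0.2, 0.7, 1.0, 2.0):
    print(f"T={T}: entropy = {entropy_bits(softmax(logits, T)):.3f} bits")
```

Entropy rises monotonically with T for any non-degenerate set of logits, which is exactly the "predictable vs diverse" trade-off in one number.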
The phase transition
Unlike a volume dial that smoothly goes from quiet to loud, temperature exhibits a phase transition. Output quality stays high across a wide range (T=0 to T≈0.8-1.0), then drops off sharply in a narrow band. This is because:
- At low temperatures, the top token holds so much probability that sampling is nearly deterministic regardless of the random draw
- At moderate temperatures, the top 3-5 tokens share most of the probability — still reasonable choices
- At a critical point, enough probability leaks to implausible tokens that the model starts generating incoherent text
- Beyond the cliff, output becomes essentially random
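The "leakage" described above can be quantified: track how much probability mass sits outside the plausible tokens as T rises. The sketch below uses an invented vocabulary with three plausible tokens and a long tail of implausible ones:

```python
import math

def softmax(logits, T):
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return sorted((e / s for e in exps), reverse=True)

# One strong candidate, two plausible ones, 97 implausible tail tokens
logits = [6.0, 4.5, 4.0] + [0.0] * 97

for T in (0.5, 1.0, 1.5, 2.5):
    probs = softmax(logits, T)
    tail = sum(probs[3:])  # probability mass on the implausible tokens
    print(f"T={T}: tail mass = {tail:.3f}")
```

The tail mass stays negligible at low T, then grows steeply over a narrow temperature band — the same non-linear behavior the exercises below ask you to find empirically.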
Other sampling methods
Temperature is not the only way to control token selection. Modern LLMs support several sampling strategies that can be combined:
| Method | What it does | When to use |
|---|---|---|
| Greedy (T=0) | Always picks the highest-probability token | Factual answers, deterministic output |
| Temperature | Reshapes the probability distribution | General-purpose control of output variety |
| Top-k | Samples from only the k highest-probability tokens | Simple diversity control, fixed candidate set |
| Top-p (nucleus) | Samples from the smallest set of tokens whose cumulative probability exceeds p | Adaptive diversity — more options when uncertain, fewer when confident |
| Min-p | Removes tokens with probability below a fraction of the top token | Newer alternative to top-p, more intuitive threshold |
| Mirostat | Dynamically adjusts sampling to maintain a target perplexity | Consistent "surprise level" regardless of context |
| Repetition penalty | Reduces probability of recently generated tokens | Prevents loops and repetitive text |
These methods compose
In practice, multiple methods are applied in sequence: temperature first (reshapes distribution), then top-k or top-p (truncates the distribution), then sampling from what remains. LLMxRay's Compare feature lets you test different combinations side by side.
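The sequence above — temperature, then truncation, then a random draw — can be sketched in plain Python. This is an illustrative pipeline, not LLMxRay's or any library's actual implementation, and the function name and defaults are assumptions:

```python
import math
import random

def sample_token(logits, T=0.7, top_k=50, top_p=0.9, seed=None):
    """Sketch of a common sampling pipeline: temperature reshapes,
    top-k and top-p truncate, then one token is drawn from what remains."""
    rng = random.Random(seed)
    # 1. Temperature: divide logits by T, then softmax
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda ip: ip[1], reverse=True)
    # 2. Top-k: keep only the k most probable tokens
    probs = probs[:top_k]
    # 3. Top-p: smallest prefix whose cumulative probability reaches p
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # 4. Renormalise the survivors and draw one token
    norm = sum(p for _, p in kept)
    r, acc = rng.random() * norm, 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]
```

Note the order matters: temperature is applied before truncation, so a high T can push borderline tokens into the nucleus that top-p would otherwise discard.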
Hands-On Exercises
Exercise 1: The temperature sweep
What to do:
- Open Compare in LLMxRay
- Click Temperature Sweep preset — this creates 3 slots with temperatures 0.2, 0.7, and 1.2
- Add a 4th slot manually and set it to 2.0
- Use this prompt: "Write a function in Python to check if a number is prime"
- Click Run All and watch all 4 generate simultaneously
- Compare the results in Grid View
What to observe:
- T=0.2: Clean, standard implementation. Almost identical if you run it again.
- T=0.7: Slight variations in variable names or comments. Still correct.
- T=1.2: More creative approaches (maybe a different algorithm), but watch for subtle bugs.
- T=2.0: Variable names become strange, logic errors appear, possibly incomplete.
Switch to Diff View to see exactly which words changed between the outputs.
Key question: Between which two temperatures does the quality drop feel sharpest?
Exercise 2: Finding the cliff
What to do:
- Stay in Compare. Set up 4 slots with temperatures: 0.7, 0.9, 1.1, 1.3
- Prompt: "Write a function in Python to check if a number is prime"
- Run 3 times. For each run, mark whether each slot produced correct code (Yes/No)
- Record your results in a table:
| Temperature | Run 1 | Run 2 | Run 3 | Correct rate |
|---|---|---|---|---|
| 0.7 | | | | /3 |
| 0.9 | | | | /3 |
| 1.1 | | | | /3 |
| 1.3 | | | | /3 |
- The cliff is where the correct rate drops sharply — usually between T=0.9 and T=1.2
Why this matters:
This demonstrates that temperature isn't a linear trade-off. There's a narrow band where output goes from "almost always correct" to "usually wrong." Finding this cliff for your specific model and task is one of the most practical skills in prompt engineering.
Exercise 3: Temperature vs task type
What to do:
- Set up 2 slots: both using the same model, one at T=0.2, the other at T=1.0
- Factual prompt: "What is the capital of France?"
- Run it. Both should say "Paris." When the model is this confident, temperature barely changes the output.
- Creative prompt: "Write a haiku about debugging code at midnight"
- Run it 3 times. Compare the variety.
- At T=0.2, you'll get nearly the same haiku every time.
- At T=1.0, you'll get genuinely different creative expressions.
- Reasoning prompt: "A farmer has 15 sheep. All but 8 run away. How many sheep does the farmer have left?"
- This is a trick question (answer: 8, not 7). Test at both temperatures.
- Does temperature affect reasoning accuracy?
Discussion: Why is there no single "best" temperature? What temperature would you choose for:
- A customer support chatbot?
- A creative writing assistant?
- A code completion tool?
- A medical information system?
Exercise 4: Determinism and seeds
What to do:
- In Compare, use the Deterministic Pair preset — two slots with the same model, same settings, same seed, T=0
- Run the same prompt 3 times
- All outputs should be identical — at T=0 decoding is greedy, so there is no randomness for the seed to control
- Now change one slot to T=0.7 (keep the same seed)
- Run again 3 times — with the same seed, the T=0.7 slot stays reproducible across runs, though its output now differs from the T=0 slot. Clear the seed and the outputs will vary from run to run.
What this reveals:
- At T=0 (greedy), the seed doesn't matter — there's no randomness to control
- At T>0, the seed controls which random path is taken through the distribution
- Same seed + same temperature = reproducible "randomness"
- This is how researchers ensure experiment reproducibility while still using stochastic decoding
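The seed's role is easy to show with Python's standard random module, standing in for a sampler's RNG (the distribution below is invented):

```python
import random

probs = [0.6, 0.3, 0.1]  # hypothetical sampling distribution at some T > 0

def sample_run(seed, n=5):
    """Draw n tokens with a seeded RNG — one 'random path' through the distribution."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=n)

# Same seed + same distribution -> identical "random" outputs
assert sample_run(1234) == sample_run(1234)

# Greedy decoding (T=0) is just argmax: no RNG involved, so the seed is irrelevant
greedy = max(range(len(probs)), key=lambda i: probs[i])
print(greedy)  # token 0, every time
```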
Why reproducibility matters
In research and production, you need to reproduce specific outputs for debugging, comparison, and audit. Setting a fixed seed with a moderate temperature gives you controlled variety — different from greedy (always the same) but reproducible when needed.
Key Takeaways
- Temperature is a mathematical operation — it divides logits by T before softmax, reshaping the probability distribution
- The cliff is real — quality doesn't degrade linearly; there's a sharp transition where output goes from reliable to chaotic
- There's no universal best temperature — the optimal value depends on the task (factual, creative, reasoning)
- Temperature composes with other methods — top-k, top-p, and repetition penalty work alongside temperature
- Seeds enable reproducibility — fixed seed + fixed temperature = same output every time
Discussion Questions
- If a company deploys a chatbot at T=0 for safety, what do they lose? Is deterministic output always "safer"?
- The cliff position varies by model. Why might a 70B model have a higher cliff temperature than a 3B model?
- Creative writing assistants often use T=0.8-1.0. But whose creativity is it — the user's or the model's? Does temperature change this?
- If you could only adjust one parameter (temperature, top-k, or top-p), which would you choose and why?
Further Reading
Academic Papers
| Paper | Authors | Year | Link |
|---|---|---|---|
| Distilling the Knowledge in a Neural Network | Hinton, Vinyals, Dean | 2015 | arXiv:1503.02531 |
| The Curious Case of Neural Text Degeneration | Holtzman, Buys, Du, Forbes, Choi | 2019 | arXiv:1904.09751 |
| Hierarchical Neural Story Generation | Fan, Lewis, Dauphin | 2018 | arXiv:1805.04833 |
| Mirostat: A Neural Text Decoding Algorithm | Basu, Ramachandran, Keskar, Varshney | 2020 | arXiv:2007.14966 |
| Phase Transitions in the Output Distribution of LLMs | — | 2024 | arXiv:2405.17088 |
| Turning Up the Heat: Min-p Sampling | — | 2024 | arXiv:2407.01082 |
| The Effect of Sampling Temperature on Problem Solving in LLMs | Renze, Guven | 2024 | arXiv:2402.05201 |
| CTRL: A Conditional Transformer Language Model for Controllable Generation | Keskar, McCann, Varshney, Xiong, Socher | 2019 | arXiv:1909.05858 |
Tutorials and Explanations
| Resource | Author | Link |
|---|---|---|
| The Unreasonable Effectiveness of Recurrent Neural Networks | Andrej Karpathy | karpathy.github.io |
| Generation Configurations: Temperature, Top-k, Top-p | Chip Huyen | huyenchip.com |
| Controllable Neural Text Generation | Lilian Weng | lilianweng.github.io |
| Sampling Parameters Explained: Intuition to Math | Let's Data Science | letsdatascience.com |
| Decoding Strategies in Large Language Models | mlabonne (HuggingFace) | huggingface.co/blog |
Interactive Tools
| Tool | Link | What it does |
|---|---|---|
| Transformer Explainer | poloclub.github.io | Full transformer visualization with live temperature slider |
| LLM Sampling Visualizer | louis-7.github.io | Adjust temperature, top-k, top-p and see distributions change |
| Temperature & Top-k Visualizer | andreban.github.io | Focused demo for temperature and top-k effects |
Key Concepts
The origin of softmax temperature comes from Hinton et al. (2015), who introduced it for knowledge distillation — transferring knowledge from a large model to a small one by "softening" the probability distribution. The same parameter was later adopted for controlling text generation diversity.
Nucleus sampling (top-p) was introduced by Holtzman et al. (2019) as a response to the observation that standard sampling with temperature produces either too-generic (low T) or too-random (high T) text. Their insight: instead of a fixed temperature, truncate the distribution to the smallest set of tokens covering a cumulative probability threshold. This adapts automatically — when the model is confident, fewer tokens are considered; when uncertain, more options remain.
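The adaptive behavior of nucleus sampling is easy to illustrate: the same threshold keeps one token when the model is confident and several when it is not. The distributions below are made up for the example:

```python
def nucleus_set(probs, p=0.9):
    """Smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

confident = [0.92, 0.05, 0.02, 0.01]   # model is sure of the next token
uncertain = [0.30, 0.25, 0.25, 0.20]   # many plausible continuations
print(len(nucleus_set(confident)))  # 1 token survives
print(len(nucleus_set(uncertain)))  # 4 tokens survive
```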
The phase transition is real, not a metaphor. Research by multiple teams (arXiv:2405.17088) using methods from statistical physics has shown that LLM output undergoes genuine phase transitions at specific temperature thresholds — with divergent statistical quantities at the transition points, just like phase transitions in physical systems.
Mirostat (Basu et al., 2020) takes a different approach by targeting a specific perplexity level rather than a fixed distribution shape. It dynamically adjusts the sampling to maintain consistent "surprise" regardless of whether the model is generating a predictable phrase or navigating uncertain territory.
Assessment
Option A — Data collection (individual, 1 page): Run the cliff-finding experiment (Exercise 2) with 5 temperature values and 5 runs each. Present a table and line chart of correctness rate vs temperature. Identify the cliff point for your model.
Option B — Comparative analysis (pairs, 1 page): Test the same prompt across 3 task types (factual, creative, reasoning) at 4 temperatures. For each combination, rate the output quality on a 1-5 scale. Present a heatmap and recommend optimal temperatures per task type.
Option C — Technical explanation (individual, 500 words): Explain to a non-technical product manager why their chatbot should NOT use T=0, despite it being "the safest option." Use specific examples from your experiments.
What's Next
In Module 3: Can AI Lie?, you'll use what you learned about probability distributions to understand confidence vs truth. A model can assign 95% probability to the wrong answer — and you'll discover why through benchmarks with real logprobs.
Module 2 of 8 in the LLMxRay Educators Kit ← Module 1: What Is a Token? | Back to Curriculum | Module 3: Can AI Lie? →