Thermostat

Dynamic Temperature Within a Single LLM Response

TL;DR — I built a loop that lets an LLM ask to change its own temperature mid-response. Not true "intra-call" sampling yet (it stitches across multiple calls), but it works surprisingly well for "brainstorm → spec" style one-shots.

By Jeff Nash·January 19, 2026·12 min read

The Question

The "Thermostat" Experiment

Hey, Jeff here. I've been messing with what is essentially a while loop that simulates dynamic, intra-response temperature changes in an experiment I called Thermostat. Yes, it's a pun. Thermostats change the temperature of the room and my little orchestration loop enables LLMs, with a little elbow grease, to change their own temperature.

If you have a huge one-shot prompt with an ideation phase, a concrete example phase, and a summary phase, you probably would want the ideation phase to be more creative but the example and summary phases to be less so. However, when calling an LLM, each response is generated with a single temperature parameter set by you, the user. This, naturally, raises the question: what if the model itself specified what the optimal temperature was for each section? What if the model could be like Sean Paul, dynamically adjusting its own temperature accordingly as it generated parts of the entire response?

"The way the time cold, I wanna be keepin' you warm.
I got the right temperature fi shelter you from the storm."
— Sean Paul

This idea actually came to me via my wife, who asked me why temperature has to stay the same for an entire response. We were watching yet another video on how LLMs work and they were going over temperature as it related to token prediction likelihood. "Could you change it for different parts?", she asked.

I'd never thought about that. So I built this.

Theory

The Theory: It's Just Math

In many user-friendly guides to AI and 'How LLMs Work' videos, temperature is colloquially referred to as the "creativity" or "randomness" level of the response. From a mathematical perspective, however, it's really just a probability distribution.

When a model's temperature is high, the distribution for the possibilities for each token flattens: more tokens stay in play and are closer in probability to be selected as the next token. When the temperature is low, the distribution becomes more spikey: a few tokens remain extremely likely to be picked, while the likelihood of the others are reduced dramatically. It's important to note that, regardless of temperature, the model still picks exactly one next token at a time and only one token is ultimately selected. Temperature only changes how that choice is distributed among possibilities.

The Autocomplete Metaphor (Beaten to Death Edition)

If you want a less mystical (and overly trite) metaphor: temperature is like a sampling setting on a very fancy autocomplete.

Think of the three predictions for the next word that show up on your iPhone keyboard as you're writing a text.

Low Temperature

We at the

office

company

team

↓ refresh ↓

office

company

team

These three words would consistently be the same, since they are, statistically speaking, much more likely than all other words.

High Temperature

We at the

Motel

Hotel

Holiday Inn

↓ refresh ↓

Ritz

front

beach

This difference is less pronounced — many words are likely enough to make the top 3. So many, in fact, that you might get a different batch every time.

But regardless of temperature, you still only pick one word.

Do You Believe in Magic, in an LLM's Heart?

As it stands, temperature is yet another AI-related concept that is inappropriately anthropomorphized, much to the chagrin of those who beg people to stop treating AI like a sentient friend. And yes, I realize that with explanations like these, I am not helping. So, to maintain my tech credibility, I must emphasize: none of this is magic and turning up the temperature isn't giving your model an 'imagination'. As always, it's just math, all the way down. However, that math does loosely map onto what people might view as "creativity": high temperature increases the chance you sample a less-obvious (but still plausible) continuation instead of always taking the safest or most obvious one.

The Temperature Spectrum

< 0.5

The distribution peaks sharply. The highest logit dominates, leading to deterministic output.

> 1.0

The distribution flattens. Lower-probability tokens gain statistical significance, increasing entropy.

Technical Aside

"Creativity" is what users see. Under the hood, temperature is a divisor inside the softmax that normalizes logits into probabilities[1]:

P(token_i) =exp(logit_i / T)Σexp(logit_j / T)

Where T is temperature and the sum is over all tokens in the vocabulary

Low T (< 0.5): The distribution peaks sharply. The highest logit dominates, leading to deterministic output[2].

High T (> 1.0): The distribution flattens. Lower-probability tokens gain statistical significance, increasing entropy[2].

Softmax & Sampling

T=0.1

T=0.5

T=1.0

T=1.5

Next token for:

The cat sat on the___

Waiting for sample...

0Almost deterministic1

mat

floor

mat

86.6%

floor

11.7%

couch

1.6%

bed

0.1%

table

0.0%

roof

0.0%

piano

0.0%

moon

0.0%

Adjusting distribution...

Sampling = random point in [0,1] hits a segment

That's why higher temperatures feel 'creative': you're giving unlikely tokens a real chance instead of crushing them with the otherwise-unstoppable force of high predictability.

So the question is: can a model change its own temperature mid-response? Could it suddenly self-reflect and decide, "you know what, I am not giving myself enough options in predicting this stuff" or "I'm being too loosy goosey with my words" and adjust accordingly. And yes, I realize models don't 'decide' anything; such a mechanism would, yet again, be the model probabilistically determining to predict some token that corresponds, in our world, to the model's temperature being adjusted.

Design Decisions

Why Stitching?

As far as I know, the way most inference loops are written does not allow for this. So, if you can't do it inside the model... where can you do it? I've already established that temperature is a parameter that amounts to controlling the math that shapes the probability distribution for an entire response, so I'll have to be a bit more creative to answer that.

There are a few facts we must know to even begin to intuit such a solution. Number one: with the OpenAI completions API, you can interrupt a model mid-response, send it another message, and have the model retain the context of what it said so far. Number two: each time you call a model, you can specify any temperature, even if prior tokens in the context were generated at a different temperature. If you put these two facts together, perhaps you see the direction my thinking is heading. When imagining how such a system would work, two approaches immediately sprung to mind.

Top-Down (Multi-Phase)

The model first creates an outline, breaking the response into sections with corresponding temperatures. Then we call the model once per section sequentially.

Step 1: Plan

1.BrainstormingT=1.5

2.Selection of IdeaT=0.8

3.DocumentationT=0.3

4.Edge CasesT=1.25

5.SummaryT=0.25

Step 2: Execute

Invoke each sub-step with context from former sections

Note: Temperatures and sections are pre-defined before generation, which may inadvertently stifle creativity or variability of output.

Stitching (On-the-Fly)

Simulate a true "one-shot" response with a mechanism for the model to halt generation when it deems a temperature change appropriate. Truly on-the-fly.

Step 1: Generate with interrupts

T=1.2Creative brainstorming...

🔧 request_sampling_update(0.3)

T=0.3Precise implementation...

🔧 request_sampling_update(1.0)

T=1.0Exploring alternatives...

Step 2: Stitch

Orchestrator seamlessly stitches segments into one continuous response

Advantage: The model can react to its own output and change direction organically as the response evolves.

•Top-down: Structure first, then fill in...think writing an outline before the essay

•Stitching: Discover structure as you go, akin to Bill O'Reilly's WE'LL DO IT LIVE

Thermostat uses the stitching approach. The top-down method has its merits and is, arguably, the true 'steering'-based approach (see my article). It's predictable, easier to debug, and guarantees coverage of planned sections, all of which are invaluable for long workflows. However, it front-loads decisions that might be better made in context and, most importantly, doesn't achieve my goal of simulating a true "one-shot" response with mid-stream temperature changes.

With the approach decided, let's look at how to actually build it.

Implementation

Under the Hood

The Stitching Trick

Public APIs don't give you true intra-call temperature changes. You'd have to tweak sampling mid-inference, and providers just don't expose that, as it's not likely something they could even do from their end. So I cheat: Thermostat stitches multiple calls (with different temperatures) into one output. Through a rigorous system prompt and by including the full response pre-temperature change, I attempt to preserve continuity between calls.

The Stitching Trick

What we want (doesn't exist):

T=1.2→Creativebrainstormideas...↓ T=0.2T=0.2→Preciseimplementation...

Single API call with mid-stream temperature change

Thermostat simulates this with

What Thermostat does:

API Call #1T=1.2

Creative brainstorm ideas...[tool_call: request_sampling_update(0.2)]

✂️ stitch

API Call #2T=0.2

[context: "Creative brainstorm ideas..."]Precise implementation...

User sees seamless output:

Creative brainstorm ideas... Precise implementation...

1. The Tool Definitions & Schema Hacking

Temperature is an API parameter. You can't tell the model to "stop and change temperature", because the knob to do so lives outside the model, at the API level. Instead, the system prompt tells the model: "Use the request_sampling_update tool when you want a different temperature."

I also need a finish_response tool, because relying on implicit stop tokens in a stitched stream is a recipe for bugs.

From src/thermostat/tools.py:

def generate_openai_tools() -> List[Dict[str, Any]]:
    """Generate OpenAI function-calling tool schemas."""

    return [
        {
            "type": "function",
            "function": {
                "name": "request_sampling_update",
                "description": (
                    "Adjust sampling temperature when your writing "
                    "phase changes. Raise for creativity/brainstorming, "
                    "lower for precision/code/facts. "
                    "When in doubt, make the change."
                ),
                "parameters": update_schema,
            },
        },
        {
            "type": "function",
            "function": {
                "name": "finish_response",
                "description": "Signal that the response is complete.",
                "parameters": finish_schema,
            },
        },
    ]

2. Why Tool Calls Are Perfect for This

In my system prompt, I had two options for telling the Thermostat loop to change temperature:

In the system prompt, instruct the model to emit some sort of special token (e.g. __CHANGE_TEMP_TO_X.X__) in its regular output. In my code, this token would be interpreted by the loop as a temperature change.
Create a synthetic tool definition, which the system prompt instructs the LLM to invoke, that changes the temperature.

Option 1

Special Token Approach

LLM output stream:

Theansweris__TEMP_0.2__42.The

Tokens generated after marker = wasted

Scan every token for pattern

Regex parse (may fail at high T)

Race to terminate stream

Discard any over-generated tokens

Option 2 ✓

Tool Call Approach

LLM output stream:

Theansweris

Tool Call

request_sampling_update(0.2)

Clean boundary, no wasted tokens

Model emits tool call naturally

API returns structured JSON

Parse validated schema

✓

Apply new temperature, continue

The token approach has some glaring issues, but as with all good design docs, it is useful to articulate them even if I knew I'd go the other route:

If the code relied on the response itself to have my control token, it would have to constantly scan the LLM response stream to catch it mid-stream and quickly terminate the LLM, potentially after further tokens were (incorrectly) generated.
There is an inherent messiness in ensuring the code properly detects and parses the token, which is increasingly likely to be malformed...especially at higher temperatures. Tool calls, on the other hand, are much less likely to be garbled or malformed, even at higher temperatures.

These issues alone made the tool call approach the natural choice, but tool calls also offer benefits that extend well beyond avoiding the token approach's flaws and align with how models actually work in the agentic age.

3. Tool Calls Are Just Tokens, But With A Learned Behavior Advantage

At the model level, tool calls are just tokens[5]. The model predicts tokens like <|tool_call|> or {"name": "function_name" the same way it predicts any other token, through the same softmax over the vocabulary. There is no special "higher-level" decision happening inside the model's forward pass that affords tool calls some 'special status'.

However, models are trained (via RLHF, instruction tuning, etc.) to recognize when tool calls are appropriate, so they've learned patterns that make tool call tokens more likely in certain contexts. This is learned behavior, not architectural privilege, but as models have become 'better' at tool calling, this behavior has become more pronounced. This is especially apparent when you compare models that are 'good' at tool calls (they are invoked at proper times, have official API support rather than being intertwined with output, are rarely malformed) with ones that are 'bad' at tool calls (e.g. Gemini 2.x series). When the context matches patterns the model saw during training—like transitioning from open-ended discussion to a concrete task—tool call tokens become more probable because those patterns frequently preceded tool calls in training data. Importantly, this learned behavior is strong enough that tool calls still reliably fire even at higher temperatures, where the flatter distribution might otherwise let alternative tokens compete.

Learned Behavior, Not Architecture

Same model, same architecture. Context changes the probabilities.

Expand

Context: Mid-brainstorm

"Here are some creative approaches: 1) We could try..."

Next token probabilities:

"using"

45%

"a"

30%

"implementing"

15%

🔧 tool call

Model "wants" to keep brainstorming

Context: Phase shift detected

"...those are the brainstormed ideas. Now let me write the technical spec:"

Next token probabilities:

🔧 tool call

52%

"##"

25%

"The"

15%

"Here"

Training taught: "precision phase → lower temp helps, and it's worth noting becuase we have a tool for that"

RLHF and instruction tuning taught the model when tool calls tend to be useful. The probability shift is just pattern matching on millions of examples where "time to implement" preceded tool use.

The "higher-level" logic is external: the inference loop and orchestrator code treats tool calls specially, intercepting them, parsing them, executing them, and modifying state before resuming generation. While this distinction doesn't exist in the model proper, tool calls occupy a privileged position in the inference ecosystem, acting as action boundaries where generation pauses and external logic takes over. This makes them a natural mechanism for signaling "I want to change my behavior" rather than "here's more content."

There's also a practical benefit baked into how these APIs work: streaming automatically stops when a tool call is emitted[3]. After a tool call, the API pauses, returns the structured tool call data, and waits instead of generating more tokens. This is exactly what I need: a clean boundary where I can inspect the request, apply the new temperature, and resume generation. No racing to catch tokens mid-stream and having to discard over-generated content. The tool call "token", even though it's just a token to the model, is afforded special treatment by the infrastructure[4] that makes it perfect for this task.

The Inference Ecosystem

Tool calls get special treatment from infrastructure, not the model itself.

Expand

ModelAll tokens are equal

softmax over vocabulary → next token

theansweris<tool_call>however...

To the model, <tool_call> is just another token with a probability

token emitted

OrchestratorTool calls get special treatment

regular token

→ Append to output buffer

→ Continue generation

<tool_call>

→ Pause generation

→ Parse structured JSON

→ Execute external logic

→ Resume with new params

The "privilege" of tool calls isn't in the model, it's in the infrastructure. By framing temperature changes as tool calls, we create natural action boundaries where the orchestrator can intercept, validate, and apply changes before resuming generation.

4. Temperature Affects Its Own Adjustment Probability

Special API treatment aside, this still means the decision to change temperature emerges from the same attention mechanism that decides everything else. When the model senses a phase shift (from creative brainstorming to technical implementation, or from factual recitation to imaginative exploration), that shift naturally surfaces as an increased probability of the tool call.

The current temperature influences the likelihood of a temperature change. There's no way to isolate "temperature for deciding to change temperature" from "temperature for everything else." It's all-or-nothing. For it to be anything else would defeat the whole point of Thermostat: letting the model decide when and how much to adjust its own sampling.

Higher temperature → more randomness → more likely to call the tool. Lower temperature → more deterministic → less likely to call the tool. A feedback loop where the current temperature setting influences the model's propensity to adjust that very setting.

The Feedback Loop

High T → more changes, Low T → fewer changes. A recursive property.

Expand

T = 1.4High

Next token probabilities:

"continue"

35%

"the"

25%

🔧 tool call

22%

"however"

18%

Flatter distribution → tool call more likely to win

T = 0.2Low

Next token probabilities:

"continue"

78%

"the"

12%

🔧 tool call

"however"

Peaked distribution → top token dominates

High TMore tool callsMore T changes

Low TFewer tool callsStays stable

The temperature affects the very decision to change itself. A recursive property with no escape hatch.

Tool call probability isn't purely random. It's contextually informed by the learned behavior we discussed above. Even at lower temperatures, if the context strongly suggests a tool call is appropriate (e.g., a clear phase shift from creative brainstorming to technical specification), that learned behavior can still push the tool call probability high enough to win the softmax. But a sufficiently low temperature can "lock in" the model to continuing generation even when a change would be appropriate. This is a limitation inherent to the way LLMs work, not a bug to be fixed.

Tangent: Why Not a Supervisor Model?

A separate model at fixed temperature could monitor and adjust, but that reintroduces the breakpoint problem.

Expand

What if I decoupled “deciding the temperature” from “living with it”? Instead of having the model decide its own temperature changes (with all the feedback loop issues that entails), I could use a separate supervisor model at a fixed temperature to watch the output stream and decide when to interrupt and adjust. The fixed temperature ensures consistency with the likelihood of a temperature change regardless of the current temperature of the output being generated.

Architecture

Main Model

T=variable

output stream

Supervisor

T=0.3 (fixed)

interrupt + set T

Trade-offs

✓Evaluation temperature is stable, no feedback loop

✓Can use a smaller/cheaper model as supervisor

✓External observer may catch shifts the main model misses

✗2x inference cost (running two models in parallel)

✗Latency: supervisor must process before deciding

✗Supervisor lacks the main model's internal state

The supervisor sees the same output but makes decisions from a consistent vantage point, not subject to the "high T makes me want to change more, low T makes me want to change less" problem. But it also lacks the main model's intent and can only react to what's already been generated without any insight into the process of selecting the next token. This trades self-awareness for stability.

Important caveat: the dual-model setup brings back the exact problems tool calls helped me avoid. Without a tool call to create a natural pause, the supervisor must interrupt the main model mid-generation. The main model may have already generated tokens past the ideal breakpoint before the supervisor can react, bringing me back to the same "generation past break point" problem, just with more infrastructure complexity.

Another approach I want to try: what if I called an external model (either the same model with fresh context, or a different model entirely) to validate temperature changes as the main model proposes them? This is tantamount to putting an LLM call in the body of the synthetic request_sampling_update tool call. I would simply give the other model the context so far, the generating model's justification for why it wants to change temperature, and have the external model either confirm, veto, or suggest a different value. There is no shortage of ways to make this more interesting (and expensive).

5. Requiring Justification for Temperature Changes

On the high temperature end of the spectrum, I can somewhat mitigate this issue by forcing the model to ground its decision to change the temperature. In order to change the temperature, the model must provide a justification. The request_sampling_update schema includes a required reasoning field. This isn't just for logging (though it is very helpful for this purpose and provides insight). To populate that field, the model must generate a coherent explanation for why the temperature should change. That generation process grounds the decision. Adding this field doesn't completely eliminate overeager temperature changes at higher temperatures, and it doesn't normalize the likelihood of changes across different settings. If the 'likelihood' of a temperature change at temperature 1.0 is considered to be the 'optimal' likelihood, this grounding requirement certainly doesn't make it more likely for a low-temperature model to increase its temperature.

The reasoning field in the schema:

update_schema = {
    "type": "object",
    "properties": {
        "reasoning": {
            "type": "string",
            "description": (
                "Brief explanation of why a temperature change "
                "is needed at this point in the response."
            ),
        },
        "new_temperature": {
            "type": "number",
            "minimum": 0.0,
            "maximum": 2.0,
        },
        # ... other params
    },
    "required": ["reasoning", "new_temperature"],
}

The reasoning requirement helps ground the LLM. With this requirement in place, it's harder for the model to decide to change temperature on a whim, as it has to justify it with a cogent reason. What's more, those justifications are genuinely useful for understanding what the model thinks it's doing. The model requesting a temperature change because it is "Transitioning to precise technical specification" tells you it understood the task structure.

A high-temperature model might want to change settings on a whim, but having to explain itself acts as a soft filter. The model has to commit to a rationale, and bad rationales tend to be harder to generate even when you're sampling creatively. It's the difference between "I feel like being more creative" (rejected by its own incoherence) and "The user asked for brainstorming alternatives, which benefits from higher variance" (coherent, actionable). I can further leverage this by adding a code-level filter to reject obviously bad reasoning that still slips through.

The Reasoning Filter

Whimsical changes are hard to justify coherently.

Expand

At high temperature, model might try:

reasoning: "I feel like being more creative now"

Vague, subjective, no grounding in task requirements

reasoning: "Temperature adjustment requested"

Circular, doesn't explain why

Grounded reasoning survives:

reasoning: "User asked for brainstorming alternatives, which benefits from higher variance"

References user request, explains mechanism

reasoning: "Transitioning to code implementation. Need deterministic output for correct syntax"

Identifies phase shift, justifies direction

The trick: Generating a coherent justification for something that has no valid justification is hard, even at high temperatures. The reasoning field acts as a soft filter, not by validating the reason, but by making whimsical changes harder to articulate.

6. Validation & Anti-Thrashing

Even with the reasoning field, I don't just blindly apply temperature changes. Models can still get twitchy, especially at high temperatures, and might try to change the temperature every 5 tokens if given the chance. And by might, I mean they absolutely will, in the case of GLM.

To prevent this inevitability, I implemented a strict AntiThrashingValidator in src/thermostat/validators.py that enforces a hard cap on thrashing and a minimum number of tokens generated between changes:

From src/thermostat/validators.py:

def validate_and_apply_tool_call(
    tool_call: RequestSamplingUpdate,
    current_params: SamplingParameters,
    buffer: OutputBuffer,
    config: Any,
) -> Tuple[SamplingParameters, Dict[str, Any]]:
    
    # Resolve relative updates (e.g., if the model 
    # requests "Creative" preset)
    resolved = resolve_sampling_params(tool_call, current_params)

    # Apply policy clamps: min_chars_between_changes, 
    # max_temp_delta, etc.
    clamped_params, warnings = AntiThrashingValidator.apply_policy_clamps(
        tool_call, 
        current_params, 
        buffer, 
        config
    )

    return clamped_params, {"warnings": warnings}

What Thrashing Looks Like

Without validation, models change temperature every few tokens.

Expand

WithoutValidator

token 1

1.6

token 5

0.3

token 8

1.8

token 12

0.2

token 15

1.4

5 API calls in 15 tokens = chaos

WithValidator

token 1

1.6

token 5

0.3→ denied

token 50

1.0(clamped)

token 55

0.2→ denied

token 120

0.4

2 changes in 120 tokens = stable

Policy enforced: min 100 chars between changes • max Δ0.7 per change • min 40 tokens between changes

What happens when a change is rejected?

When the validator denies a temperature change request, I don't just silently ignore it. The orchestrator calls the model again with the tool result explaining what happened:

// Tool result returned to model (from orchestrator.py):

{
  "applied_temperature": 0.7,  // clamped from requested 0.2
  "applied_top_p": 0.9,
  "note": "Applied requested update. Continue from OUTPUT_SO_FAR end.",
  "warnings": [
    "min_chars_between_changes: Not enough characters since last change",
    "max_temp_delta_per_change: temperature was clamped to 0.7 from 0.2"
  ]
}

The model sees these warnings in the tool result and, hopefully, can adjust its behavior. If the temperature is clamped rather than outright rejected, the model gets some of what it asked for, but within policy bounds. If it's rejected entirely, the model gets a clear message that it needs to wait longer and/or have a better reason.

If the model continues to thrash, the orchestrator takes a more aggressive step: yanking the tool from the array entirely. GLM was particularly bad about this. At higher temperatures, it really seemed to like the ability to change its temperature. It would call the tool three times in a row without generating any text. Telling it "no" wasn't enough: in these cases, I had to remove the tool so it wouldn't even know that was a possibility.

From src/thermostat/orchestrator.py:

# Disable tools if we haven't produced enough text since the last temp change
# This prevents thrashing where the model keeps changing temp without writing
chars_since_change = self.buffer.chars_since_last_change()
disable_tools = (
    self.buffer.temperature_changes > 0 
    and chars_since_change < self.config.min_chars_between_changes
)

stream = self.provider.generate_segment(
    ...
    disable_tools=disable_tools,  # Tools literally not available
)

This two-tier approach (first explaining the denial, then removing the tool) ensures the model gets feedback before I resort to enforcement. Most well-behaved models respond to the explanation at typical temperatures; the tool removal is a safety net for edge cases or at very high temperatures.

7. The Orchestrator Loop

Finally, the Orchestrator runs the show. It takes the new temperature, inserts that into the API call, feeds the model the full context plus a continuation marker, and tells it to resume exactly there.

The Stitching Loop

One response, multiple API calls, seamlessly joined

API Call #1

0.95

Streaming

Loop

Exit

↻ Loop

request_sampling_update

temp:0.1

reason:"Transitioning to precise technical specification - data model and API endpoints require exact, structured output"

Accumulated Output

Waiting for generation...

How it works: The model streams text until it calls request_sampling_update with a reason. The orchestrator applies the new temperature and continues. The segments are stitched into one seamless response.

From src/thermostat/orchestrator.py:

async def generate(self) -> AsyncGenerator[StreamChunk, None]:
    while not self.state.is_finished:
        # Call provider with CURRENT sampling params.
        # Note: We pass sampling_params dynamically on every iteration.
        stream = self.provider.generate_segment(
            user_prompt=self.user_prompt,
            output_buffer=self.buffer,
            sampling_params=self.state.sampling_params, 
            system_prompt=self.system_prompt,
        )

        accumulated_text, tool_calls = await self._accumulate_segment_from_stream(stream)

        # Handle tool calls
        if tool_call := (tool_calls[0][0] if tool_calls else None):
            
            if isinstance(tool_call, RequestSamplingUpdate):
                self._handle_sampling_update(tool_call, self.state)
                # We update state, but do NOT break the loop. 
                # The next iteration will pick up the new params.

            elif isinstance(tool_call, FinishResponse):
                self._handle_finish_response(tool_call, self.state)
                # This sets self.state.is_finished = True, 
                # breaking the while loop.

The orchestrator also handles content validation to ensure the append-only paradigm holds. Since I'm stitching multiple API calls together, the model might accidentally repeat earlier output, attempt to rewrite what it said before, or leak chain-of-thought reasoning (e.g., "the user asked for..." instead of just continuing). The AppendSafetyValidator catches these:

The real nightmare was high temperature. At low temperature, the model follows the "continue exactly from here" instruction perfectly. At high temperature, it was as if I didn't give any instructions whatsoever.

It would add "Continuing from where I left off..." even though I explicitly told it not to. It would restart sentences. It would summarize what it just said instead of continuing. I had to build overlap detection, repeat detection, and a repair mode that calls the model again at low temperature with an explanation of what it did wrong.

•Repeat detection: Catches when the model copies large chunks of its prior output verbatim
•Overlap trimming: Removes redundant text when the model restarts from slightly before where it left off
•Rewrite phrase detection: Catches phrases like "let me restate" or "to clarify what I said"
•CoT leakage detection: Catches when the model outputs reasoning ("the user asked for...") instead of content

When validation fails, the orchestrator goes into repair mode: it calls the model again at low temperature (T=0.2) with an explicit instruction explaining the violation and asking it to continue correctly. If repair attempts are exhausted, the response terminates with an error.

The "Call Boundaries" Logic

I've noticed a lot of software developers who are very experienced but treat LLMs as a black box often say, "When I'm doing ideation I like talking in the ChatGPT or Claude web interface, and then implementing in Cursor or Copilot."

What they're actually benefiting from is the higher temperature in those chat interfaces versus the precision that comes with an agentic coding tool, which probably sets its model to a much lower temperature in its API calls. They've accidentally discovered dynamic temperature by switching apps. Thermostat tries to make that explicit within a single response.

The Workflow Developers Use Today

ChatGPT for ideation, Cursor for implementation—Thermostat unifies this.

Expand

Typical developer workflow:

💬

ChatGPT / Claude

"Brainstorm approaches for..."

T ≈ 0.7-1.0Creative, exploratory

⌨️

Cursor / Copilot

"Implement this function..."

T ≈ 0.0-0.3Precise, deterministic

Context lost between tools • Manual copy-paste • Two separate sessions

With Thermostat:

🌡️

Single Thermostat Session

T=0.9 brainstormT=0.5 outlineT=0.1 implementT=0.8 alternatives

✓ Context preserved • ✓ Model decides boundaries • ✓ Seamless transitions

A Literary Example:

Imagine you want the craziest possible story: the craziest person alive runs for president, wins, and gets sworn in while the country panics. Maybe it's an alien, maybe it's Vermin Supreme, maybe it's a baby. In this case, you would likely want the temperature extremely high to get zany ideas about who is being sworn in.

However, when this zany character is actually raising their right hand and taking the oath of office though, you want the actual oath of office ("I do solemnly swear to execute the duties faithfully...") verbatim to provide contrast between the seriousness of the role and the insanity of who is being sworn into that role. You would want to lower the temperature for that so it is precise and accurate to the actual text of the oath, which the model has seen many times during training.

It's a silly example, but it illustrates when you might want a lower temperature versus a higher one within the same generated response.

TL;DR

Thermostat uses tool calls as “action boundaries” where the model can ask to change temperature. The orchestrator intercepts these calls, validates them (anti-thrashing policies prevent abuse), and stitches together multiple API calls to simulate dynamic temperature. The model must provide reasoning for each change, which acts as a soft filter against whimsical requests. The cost is no KV cache reuse (each temperature change = new API call with full context). The benefit is model-driven creativity/precision boundaries within a single logical response.

Known Limitations

•No KV cache reuse: Each temperature change triggers a new API call with full context. This is expensive for responses with many changes.
•Feedback loop: High temperatures increase tool call likelihood (more changes), while low temperatures decrease it (fewer changes). In a cruel twist of fate, a model's current temperature affects the probability that this temperature will be modified in the future.
•Reasoning field is a soft filter: While it helps ground decisions, it doesn't fully normalize change frequency across temperatures or prevent all whimsical requests.
•Simulation, not true intra-call: This is stitching multiple calls together, not modifying the inference loop midstream. True intra-call support would require provider-level changes.

Deep Dive

The KV Catch

Now that you've seen how the sausage is made within the orchestration loop, let's talk about the bill: KV cache.

One reason inference got cheaper: KV cache cuts repeat compute[6]. You can see this in action in the way many LLM providers price their models, where you get a much cheaper price per token when there is a cache hit.

This stitching approach means you don't reuse KV cache between segments. Every temperature change is a new call, and each call reprocesses the full accumulated context, which gets bigger at each intra-response temperature change. What's more, in my experiments, asking for a different temperature precludes any sort of provider-side caching whatsoever.

The Hard Truth

In this stitching approach, each segment is a new API request. In current implementations, providers generally can't reuse KV cache across separate calls with different sampling parameters. You repay the prefill cost every time.

The Cost Breakdown

With 3 temperature changes and 2000 tokens of context, you're paying for:

Call 1:System prompt + initial generation
Call 2:System prompt + ~500 tokens context + generation
Call 3:System prompt + ~1200 tokens context + generation
Call 4:System prompt + ~1800 tokens context + final generation

That's a lot of redundant input token processing. By definition, if you're calling with a new temperature, you're not going to get a cache hit. The provider treats it as an entirely new call with context you pre-filled in.

If this were something people actually wanted to use at scale, things would have to change on the provider end. But for experimenting, the cheap inference providers make it a non-factor. I used GLM and Chutes AI, which is $10/month for 1000 calls per day (frankly, this is a steal, even if some of the models are quantized). You can spam it without financial penalty while you figure out if this even works.

The Future: True Intra-Call Support

What would be actually interesting is if inference providers supported temperature changes mid-stream. Imagine an API that looked like:

response = llm.generate_with_dynamic_temp(
    messages=[...],
    initial_temperature=0.7,
    temp_change_token="[TEMP:",
    temp_end_token="]"
)

The provider would handle the temp changes internally without breaking the KV cache. The model generates[TEMP:0.3], the sampling parameters update, and generation continues seamlessly from the cached attention states.

Inference providers are optimizing for throughput, not flexibility. Adding mid-stream parameter changes would complicate batching strategies and require careful handling of speculative decoding. But hey, a guy can dream.

Enough theory...let's see what this actually looks like in practice.

In Practice

Here's what Thermostat actually produces. I generated a bunch of situations to see if the model (GLM-4.7) would, number one, take the hint, and number two, actually produce better output.

Here are the actual raw results from the experiments. Click through the phases to see how temperature shifts change the output character.

Prompt: "Brainstorm 10 unconventional ideas for a mobile app that helps people sleep better. Then pick your favorite idea and write a precise technical specification with: Core features (bullet list), Data model (show exact field names and types), API endpoints (method, path, request/response)."

0.70 → 0.95

"Starting creative brainstorming phase - need high creativity for unconventional app ideas"

Temp:

0.95

# 10 Unconventional Sleep Aid Concepts

1. Murder, She Slept
An interactive audio drama where the listener plays the victim
of a crime. To survive the "attack," they must keep their
breathing and heart rate extremely low. If biometric data
spikes, the "killer" gets closer.

2. Lucid Loot Box
An app that trains you for lucid dreaming by playing a specific
sound cue only during REM sleep. If you perform a "reality check"
upon hearing the cue (tracked via smart ring movement), you
unlock digital currency. If you fail, you lose in-game items.

3. The Bedbug
A virtual pet that lives on your mattress. It thrives on "sleep
crumbs"—the minutes you spend tossing and turning. The worse
you sleep, the fatter and happier it gets. To starve it, you
must achieve perfect sleep inertia.

4. Somnambulist's Risk
Uses a phone camera pointed at the bedroom floor. If it detects
the user sleepwalking, it triggers an immediate, jarring "Game
Over" screen with a loud noise to wake them up, forcing them
to "retry" the level (going back to bed).

5. Social Stock Market
A trading platform where the commodity is your friends' sleep
quality. Users can "Short" a friend who just had a baby or
"Buy" a friend who is on vacation. Real money is at stake,
verified by wearable data.

6. Nightmare Difficulty Selector
Allows you to choose the "theme" of your dream via subliminal
audio suggestions while you drift off (e.g., "Flying,"
"Underwater," "Anxiety"). In the morning, you rate the
accuracy to train the AI.

7. The Insomniac's Exorcism
Treats insomnia as a possession. The app features a gruff,
tired demonologist who guides you through banishment rituals
(progressive muscle relaxation) while arguing with the
"spirit" keeping you awake.

8. Sleep Paralysis Poker
You play poker against AI opponents. The "currency" is time.
You can only bet time you spent in deep sleep the previous
night. If you lose, your alarm goes off 30 minutes earlier.

9. Mystery Science Theater: REM
The app listens to you sleep talk (if you do). It records
snippets and uses AI to insert them into a pre-generated
comedic podcast that you can listen to over breakfast.

10. Graveyard Shift Gardener
An AR game where your bedroom becomes a garden. You can only
water plants when your phone detects total darkness and
silence. Weeds grow if you check your phone in the middle
of the night.

─────────────────────────────────────────────────────────────────

Selected Specification: Concept #1 "Murder, She Slept"

This app gamifies biofeedback by turning relaxation into a
survival mechanism. It utilizes high-fidelity audio and wearable
integration to create a sense of dread that can only be
alleviated by sleep.

Core Features:
• Biometric Survival Engine
Real-time monitoring of Heart Rate (HR) and Heart Rate
Variability (HRV) via Apple Health / Google Fit Connect.
The "threat level" in the story is inversely proportional
to the user's calmness.

• Adaptive Binaural Audio
A procedural audio engine that generates 3D spatial sound
(footsteps, breathing, floorboard creaks) based on the
user's movement and heart rate. Sounds move closer as
agitation increases.

• The "Stalker" AI
A logic system that tracks sleep cycles. It waits for the
user to enter Light Sleep before initiating "scare events"
designed to wake the user slightly if they are sleeping
too restlessly, or guide them into deep sleep if calm.

• Morning Debrief
Upon waking, the user receives a "Survival Report" showing
a graph of their heart rate overlaid with the narrative
events of the night (e.g., "At 3:00 AM, the killer entered
the room while your HR was 58bpm. Good job hiding.").

• Difficulty Tiers
Ranging from "Stalker" (Standard) to "Nemesis" (Requires
HR < 50 BPM to survive the night).

Prompt: "Design a simple REST API for a todo list app. Include: 5 endpoints (method + path + brief description), Example request/response for each. Then perform a security audit of your own API: Identify 3+ potential vulnerabilities, Classify each by OWASP category, Propose specific mitigations."

Starting at 0.70

"Balanced temperature for API schema design"

Temp:

0.70

REST API DESIGN: Todo List Application
══════════════════════════════════════════════════════════════

Base URL: https://api.example.com/v1

────────────────────────────────────────────────────────────────
ENDPOINT 1: Create Todo Item
────────────────────────────────────────────────────────────────
POST /todos

Creates a new todo item for the authenticated user.

Request:
  POST /v1/todos HTTP/1.1
  Host: api.example.com
  Authorization: Bearer eyJhbGciOiJIUzI1NiIs...
  Content-Type: application/json

  {
    "title": "Complete project documentation",
    "description": "Write API docs and user guide",
    "dueDate": "2024-12-31T23:59:59Z",
    "priority": "high"
  }

Response: 201 Created
  {
    "id": "12345",
    "title": "Complete project documentation",
    "description": "Write API docs and user guide",
    "dueDate": "2024-12-31T23:59:59Z",
    "priority": "high",
    "completed": false,
    "createdAt": "2024-01-15T10:30:00Z",
    "updatedAt": "2024-01-15T10:30:00Z"
  }

────────────────────────────────────────────────────────────────
ENDPOINT 2: List All Todos
────────────────────────────────────────────────────────────────
GET /todos

Retrieves all todo items for the authenticated user with 
optional filtering.

Request:
  GET /v1/todos?status=pending&priority=high&limit=10 HTTP/1.1
  Host: api.example.com
  Authorization: Bearer eyJhbGciOiJIUzI1NiIs...

Response: 200 OK
  {
    "todos": [
      {
        "id": "12345",
        "title": "Complete project documentation",
        "priority": "high",
        "completed": false,
        "createdAt": "2024-01-15T10:30:00Z"
      }
    ],
    "pagination": {
      "total": 2,
      "limit": 10,
      "offset": 0
    }
  }

────────────────────────────────────────────────────────────────
ENDPOINT 3: Get Single Todo
────────────────────────────────────────────────────────────────
GET /todos/{id}

Retrieves a specific todo item by ID.

Request:
  GET /v1/todos/12345 HTTP/1.1
  Authorization: Bearer eyJhbGciOiJIUzI1NiIs...

Response: 200 OK
  {
    "id": "12345",
    "title": "Complete project documentation",
    "description": "Write API docs and user guide",
    "dueDate": "2024-12-31T23:59:59Z",
    "priority": "high",
    "completed": false
  }

────────────────────────────────────────────────────────────────
ENDPOINT 4: Update Todo
────────────────────────────────────────────────────────────────
PATCH /todos/{id}

Partially updates a todo item (only provided fields updated).

Request:
  PATCH /v1/todos/12345 HTTP/1.1
  Authorization: Bearer eyJhbGciOiJIUzI1NiIs...
  Content-Type: application/json

  {
    "completed": true,
    "priority": "medium"
  }

Response: 200 OK
  {
    "id": "12345",
    "completed": true,
    "completedAt": "2024-01-16T14:22:00Z",
    "updatedAt": "2024-01-16T14:22:00Z"
  }

────────────────────────────────────────────────────────────────
ENDPOINT 5: Delete Todo
────────────────────────────────────────────────────────────────
DELETE /todos/{id}

Permanently deletes a todo item.

Request:
  DELETE /v1/todos/12345 HTTP/1.1
  Authorization: Bearer eyJhbGciOiJIUzI1NiIs...

Response: 204 No Content

Prompt: "Write a design document for a URL shortener service with these sections: 1. Problem Statement, 2. Brainstorm - List 5 different high-level approaches (be creative), 3. Proposed Solution - Pick one and describe the architecture precisely, 4. Data Model - Exact schema with field names and types, 5. Edge Cases - List 5 edge cases and how you'd handle each, 6. Open Questions"

0.70 → 0.90

"Brainstorming multiple creative approaches for URL shortener architectures"

Temp:

0.90

URL SHORTENER SERVICE DESIGN DOCUMENT
══════════════════════════════════════════════════════════════

1. PROBLEM STATEMENT
────────────────────────────────────────────────────────────────
We need to build a scalable URL shortener service
(e.g., https://shr.lt/abc12) that allows users to convert
long, unwieldy URLs into short, shareable links.

Primary users: social media managers, marketing teams, and
general users who need to share links in character-constrained
environments or simply want cleaner, trackable URLs.

The service must generate unique short codes, handle high
read traffic with low latency, and potentially offer analytics
on click-through rates.

2. BRAINSTORMING HIGH-LEVEL APPROACHES
────────────────────────────────────────────────────────────────

Approach 1: Pure Database Counter
Use a centralized SQL database with an auto-incrementing
integer ID. Convert this ID (Base62) to generate the short
code. Simple and guarantees uniqueness, but the database
is a hard write bottleneck.

Approach 2: Pre-generated Token Pool
A background worker continuously generates random short
strings and inserts them into an "Available Tokens" table.
When a user requests a short link, we atomically pop a
token. This moves generation time out of the critical path.

Approach 3: Snowflake ID per Shard
Distribute writes across N database shards. Each shard
generates unique IDs using a modified Snowflake algorithm
(timestamp + shard ID + sequence). The short code includes
the shard ID to enable routing.

Approach 4: Zstd Compressed Path
Attempt to compress the entire target URL using a
compression algorithm like Zstd and encode the binary
result in Base64. If the result is short enough, use it;
otherwise, fall back to hashing. The URL itself determines
the code.

Approach 5: Deterministic Hashing with Keyless Resolution
Hash the long URL (e.g., SHA-256) and take the first 7
characters. The "database" is a distributed key-value
store (like Cassandra) where the Key is the hash. To
resolve, the client looks up the hash directly.

Prompt: "Consider this Python code: def mystery(n): if n <= 1: return n return mystery(n-1) + mystery(n-2) First, make 3 quick guesses about what this function does and its behavior. Don't analyze deeply yet - go with your gut. Then systematically verify each guess by: Tracing through specific inputs, Identifying the exact recurrence relation, Stating the time complexity with justification."

0.70 → 0.90

"Starting with gut-level brainstorming for 3 quick guesses - need creative, intuitive thinking without deep analysis"

Temp:

0.90

3 INITIAL GUESSES
══════════════════════════════════════════════════════════════
(Based on a quick scan of the code structure)


Guess 1: It calculates the Fibonacci sequence.
────────────────────────────────────────────────────────────────
The recursive definition (n-1 and n-2) combined with the 
base cases (0 and 1) strongly suggests this is the naive, 
recursive implementation of finding the n-th Fibonacci number.


Guess 2: It has exponential time complexity.
────────────────────────────────────────────────────────────────
Because each call branches into two more calls without any 
memoization or optimization, I expect it to perform redundant 
calculations, making it extremely slow for even moderately 
large inputs (like n > 30).


Guess 3: The base case handles negative inputs poorly.
────────────────────────────────────────────────────────────────
The condition n <= 1 treats negative numbers the same as 0 
or 1. It likely just returns n for negatives, which might be 
mathematically undefined for Fibonacci or simply unintended 
behavior rather than an error.

Prompt: "Implement a Python function retry(fn, max_attempts=3, backoff=1.0) that: Calls fn() and returns its result, On exception waits backoff seconds and retries, Doubles backoff each retry, After max_attempts re-raises the last exception. Then write a code review as if you're a senior engineer reviewing a junior's PR: List specific issues (bugs, edge cases, style), Rate severity of each (critical/major/minor/nit), Suggest concrete fixes."

0.70 → 0.10

"Starting with precise code implementation - need deterministic, correct output"

Temp:

0.10

IMPLEMENTATION
══════════════════════════════════════════════════════════════

import time
import functools

def retry(fn, max_attempts=3, backoff=1.0):
    """
    Retries a function with exponential backoff.
    
    Args:
        fn: The callable to execute.
        max_attempts: Maximum number of execution attempts.
        backoff: Initial backoff delay in seconds.
        
    Returns:
        The result of the function call.
        
    Raises:
        Exception: The last exception encountered if all 
                   attempts fail.
    """
    last_exception = None
    current_backoff = backoff
    
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            last_exception = e
            # If this was the last attempt, don't wait/loop
            if attempt == max_attempts - 1:
                break
            time.sleep(current_backoff)
            current_backoff *= 2
            
    raise last_exception

Prompt: "Explain why the sum 1 + 2 + 3 + ... + n equals n(n+1)/2. First, give 2-3 intuitive explanations that a curious high schooler would find satisfying. Use analogies, visual reasoning, or concrete examples. Then provide a rigorous proof by mathematical induction with explicit base case, inductive hypothesis, and inductive step."

Starting at 0.70

"Balanced temperature for creative explanations with analogies"

Temp:

0.70

INTUITIVE EXPLANATIONS
══════════════════════════════════════════════════════════════


EXPLANATION 1: The Handshake Problem (Gauss's Story)
────────────────────────────────────────────────────────────────
Imagine you're at a party with n people (including yourself). 
Everyone shakes hands with everyone else exactly once. How 
many handshakes occur?

  • You shake hands with (n - 1) other people
  • The next person shakes hands with (n - 2) remaining 
    people (they already shook with you)
  • This continues until the second-to-last person shakes 
    hands with 1 person

The total is: 1 + 2 + 3 + ... + (n - 1)

But there's a clever way to count the same handshakes: since 
each handshake involves 2 people, the total number of 
handshakes equals the number of ways to choose 2 people 
from n, which is:

        n(n-1)
        ──────
          2

This shows that 1 + 2 + 3 + ... + (n - 1) = n(n-1)/2, and 
adding n to both sides gives us the formula for 1 to n.


EXPLANATION 2: Visual "Staircase" Proof
────────────────────────────────────────────────────────────────
Picture a staircase made of blocks:

    █
    █ █
    █ █ █
    █ █ █ █

This represents 1 + 2 + 3 + 4 = 10 blocks.

Now, imagine making a COPY of this staircase and flipping 
it upside down:

    █ █ █ █        █ █ █ █
    █ █ █          █ █ █
    █ █            █ █
    █              █

When you push the two staircases together, they form a 
RECTANGLE:

    █ █ █ █ █
    █ █ █ █ █
    █ █ █ █ █
    █ █ █ █ █

The rectangle has 4 rows and 5 columns (one more than 
the number of rows), giving 4 × 5 = 20 blocks.

Since this is TWO copies of our original sum, we divide by 2:

        4 × 5
        ───── = 10
          2

Generalizing: for n rows, we get a rectangle of size 
n × (n + 1), and dividing by 2 gives the formula.


EXPLANATION 3: The "Bookend" Pairing Method
────────────────────────────────────────────────────────────────
Here's a beautifully simple observation. Write the sum twice:

    S = 1   + 2     + 3     + ··· + (n-1) + n
    S = n   + (n-1) + (n-2) + ··· + 2     + 1

Now add the two equations vertically (column by column):

    2S = (1+n) + (2+(n-1)) + (3+(n-2)) + ··· + (n+1)

Each pair adds to (n + 1), and there are n such pairs:

    2S = n(n + 1)

Therefore:

            n(n+1)
    S   =   ──────
              2

This is the method Gauss allegedly used as a child to 
instantly sum 1 to 100!

Prompt: "Write an original 12-line poem about the feeling of debugging code at 3am. Use vivid imagery and metaphor. Then write a literary analysis of your own poem: Identify the meter and rhyme scheme, Explain 2 specific metaphors and what they represent, Discuss the emotional arc."

0.70 → 0.90

"Starting creative poetry writing - need high temperature for vivid imagery and metaphorical language"

Temp:

0.90

THE GHOST IN THE MACHINE
══════════════════════════════════════════════════════════════

The monitor burns a crater in the dark,
A pixel sun that scorns the hour of sleep.
The logic fails to hit its target mark,
And buried secrets promise they will keep.

I hunt the syntax with a weary eye,
A shepherd counting sheep that will not run.
The cursor blinks its rhythmic, lullaby,
Refusing to admit the hunt is done.

But dawn approaches with its cruel decree,
The ghost remains unseen within the code.
A phantom logic—wild and error-free—
That travels on this silent, binary road.

The coffee's cold, the compiler's slow,
The only witness to this logic's woe.

Prompt: "Write a short story (3-4 paragraphs) about a futuristic marketplace on a space station. Include details about 3 specific vendors, their names, what they sell, and their prices in "Galactic Credits". Then extract the vendor information into a valid JSON object with a specific schema."

0.70 → 0.80

"Starting creative story writing - need higher temperature for descriptive world-building and character names"

Temp:

0.80

THE NEON BAZAAR
══════════════════════════════════════════════════════════════

The Neon Bazaar on Station 7 hummed with the sound of a
thousand languages and the rhythmic clinking of mag-lev carts.
High above the artificial clouds, holographic advertisements
flickered, casting long shadows over the narrow walkways. Here,
between the oxygen scrubbers and the gravity generators, one
could find anything from ancient Earth relics to the latest in
bio-mechanical enhancements.

At the far end of the promenade sat Ollivander's Orbitals, a
cramped stall smelling of ozone and old copper. The proprietor,
a three-armed Rigellian named Jax, specialized in salvaged
navigation chips. "Guaranteed to bypass any Imperial blockade,"
he'd whisper to anyone who lingered too long. His prized
possession, a refurbished Core-Link Processor, sat in a
velvet-lined case, priced at a steep 4,500 Galactic Credits.
For those on a tighter budget, Jax offered basic star-charts
for 150 Credits each.

Further down the lane, the air grew thick with the scent of
synthetic jasmine. This was the domain of The Alchemist's
Atrium, where a woman named Elara sold bottled atmospheres
from dead worlds. Her stall was a forest of glass vials, each
containing a swirling vortex of colored gas. A single breath
of "Vintage Mars" would set you back 800 Galactic Credits,
while the more common "Luna-Mist" was a bargain at 250 Credits.
Elara claimed her scents were the only way to remember a home
that no longer existed.

Near the central hub, a crowd always gathered around Glint &
Gear, a high-end cybernetics boutique run by a silent android
known only as Unit 42. The shop was a sterile white cube,
filled with sleek, chrome-plated limbs and ocular implants. A
standard-issue Mark IV Bionic Arm was the most popular item,
retailing for 12,000 Galactic Credits. For those seeking a
minor upgrade, Unit 42 offered reinforced carbon-fiber joints
for 1,200 Credits per set.

Wrapping Up

The Verdict

I'll be frank: this almost certainly doesn't matter for simple queries. Even for complex tasks, it's not likely something that will provide significant benefits that are worth the cost and UX tradeoffs, especially for workflows that benefit from steering (shameless plug for my first article, BTW). It's more likely that you'd want to break that complex one-shot prompt into multiple [prompts | messages | threads | workflows] anyways, at which point you would just set different temperatures for each, if you were so inclined.

What's more, Z.ai themselves [7], the creators of GLM, now recommend keeping temperature at 1.0 for most use cases. If official guidance from AI labs and the fact that the GPT-5 series doesn't even have a temperature parameter available in its API[8] are any indication, temperature might be becoming less relevant as models get smarter about self-regulating. I remember when the temperature made a huge difference when playing around in tools like OpenRouter's AI Chat Playground, but maybe temperature matters less than it used to.

In fact, while writing this, it is possible the most important thing I remembered was that thermostats don't actually typically change the temperature of the heater/AC itself; it blows air at a fixed hot/cold temperature and they simply turn it on or off. I am the king of reminding people about this fact when they think setting the thermostat to 80 will warm the room from 65 -> 70 more quickly. So, while my experiment may have worked somewhat, my pun does not. Oh well.

So, going back to my wife's initial question about whether you can change temperature on-the-fly: the answer is yes, you can. As with many things in the age of vibe coding and AI slop, whether you should is a different question, and the answer is probably not.

However, the best experiments are often the ones that answer "can we?" even when the answer to "should we?" is "probably not". Behind every "probably not" is a "why not?" worth asking. Not "why not?" in the literal "give me a reason" sense, as you'll surely get plenty of reasons from pedants and misanthropes alike, but "why not?" in the whimsical "why the hell not?" sense. Remember when they did surgery on a grape (the meme from ~2016)? It turns out that this is actually a useful technique for testing cutting edge micro-surgery robotics, owing to the precision required to operate on such a delicate but delicious fruit. However, a cursory Google search for "grape surgeons near me" reveals that this is not a common medical specialty, even a decade later. Sometimes, the gap between "why we could" and "why we don't" shines light on what is possible or valuable theoretically versus what is actually useful in the real world. In both cases, the curious question is still worth asking. Thanks for coming along for the ride.

— Jeff

References

[1]Hinton, G., et al. (2015). Distilling the Knowledge in a Neural Network. NIPS Deep Learning Workshop.
[2]von Platen, P. (2020). How to generate text: using different decoding methods for language generation with Transformers. Hugging Face Blog.
[3]OpenAI. Streaming API Reference.
[4]OpenAI. Function calling guide.
[5]Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023.
[6]Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP '23.
[7]Zhipu AI. Migrate to GLM-4.7.
[8]OpenAI. Using GPT-5.2.

Continue Reading

The Wheel Spinner vs. The Orchestrator

Stop Pedaling. Start Steering.

Thermostat

The "Thermostat" Experiment

The Theory: It's Just Math

The Autocomplete Metaphor (Beaten to Death Edition)

Do You Believe in Magic, in an LLM's Heart?

The Temperature Spectrum

Technical Aside

Softmax & Sampling

Why Stitching?

Under the Hood

The Stitching Trick

1. The Tool Definitions & Schema Hacking

2. Why Tool Calls Are Perfect for This

Special Token Approach

Tool Call Approach

3. Tool Calls Are Just Tokens, But With A Learned Behavior Advantage

4. Temperature Affects Its Own Adjustment Probability

5. Requiring Justification for Temperature Changes

6. Validation & Anti-Thrashing

What happens when a change is rejected?

7. The Orchestrator Loop

The Stitching Loop

The "Call Boundaries" Logic

The KV Catch

The Hard Truth

The Cost Breakdown

The Future: True Intra-Call Support

In Practice

1. Brainstorm → Spec

2. API Draft → Security Review

3. Multi-Phase Design Document

4. Guess → Verify

5. Implement → Code Review

6. Intuition → Proof

7. Poetry → Literary Analysis

8. Story → JSON Extraction

The Verdict

References

Continue Reading

The Wheel Spinner vs. The Orchestrator