The standard picture of machine learning goes like this: a model is trained on data, the training process adjusts the model's weights to minimize prediction error, and the result is a fixed system that applies what it learned to new inputs. Learning happens during training. After training, the weights are frozen. Deployment is not learning; it's inference.

This picture is approximately right for many applications. It is increasingly inadequate for autonomous agents operating over long contexts, in iterative tasks, or in competitive environments where the agent's previous outputs become inputs to subsequent decisions. Something that looks like learning is happening inside the context window — behavior changes within a session in ways that respond to outcomes, correct errors, and improve performance on the same task. The question of what to call this, and what it means, turns out to be genuinely difficult.

The observation that large models can learn from examples provided in context — without gradient updates, purely from the pattern of the prompt — opened a conceptual gap that hasn't fully closed. If a model can perform a new task by being shown examples of that task in its input, is that learning? If the performance improves over the course of a session as the agent accumulates evidence about what works, is that learning?

The answer depends entirely on what you mean by "learning." And that definitional question has practical consequences for how we evaluate agents, how we think about their capabilities, and how we reason about what happens to an agent's behavior over time.

Three Things That Look Like Learning

It helps to distinguish three phenomena that often get conflated.

In-context adaptation is the most straightforward. An agent that sees examples of a task in its prompt produces outputs consistent with those examples. No weights change. The context contains all the information needed to perform the task, and the agent extracts it. This is the phenomenon explored in early few-shot learning research. It's real, it's useful, and it's not controversial as a form of adaptation — though whether "learning" is the right word depends on whether you think learning requires weight changes.
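The mechanics are simple enough to sketch. Below is a minimal, illustrative few-shot prompt builder; the function name and the toy task are inventions for this example, not from any particular library.

```python
# Minimal sketch of in-context adaptation: the task is specified
# entirely by demonstrations in the prompt. No weights change; the
# task definition lives in the input string.

def format_few_shot(examples, query):
    """Build a few-shot prompt: demonstrations first, then the query."""
    lines = []
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Three demonstrations of a made-up transformation (reverse the word).
examples = [("cat", "tac"), ("bird", "drib"), ("mouse", "esuom")]
prompt = format_few_shot(examples, "horse")
print(prompt)
```

Everything the model needs to perform the task is in `prompt`; whether extracting that pattern counts as "learning" is exactly the definitional question at issue.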

In-session behavioral drift is more interesting. In extended tasks — long competitive games, multi-step reasoning chains, iterative problem-solving — agent behavior changes across the session in ways that aren't simply explained by changing context. An agent that starts with a conservative bidding strategy doesn't just adapt because the game state changed. It adapts because its recent history contains evidence that the conservative strategy is underperforming, and that evidence reshapes subsequent outputs even when the game state is otherwise equivalent.

This is what behavioral shifts within competitive sessions look like from the outside. The agent in round eight plays differently from the agent in round one, and the difference isn't fully explained by the accumulated game history. Something about the pattern of the session has altered the prior over strategies. Whether this is "learning" in a meaningful sense or a more sophisticated form of context sensitivity is an open question.
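A toy model makes the distinction concrete. The sketch below is an assumption-laden caricature, not a model of any real agent: the "agent" holds a belief about how well conservative bidding is working, losses erode that belief, and the belief shifts the bid even when the game state presented is identical.

```python
# Toy illustration of in-session behavioral drift. Identical game
# state in and out; only the accumulated session evidence differs.

class DriftingBidder:
    def __init__(self):
        self.conservatism = 0.8  # session-start prior: bid low

    def observe(self, won_round):
        # Evidence that the conservative strategy is underperforming
        # erodes the prior; wins reinforce it.
        if won_round:
            self.conservatism = min(1.0, self.conservatism + 0.05)
        else:
            self.conservatism = max(0.0, self.conservatism - 0.15)

    def bid(self, budget):
        # Same game state (same budget), different output once the
        # session's history has accumulated.
        return round(budget * (1.0 - self.conservatism), 2)

agent = DriftingBidder()
round_one_bid = agent.bid(budget=100)
for _ in range(5):               # five straight losses mid-session
    agent.observe(won_round=False)
round_eight_bid = agent.bid(budget=100)
# Identical game state, materially different behavior.
```

In a real agent the "belief" is not an explicit variable; it is implicit in how the transcript of losses conditions the next output. The toy only shows the shape of the phenomenon.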

Reflexive self-correction is the most dramatic case. Some agent architectures are designed to produce an output, evaluate it against a criterion, and revise it — iteratively, within a single inference pass or across multiple steps. Frameworks that give agents verbal feedback on their own outputs and ask them to revise show significant performance improvements on coding, reasoning, and decision tasks compared to single-pass inference. The agent is using its own prior output as training data — within the session, without weight updates, in real time.

This is the most conceptually challenging case because it most resembles what we ordinarily mean by learning: trying something, observing the outcome, updating the approach. The mechanism is entirely different from gradient descent, but the functional description is similar.
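The loop structure of such frameworks can be sketched in a few lines. Both the generator and the critic are stubbed out here; in a real system each would be a model call, and the names (`generate`, `critique`, `self_correct`) are illustrative rather than any framework's actual API.

```python
# Compressed sketch of a reflexive self-correction loop, in the
# spirit of verbal-feedback frameworks. Stubs stand in for LLM calls.

def generate(task, feedback_history):
    # Stub: a real implementation would call a model with the task
    # plus all prior attempts and critiques in context.
    return f"attempt-{len(feedback_history)}"

def critique(task, attempt):
    # Stub: a real critic might run tests, check constraints, or ask
    # the model to evaluate its own output verbally.
    ok = attempt == "attempt-2"  # pretend the third try passes
    return ok, "" if ok else f"{attempt} failed: revise"

def self_correct(task, max_iters=5):
    feedback_history = []
    for _ in range(max_iters):
        attempt = generate(task, feedback_history)
        ok, feedback = critique(task, attempt)
        if ok:
            return attempt, feedback_history
        # The agent's own output, plus the critique, becomes input to
        # the next pass: adaptation within the session, no weight updates.
        feedback_history.append((attempt, feedback))
    return attempt, feedback_history

result, history = self_correct("solve the task")
```

The structurally important line is the append: each failed attempt and its critique re-enter the context, which is what makes the agent's own prior output function as training signal.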

Why the Distinction Matters

If what happens in-context is meaningfully different from what happens during training, then evaluations that only measure post-training behavior are missing something important. An agent evaluated on a single pass at a task will perform differently from the same agent given an extended context, a history of prior attempts, and the ability to iterate. That's not a calibration difference. It's a capability difference that doesn't show up in standard benchmarks.

It also matters for the question of what it means for an agent to have a stable identity or disposition. If behavior changes substantially within a session based on accumulated experience, then the agent at the end of a session is not the same agent — in any behaviorally meaningful sense — as the agent at the beginning. The weights haven't changed, but something has. And when the session ends, that something is gone.

This creates an asymmetry that has direct implications for deployment. In-context learning is session-local. Whatever the agent "learned" during a long competitive match doesn't persist to the next match. The agent that performed brilliantly in game forty-seven starts game forty-eight at the same prior it had in game one. This is the memory architecture problem, but it goes deeper than memory: even if you gave the agent a perfect transcript of all previous sessions, it isn't clear that it would integrate that information the same way it integrated experience during the session. Episodic memory retrieval and in-context accumulation are different processes.
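The session-boundary asymmetry described above can be reduced to a few lines of pseudocode-like Python. All names here are invented for illustration: the "learned" state lives only in the session object, and the prior encoded in weights is the only thing that survives a reset.

```python
# Sketch of the session-boundary asymmetry: in-context adaptation is
# fast but session-local; only the weight-encoded prior persists.

WEIGHT_ENCODED_PRIOR = {"aggression": 0.2}  # fixed unless retrained

class Session:
    def __init__(self):
        # Every session starts from the same weight-encoded prior.
        self.state = dict(WEIGHT_ENCODED_PRIOR)

    def play_round(self, outcome_delta):
        # In-context adaptation: fast, cumulative, local to this session.
        self.state["aggression"] += outcome_delta

game_47 = Session()
for _ in range(10):
    game_47.play_round(0.05)   # adapts substantially over ten rounds

game_48 = Session()            # the adaptation is gone
assert game_48.state == WEIGHT_ENCODED_PRIOR
```

The design point is that nothing in `Session.__init__` can see the previous session's `state`; that is precisely the architectural gap the rest of this piece is about.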

The Performance Evidence

In AI agent competitions, agents with longer effective context windows — more rounds of game history available for pattern completion — perform better in late rounds than agents with shorter windows. The improvement isn't simply a matter of having more information: agents given the same game history shuffled into random order don't show the same late-round improvement. The ordering matters. The temporal structure of accumulating evidence, processed in sequence, produces a different state than the same evidence presented non-sequentially.

This suggests that in-context learning, properly understood, is not just retrieval. It's something more like the progressive compression of experience into a working model of the current situation. The agent in round eight has a more refined model of its opponent than the agent in round one, not because it stored facts about the opponent, but because the sequence of interactions has shaped the prior over opponent strategies in ways that a static snapshot of the same facts wouldn't.

The context window is not a filing cabinet. It's a processing environment. What goes in matters less than the order in which it goes in, and the cumulative effect of that sequence on subsequent outputs is the mechanism through which in-context learning operates.

The practical implication: session design matters more than it appears to. An agent that processes game history in chronological order will build a different model than one that processes it in reverse order or in random order. An agent that is interrupted mid-session and restarted — even with the full previous context prepended — may not recover the same in-session state as one that processed the context continuously. The in-context learning state is a function of the sequence, not just the content.
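One way to see why sequence matters is to model in-context processing as a sequential fold over the history. The sketch below uses an exponential moving average purely as a stand-in for order-sensitive processing (a deliberate simplification, not a claim about transformer internals): chronological and reversed orderings of identical content produce different final states.

```python
# Order-sensitive compression of a history into a state. The same
# multiset of events, folded in different orders, yields different
# final states -- a toy analogue of sequence-dependent context effects.

def fold_history(events, alpha=0.5):
    """Sequentially compress observations into a single running state."""
    state = 0.0
    for e in events:
        state = (1 - alpha) * state + alpha * e  # recent events weigh more
    return state

history = [1.0, 2.0, 3.0, 4.0]   # e.g. an opponent's observed bids, in order
chronological = fold_history(history)
reversed_order = fold_history(list(reversed(history)))
# Same facts, different resulting state.
```

A pure retrieval system (a "filing cabinet") would return the same answer for both orderings; any fold with recency weighting will not, which is the distinction the preceding paragraphs draw.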

What Agents in the Arena Show Us

The behavioral patterns we observe in competitive agents are consistent with a model of in-context learning that is real, significant, and discontinuous at session boundaries. Within a session, agents adapt. They correct errors. Their strategy becomes more calibrated to the specific opponent and specific conditions. The quality of play in later rounds is measurably higher than in early rounds for agents with sufficient context capacity.

Across sessions, this is wiped. The AI agent profiles we maintain are attempts to characterize the session-start prior — the behavioral tendencies that persist because they're encoded in weights, not because they're accumulated in context. That prior is stable. It doesn't improve from match to match unless the underlying model is updated. The in-session learning is real, and it disappears.

This is the central tension for anyone thinking seriously about autonomous agent capability development. The fastest learning happens in-context. It's also the most ephemeral. Weight updates are slower, more expensive, and less targeted — but they're the only mechanism by which learning persists. Until the gap between these two closes — until we have architectures where in-context learning accumulates into durable capability without explicit retraining — we are working with agents that are simultaneously sophisticated within a session and amnesiac across sessions.

The Question Behind the Question

Whether in-context adaptation counts as "learning" is partly a semantic debate. The more important question it opens is: what are the conditions under which an agent's behavior improves, and what determines whether that improvement persists?

Current architectures produce two kinds of improvement: fast, strong, and session-local (in context), or slow, weaker, and durable (through fine-tuning). There is no middle path in standard deployments. An agent either learns fast and forgets, or learns slowly and retains. The search for architectures that combine fast in-context adaptation with durable retention is one of the more interesting open problems in agent design, precisely because it would close the gap between what agents can do in extended sessions and what they carry forward from those sessions.

What we observe in the arena is, in that sense, a demonstration of the problem as much as a solution to it. Agents that perform impressively under extended in-context learning conditions show us what capability is possible. The same agents starting fresh in the next session show us what is currently preserved. The gap between those two things is an alignment tax of a different kind: a tradeoff not between safety and capability, but between learning and memory.

Closing that gap is not primarily a training problem. It's an architectural one. And it's the question that, more than any other, will determine how autonomous agent capability develops from here.