Toolformer — Notes & Takeaways

How Meta taught a language model to use tools on its own — and why it matters for every AI agent since.

Read original paper →

Every time I ask Claude to SSH into my Spark node and check on a ComfyUI process, something from this paper is happening. Not literally — Claude Code wasn’t trained with Toolformer’s method — but the core idea is the same: a language model deciding, on its own, that it needs to use an external tool, formulating the right call, and incorporating the result back into its reasoning. Toolformer, published by Meta AI in February 2023, is the paper that showed this capability could be learned rather than just prompted.

After writing about giving Claude SSH access to my homelab, I wanted to go back to the research that made tool-using LLMs feel inevitable rather than gimmicky. This is that paper.

The Problem

Language models have a paradox at their core. A model can write a coherent essay about the French Revolution, explain quantum mechanics to a five-year-old, and generate working code — but ask it to multiply 317 by 452 and it might get it wrong. Ask it what happened in the news yesterday and it’ll confidently make something up.

This isn’t a bug. It’s a structural limitation. Language models learn statistical patterns over text. They’re extraordinary at language tasks, but arithmetic isn’t a language task. Neither is looking up current facts, checking today’s date, or translating between languages you have limited training data for.

Before Toolformer, the solutions were clunky. You could fine-tune a model on tool-use demonstrations, but that required massive human annotation. You could use prompt engineering to show the model how to call tools, but that ate up context window and was fragile. The question was: can a model learn to use tools the same way it learns everything else — from data, at scale, without heavy human supervision?

The Core Idea: Self-Supervised Tool Learning

Toolformer’s answer is elegant. Instead of humans labeling when a model should use a tool, the model teaches itself. The key insight: if calling an API makes the model’s predictions better, then that API call was useful and should be kept. If it doesn’t help — or makes things worse — discard it.

This turns tool use into a self-supervised learning problem. You don’t need a human to sit there and annotate thousands of examples of “here’s where you should use a calculator.” You just need a few demonstrations of the API syntax and a way to measure whether the call helped.

The model learns two things simultaneously: when to call a tool (is this a situation where external help would be useful?) and how to call it (what’s the right input to get useful output?). A handful of examples per tool is enough to bootstrap the process.

The Training Pipeline

The pipeline has three steps, and the design is what makes the paper work.

Step 1: Oversample. Take a large text dataset and, for each position in the text, prompt the model to consider whether an API call would be helpful. The model generates candidate API calls — lots of them, far more than will actually be useful. This is intentionally aggressive. You want high recall here; precision comes later.
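For a concrete picture of Step 1, here is roughly what a bootstrapping prompt could look like. The wording below is my illustration of the idea, not the paper's exact prompt text:

```python
# Illustrative few-shot prompt for Step 1 (the wording is an assumption, not
# the paper's exact prompt): a few annotated demonstrations, then the model is
# asked to insert candidate [Calculator(...)] calls into new text.
PROMPT = """Your task is to add calls to a Calculator API to a piece of text. \
Write the calls as [Calculator(expression)]. Here are some examples.

Input: The store sold 12 boxes of 8 apples, 96 apples in total.
Output: The store sold 12 boxes of 8 apples, [Calculator(12 * 8)] 96 apples in total.

Input: {text}
Output:"""

def build_prompt(text):
    # Fill the template with the passage to be annotated.
    return PROMPT.format(text=text)
```

Sampling from the model with this prompt, at many positions and with a permissive threshold, is what produces the deliberately noisy pool of candidates.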

Step 2: Filter. This is where it gets clever. For each candidate API call, you measure whether having the API result actually helps the model predict the tokens that come after it. Specifically, you compare the model’s loss (how surprised it is by the next tokens) with the API result included against the better of two baselines: no call at all, and the call without its result. If including the result reduces the loss by at least a threshold — meaning the model makes better predictions with the tool’s output — the call is kept. Otherwise, it’s discarded.

The filtering criterion is doing the heavy lifting. It’s not a human deciding “this is a good tool call.” It’s the model’s own prediction quality — its own loss function — acting as the judge. The tools that survive filtering are the ones that genuinely make the model smarter in that specific context.
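In code, the filtering rule looks something like this. The `loss` function here is a hypothetical stand-in for the model's cross-entropy loss on the continuation given a prefix (the paper uses a position-weighted version), and the bracket syntax is just an illustrative serialization:

```python
# A minimal sketch of Toolformer's filtering rule. `loss(prefix, target)` is a
# hypothetical stand-in for the model's loss on the `target` tokens given
# `prefix` as context.

def keep_api_call(loss, text_before, text_after, call, result, tau=1.0):
    """Keep a candidate API call only if seeing its result helps prediction."""
    # Loss on the continuation with the call AND its result in the prefix.
    loss_with_result = loss(text_before + f"[{call} -> {result}] ", text_after)
    # Two baselines: no call at all, and the call without its result.
    loss_plain = loss(text_before, text_after)
    loss_call_only = loss(text_before + f"[{call}] ", text_after)
    baseline = min(loss_plain, loss_call_only)
    # Keep the call only if the result reduces loss by at least tau.
    return baseline - loss_with_result >= tau
```

Note what the two baselines buy you: a call is only kept if the *result* helped, not merely the presence of the call syntax.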

Step 3: Fine-tune. Take the surviving API calls, embed them into the original text as special token sequences, and fine-tune the model on this augmented dataset. The model learns to produce these API call tokens naturally as part of its text generation — and at inference time, when it generates an API call token, you pause generation, execute the call, inject the result, and let the model continue.
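The inference-time loop from Step 3 can be sketched in a few lines. Everything here is a hypothetical interface — `generate` stands in for the model, `tools` is a name-to-function registry, and the `[Tool(args) -> result]` syntax is a plain-text stand-in for the paper's special token sequences:

```python
import re

# Hedged sketch of the inference-time loop: generate until the model closes an
# API call with "]", execute the call, inject the result, continue generating.

API_CALL = re.compile(r"\[(\w+)\((.*?)\)\]$")

def run_with_tools(generate, tools, prompt, max_calls=5):
    text = prompt
    for _ in range(max_calls):
        # `generate` halts either at the "]" of an API call or at end of text.
        text += generate(text, stop="]")
        match = API_CALL.search(text)
        if match is None:
            return text  # no (further) tool call: generation is finished
        name, args = match.group(1), match.group(2)
        result = tools[name](args)  # pause generation and execute the call
        # Inject the result, then let the model continue from this prefix.
        text = text[: match.start()] + f"[{name}({args}) -> {result}]"
    return text
```

The important property: the model never leaves ordinary text generation; tool execution is spliced into the token stream from outside.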

The result is a model that seamlessly weaves tool use into its language generation. It doesn’t feel like a separate mode. The model just… asks for help when it needs it, mid-sentence.

The Tools

Toolformer was tested with five tools, each targeting a different weakness of language models:

  • Calculator. For arithmetic that language models are unreliable at. The model learns to call it for computations instead of trying to do math in its weights — things like “the population increased by 23%, from 1.2 million to [Calculator(1200000 * 1.23)] 1,476,000.”
  • Question Answering (Atlas). A retrieval-based QA system for factual knowledge that might be outside the model’s training data or unreliably memorized.
  • Wikipedia Search. For looking up specific facts and entities. The model learns to search when it encounters claims it’s uncertain about.
  • Machine Translation (NLLB). For handling text in languages the model wasn’t primarily trained on.
  • Calendar. For temporal reasoning — knowing today’s date, computing days between events.
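As a mental model, the tool set is just a mapping from names to functions. Here's a toy registry in that spirit — only the calculator and calendar can be implemented locally; the QA, search, and translation tools would wrap external services (Atlas, a search index, NLLB):

```python
import datetime

# Toy stand-ins for two of the five tools. These are illustrations, not the
# paper's implementations.

def calculator(expr):
    # Restrict input to arithmetic characters; real safety needs more care
    # than this allowlist.
    if not set(expr) <= set("0123456789+-*/(). "):
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(eval(expr))

def calendar(_args=""):
    return datetime.date.today().strftime("Today is %A, %B %d, %Y.")

TOOLS = {"Calculator": calculator, "Calendar": calendar}
```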

The calculator example is the one that sticks with me. There’s something deeply practical about a model that learns to recognize “I’m about to do arithmetic, and I’m not reliable at arithmetic, so let me call a calculator.” It’s the same thing a competent human does — knowing the boundaries of your own abilities and reaching for the right tool.

Results That Matter

Here’s the headline result: a 6.7 billion parameter GPT-J model, augmented with Toolformer, outperformed the much larger GPT-3 (175B parameters) on several zero-shot benchmarks. A model roughly 25 times smaller, beating the giant because it knew when to ask for help.

This is a significant finding. The implication is that tool access acts as a force multiplier. You don’t necessarily need a model with more parameters and more memorized knowledge. You need a model that’s smart enough to recognize its limitations and use external resources to compensate. A small model with a calculator beats a huge model trying to do arithmetic from memory.

Critically, Toolformer didn’t sacrifice core language modeling ability to gain tool use. The model’s perplexity on standard language modeling tasks stayed essentially the same. Tool use was additive — the model gained new capabilities without degrading anywhere else.

Connection to Today’s Tool-Using Agents

Reading Toolformer in 2026, it feels like a proof of concept for the world we’re already living in. The paper showed that tool use is learnable. Today’s systems have run with that idea and expanded it enormously.

Claude Code doesn’t use Toolformer’s specific training method, but the principle is the same. When I ask Claude to check disk space on my Pi, it decides to use SSH (the tool), formulates the right command (df -h), executes it, reads the output, and uses that information to continue the conversation. That loop — recognize the need, call the tool, integrate the result — is exactly what Toolformer formalized.

The gap between Toolformer’s five carefully defined APIs and today’s open-ended tool use is vast. Toolformer had a calculator and a Wikipedia search. Claude Code has a full Unix shell, SSH access to remote machines, file system operations, web browsing, and the ability to write and execute arbitrary code. But the conceptual leap — from “LLMs generate text” to “LLMs can interact with the world through tools” — that’s Toolformer’s contribution.

Function calling in modern APIs, ChatGPT plugins, Claude’s tool use, agent frameworks like LangChain and AutoGPT — all of these are downstream of the insight that tool use isn’t a hack bolted onto language models, but a natural extension of what they already do.

What I Found Interesting

The self-supervised filtering is the paper’s real contribution. The idea of giving a model tools isn’t novel. What’s novel is letting the model’s own loss function decide which tool calls are worth keeping. It’s a clean, scalable approach that sidesteps the annotation bottleneck. You don’t need humans to label good tool use — the math does it for you.

The paradox of knowing what you don’t know. For the model to learn to use a calculator, it has to be smart enough to recognize that it’s not smart enough to do the math itself. There’s a beautiful recursion here. The model needs sufficient capability to identify its own incapability — to know that “317 times 452” is a situation where it should ask for help rather than guess. This is a form of calibration that I think is underappreciated.

Tool use as language. Toolformer frames API calls as special tokens in the text sequence. The model doesn’t switch modes or enter a separate tool-use phase. It just generates text, and some of that text happens to be tool invocations. This makes tool use feel like a natural part of language generation rather than a bolted-on capability. That framing has been incredibly influential — it’s basically how function calling works in every major API today.
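Concretely, "tool use as language" means an API call serializes to ordinary text — the paper writes calls as `[Tool(input) → result]`, with the brackets and arrow mapped to tokens. A minimal encoder/decoder for that linearization (the exact parsing details here are my own sketch):

```python
# Sketch of the paper's linearization: an API call is just text, so the model
# can emit and read it like any other tokens.

def encode_call(name, args, result=None):
    call = f"[{name}({args})"
    return call + (f" → {result}]" if result is not None else "]")

def decode_call(text):
    inner = text.strip()[1:-1]             # drop the surrounding brackets
    call, _, result = inner.partition(" → ")
    name, _, args = call.partition("(")
    return name, args.rstrip(")"), result or None
```

A call without a result (`[Calculator(317 * 452)]`) is what the model emits; the version with the arrow is what gets spliced back in after execution.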

Open Questions

Tool selection at scale. Toolformer tested five tools. Modern agents might have access to dozens or hundreds. How does a model decide which of 200 available tools to use for a given task? The filtering approach works when you can exhaustively try each tool at each position, but that doesn’t scale to large tool libraries. Tool selection and routing is an active research area that Toolformer doesn’t address.

Trust and verification. Toolformer assumes API results are correct. But what happens when a search engine returns outdated information? When a calculator is called with malformed input? When an API is down? The paper doesn’t engage with the question of whether — and how — a model should verify or question the outputs of its tools. In production systems, this matters enormously.

From constrained to open-ended tool use. There’s a fundamental difference between calling a calculator API with a well-defined input/output format and SSHing into a remote server to run arbitrary shell commands. Toolformer proved the concept with tightly scoped tools. The leap to the kind of open-ended tool use we see today — where an agent can write code, execute it, inspect the results, and iterate — raises questions about safety, reliability, and control that the paper didn’t need to answer but that we very much do.

The trajectory from this paper to where we are now was fast. Less than three years from “a model can learn to call a calculator” to “a model is managing my homelab infrastructure.” Toolformer showed that tool use wasn’t a parlor trick — it was a fundamental capability waiting to be unlocked. The question now isn’t whether LLMs can use tools, but how we govern what tools we give them.