VibeThinker-3B — Notes & Takeaways

A 3B model that scores 94.3 on AIME26 and, by its authors’ account, matches Gemini 3 Pro on math and code, while the paper openly concedes it can’t hold broad knowledge. The interesting part isn’t the benchmark; it’s the hypothesis underneath it.

I run models on my own hardware, so I’ve internalized a rule of thumb: for anything that requires real reasoning, the small local model is a toy and the frontier model is the tool. A 3B model could autocomplete and summarize; if I wanted it to think — work a multi-step math problem, debug code it hadn’t seen — I reached for something I didn’t host. That gap felt permanent, a tax you pay for not owning a datacenter.

VibeThinker-3B is the paper that made me question whether the gap is about size at all. Published by the WeiboAI team in June 2026, it’s a 3-billion-parameter dense model that posts numbers I would have flatly disbelieved a year ago — and, more usefully, it comes with a hypothesis for why a model this small can do this, and an honest account of what it gives up in exchange.

The numbers

Straight from the paper, on verifiable reasoning tasks:

  • 94.3 on AIME26 (the 2026 American Invitational Mathematics Examination), rising to 97.1 with claim-level test-time scaling
  • 80.2 Pass@1 on LiveCodeBench v6
  • 96.1% acceptance on recent, unseen LeetCode contests
  • 93.4 on IFEval, the instruction-following benchmark
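
For readers who want to sanity-check a Pass@1 figure on their own runs, here is a minimal sketch of the standard unbiased pass@k estimator popularized by code-generation benchmarks. It is an assumption that the paper computes its score this way; the function names and the (samples, correct) input format are mine, not the authors'.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions, c: how many of them passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over problems, each given as (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the plain fraction of correct samples per problem, averaged over the benchmark, which is why Pass@1 is the least forgiving way to report a score.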

The authors place this “in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro.” A 3B model trading blows with Gemini 3 Pro on competition math is the kind of claim that should set off alarms, so let’s treat it with the suspicion it deserves.

Why the LeetCode number is the one that matters

Whenever a tiny model posts frontier math scores, the first reaction, and the correct one, is to suspect contamination. AIME problems and their solutions are all over the public web that training data is scraped from; a model can look like it’s reasoning when it’s really reciting. The 1.5B predecessor to this model kicked off exactly that argument when it landed, and the AI world spent a week relitigating what benchmarks even mean.

This is why the 96.1% acceptance rate on recent, unseen LeetCode contests is the result I weight most heavily. A model can’t memorize a contest that didn’t exist when it was trained. Out-of-distribution generalization on freshly released problems is the hard-to-fake version of the claim, and it’s the number I’d want to reproduce myself before fully believing the rest. The 93.4 on IFEval matters for a different reason: it says the model didn’t get tunnel-visioned into a math savant that can’t follow a plain instruction, a common failure mode when you optimize this hard for one capability.
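
If I do try to reproduce the fresh-contest result, the bookkeeping would look roughly like the sketch below: keep only problems released after a training cutoff, then compute acceptance on that subset. The `Submission` record, the field names, and any cutoff date passed in are hypothetical; the source does not state the model's actual data cutoff.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Submission:
    """One model attempt at a contest problem (hypothetical record shape)."""
    problem_id: str
    released: date   # when the problem first became public
    accepted: bool   # whether the judge accepted the solution


def fresh_acceptance_rate(subs: list[Submission], cutoff: date) -> Optional[float]:
    """Acceptance rate restricted to problems released strictly after `cutoff`.

    Returns None when no submission qualifies, so that an empty sample
    is never mistaken for a 0% score.
    """
    fresh = [s for s in subs if s.released > cutoff]
    if not fresh:
        return None
    return sum(s.accepted for s in fresh) / len(fresh)
```

The strict inequality is deliberate: a problem published on the cutoff day itself could plausibly have leaked into training data, so it is excluded.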

The actual idea: compression vs. coverage

The benchmarks are the headline, but the part worth keeping is the framing the authors build to explain them — the Parametric Compression-Coverage Hypothesis. The claim, roughly:

Verifiable reasoning compresses into a compact “reasoning core.” Open-domain knowledge and general competence do not — they need broad parameter coverage across facts, concepts, and long-tail scenarios.

This reframes the whole small-vs-large debate. The reason a 3B model can match Gemini 3 Pro on AIME but not replace it is that they’re being asked to store two different kinds of thing. Math and code reasoning is a procedure — a relatively small set of transferable moves applied recursively. You can pack that into a few billion parameters. But knowing who won an obscure election, or the API surface of some library, or ten thousand long-tail facts — that’s coverage, and coverage is what large models spend their parameters on.

So VibeThinker isn’t claiming small models are secretly as good as big ones. It’s claiming reasoning and knowledge are separable, and that one of them is far more compressible than we had assumed. Small models, in this view, aren’t budget substitutes; they’re a complementary path that’s genuinely frontier-level inside the band of verifiable, procedural tasks.

How they got there

The method is post-training, not a new architecture — they take a base model and shape it through what they call the Spectrum-to-Signal paradigm: curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. The intuition I take from it (and the earlier 1.5B work) is that you first deliberately widen the model’s diversity of solution attempts — the spectrum — and then use RL to sharpen that spectrum into a reliable signal, rather than collapsing early onto one mediocre strategy. Squeezing a strong reasoning core out of a small model is apparently as much about how you train as how big you build.

What it means for someone with a homelab

This is the part I keep turning over. A 3B reasoning model fits on hardware I already own, with room to spare. If the OOD results hold up, the calculus for a lot of my local tooling changes: the agent that reads logs and reasons about a failing node, the thing that triages a stack trace — those are verifiable, procedural tasks, exactly the band where this model claims to live. I don’t need broad world knowledge for that work; I need a tight reasoning core that runs offline and fast.
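
To put a rough number on "fits on hardware I already own", here is a back-of-envelope sketch of weight memory at common precisions. It assumes exactly 3e9 parameters and counts weights only; the KV cache, activations, and the per-block scale overhead of real quantization formats are all ignored, so treat the figures as lower bounds.

```python
# Bytes needed to store one parameter at each precision (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: "3B" is taken as exactly 3e9 parameters.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(3e9, dtype):.1f} GB")
# fp32: 12.0 GB
# fp16: 6.0 GB
# int8: 3.0 GB
# int4: 1.5 GB
```

Even at fp16 the weights land around 6 GB, which is within reach of a single consumer GPU; the quantized rows are where "room to spare" comes from.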

The honest caveat is the one the paper hands you itself: do not point this at open-domain questions and expect frontier answers. By its own hypothesis, the coverage isn’t there. The right mental model is a specialist, not a shrunken generalist — and the trap would be benchmarking it on AIME, getting excited, and then quietly being disappointed when it can’t do the general-assistant things a big model does. Used for what it’s for, though, it’s the strongest argument I’ve seen that “intelligence per parameter” still has a long way to climb.
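
One way to keep myself honest about the specialist framing is to encode it in the tooling. The sketch below routes verifiable, procedural tasks to the local model and everything else to a hosted generalist. Every name in it (the backend labels, the task registry, the task names) is hypothetical and only illustrates the split; none of it comes from the paper.

```python
from enum import Enum


class TaskKind(Enum):
    VERIFIABLE = "verifiable"    # procedural work with a checkable answer
    OPEN_DOMAIN = "open_domain"  # depends on broad factual coverage


# Hypothetical backend labels; substitute whatever is actually deployed.
LOCAL_REASONER = "local-3b-reasoner"
HOSTED_GENERALIST = "hosted-generalist"

# Hypothetical registry mapping homelab jobs to the kind of task they are.
TASK_KINDS = {
    "triage_stack_trace": TaskKind.VERIFIABLE,
    "diagnose_failing_node": TaskKind.VERIFIABLE,
    "answer_trivia": TaskKind.OPEN_DOMAIN,
}


def pick_backend(task_name: str) -> str:
    """Send verifiable tasks to the local model; default everything else
    (including unknown tasks) to the generalist."""
    kind = TASK_KINDS.get(task_name, TaskKind.OPEN_DOMAIN)
    return LOCAL_REASONER if kind is TaskKind.VERIFIABLE else HOSTED_GENERALIST
```

Defaulting unknown tasks to the generalist is the conservative choice: the failure I want to avoid is silently pointing the small model at a coverage problem.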

It also rhymes with the Recursive Language Models note: both are evidence that the lever marked “bigger” is not the only one, and probably not the most interesting one, left to pull.

Paper: VibeThinker-3B (arXiv:2606.16140).