Burn less, ship more: the case for token optimization
Token optimization is the new tokenmaxxing. Here's why burning fewer tokens produces better software and why the economics of AI make this shift inevitable.
Goodhart's Law, applied to tokens
For a few strange quarters, parts of the tech industry decided that the best way to measure AI adoption was to count how many tokens engineers burned. This trend was called tokenmaxxing.
Meta built an internal leaderboard ranking all 85,000+ employees by token consumption. Top users earned titles like "Session Immortal" and "Token Legend." In one 30-day period, Meta employees burned through 60.2 trillion tokens, a number that, at standard Anthropic API pricing, would cost around $900 million. Even at enterprise discount rates, the bill likely ran to $100 million or more. A meaningful chunk of that, by all accounts, was deliberate waste.
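To put those numbers in perspective, here is a quick back-of-the-envelope calculation of the blended per-token price they imply. It uses only the three figures quoted above; the per-million rates are derived here, not reported, and the real mix of input, output, and cached tokens is unknown.

```python
# Figures quoted in the paragraph above. The per-million rates are derived
# from them for illustration and are not reported numbers.
TOKENS_BURNED = 60.2e12    # 60.2 trillion tokens in one 30-day period
LIST_PRICE_BILL = 900e6    # roughly $900 million at standard API pricing
DISCOUNTED_BILL = 100e6    # roughly $100 million at enterprise discount rates


def price_per_million(total_cost: float, total_tokens: float) -> float:
    """Return the implied blended price in dollars per million tokens."""
    return total_cost / (total_tokens / 1e6)


list_rate = price_per_million(LIST_PRICE_BILL, TOKENS_BURNED)
discounted_rate = price_per_million(DISCOUNTED_BILL, TOKENS_BURNED)

print(f"Implied list rate: ${list_rate:.2f} per million tokens")
print(f"Implied discounted rate: ${discounted_rate:.2f} per million tokens")
```

That works out to about $15 per million tokens at list price and under $2 per million at the discounted estimate. Even at the lower rate, waste at this volume is a nine-figure line item.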
Microsoft ran a similar leaderboard. Salesforce set minimum monthly token spend targets and made everyone's spend visible to their teammates.
The message was clear: use enough tokens, or get flagged. But the results were also entirely predictable.
The Pragmatic Engineer reports this quote from a software engineer at Microsoft:
“I am conscious of not wanting to be seen as ‘uses too little AI,’ and I’m not ashamed to say I need to do tokenmaxxing to do this. Things I do to inflate my token usage metrics: Ask AI questions about the code already in the documentation. The AI pulls up the documentation, processes it, and gives me results 10x slower, but while burning lots of tokens. I could use ‘readthedocs’ [an internal product], but then my token numbers would be lower. Ask the AI to prototype a feature that I have no intention of working on. Prompt it a few more times, then throw the whole thing away. Default to always using the agent, even when I know I could do the work by hand much faster. Then watch it fail.”
This is the logical output of a badly designed incentive: an engineer who knows exactly what they're doing, knows it's wasteful, and does it anyway because the alternative is being tagged as insufficiently AI-native.
This isn't the first time the industry has run headfirst into Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Tokenmaxxing is the spiritual successor to "lines of code shipped": a regression to the pre-DORA era of developer productivity measurement, where what got counted was an input to the process, not an outcome of it.
Token leaderboards are already coming down, and this trend is, hopefully, in the rear-view mirror. What replaces it matters, and the case for token optimization is stronger than most teams realize.
Three reasons token optimization is now inevitable
There are three structural forces making the correction from tokenmaxxing to token optimization inevitable:
The cost will catch up
Tokenmaxxing as a productivity measure might seem merely wasteful while AI providers are still subsidizing adoption. It will be fiscally irresponsible and embarrassing to justify when the bills start reflecting what inference actually costs.
AI providers are pricing tokens to drive developer adoption, not to recover their infrastructure costs. The economics of training and running large models at scale are still deeply unfavorable, and every indication is that the current pricing represents a land-grab phase, not a sustainable market rate. When prices eventually reflect real costs, teams that have built workflows around burning tokens freely will face a reckoning they haven't budgeted for.
Even now, CFOs are starting to see AI tooling costs appear as a significant line item without a corresponding improvement in the metrics most engineering managers care about. The data doesn't make a comfortable case for continued spending at this level: incidents per PR up 242%, bugs per developer up 54%, and PR revert rates flat despite record token consumption. The productivity gain is real at the individual level (more code shipped, more tasks closed), but more tokens are not buying more reliable, higher-quality software.
More tokens, same problems
The Jellyfish AI Engineering Trends data makes the point plainly: engineers with the largest token budgets produced the most pull requests, but productivity improvement didn't scale with token spend. The correlation between token consumption and output quality simply isn't there.
Tokenmaxxing behavior (e.g., running agents on loosely specified tasks or burning through context with throwaway prototypes) is part of the problem. But the deeper issue is structural.
We are trying to solve a data problem with more tokens and better models. An AI agent is only as good as the environment it operates in, and it requires precise, complete runtime data about your system to reason correctly about what broke and how to fix it.
When you bolt an AI agent onto an observability stack designed to give humans dashboards for assessing system health, the agent inherits all the limitations of that architecture: sampled and aggregated data, siloed context, and no correlation across the system boundaries where complex failures actually live.
Closing that gap requires moving away from passive data dumping and restructuring the data layer for machine consumption: capturing full-fidelity, pre-correlated session data at the source, before it ever reaches an agent.
The environmental cost is real and growing
This one gets less attention, but it will increasingly show up in enterprise decision-making. Large-scale AI inference carries a meaningful energy footprint (data center power consumption, water cooling, hardware cycling) and the numbers are growing as adoption scales. The data is still noisy and methodologies vary, but the direction is unambiguous.
This matters for two reasons. First, many enterprises have made public sustainability commitments that will eventually come into conflict with "maximize token spend" as an engineering culture norm. Second, regulators in the EU and elsewhere are beginning to require disclosure of AI energy use in corporate reporting. "We ran 60 trillion tokens last month" is going to read differently in a sustainability report than it does on a leaderboard.
What token optimization actually looks like in practice
Token optimization is about being precise: giving agents exactly what they need to do their best work, and nothing more. Here are four concrete strategies engineering teams are already using.
Give agents the right data, not all the data
The most direct path to token waste is also the most common: feeding agents everything available and hoping they'll figure out what's relevant. An agent debugging a production bug doesn't need everything the observability stack collected since last Tuesday. It needs the unsampled, full-stack session data scoped to the specific failure: pre-correlated, deduplicated, ready to reason about.
This is the data problem that runs underneath the token problem. When agents work from sampled, aggregated, siloed observability data, they burn tokens reconstructing context that was never complete to begin with, and still produce fixes that miss the actual root cause. Precision beats volume. Not just for cost reasons, but because the quality of the output depends entirely on the quality of the input.
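As a rough illustration of what "scoped to the specific failure" can mean, here is a minimal sketch that narrows a pile of telemetry down to one session and one time window, and drops repeated messages. The `Event` shape, the `session_id` correlation key, and the 30-second window are all assumptions made for this example, not a description of any particular observability product. It also filters after collection, whereas the argument above is for doing this correlation at the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    session_id: str   # hypothetical correlation key shared across services
    service: str
    timestamp: float  # seconds since epoch
    message: str


def scope_to_failure(events, session_id, failure_time, window_seconds=30.0):
    """Keep only events from the failing session near the failure, deduplicated."""
    seen = set()
    scoped = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.session_id != session_id:
            continue  # unrelated session
        if abs(event.timestamp - failure_time) > window_seconds:
            continue  # too far from the failure to be useful
        key = (event.service, event.message)
        if key in seen:
            continue  # repeated line: more tokens, no new information
        seen.add(key)
        scoped.append(event)
    return scoped


events = [
    Event("s-1", "web", 100.0, "POST /checkout"),
    Event("s-1", "api", 100.2, "payment timeout"),
    Event("s-1", "api", 100.4, "payment timeout"),  # duplicate message
    Event("s-2", "web", 100.1, "GET /home"),        # different session
    Event("s-1", "web", 10.0, "GET /login"),        # outside the time window
]

context = scope_to_failure(events, session_id="s-1", failure_time=100.3)
print(len(events), "events collected,", len(context), "sent to the agent")
```

Here five collected events shrink to two. The agent sees the checkout request and the payment timeout, and nothing else.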
Use the right model for the right task
Frontier models are extraordinary at complex reasoning, architectural decisions, and multi-step problem solving. They're also expensive, and they're not necessary for most of what agents actually spend time on.
A practical token optimization strategy treats model selection like any other resource allocation decision. Use a smaller, faster, cheaper model for boilerplate generation, code formatting, routine refactoring, and low-stakes tasks. Reserve frontier models for the work that actually requires them: root cause analysis, architectural judgment, anything where the reasoning chain is long and the stakes of getting it wrong are high. You don't need a Ferrari to go grocery shopping.
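One way to make that allocation explicit is a small routing function that sits in front of the agent. The sketch below is only an illustration: the model names are placeholders, the task categories are lifted from the paragraph above, and sending unknown task types to the stronger model is a design choice you may want to reverse.

```python
# Placeholder tier names; substitute whatever models your provider offers.
SMALL_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

# Task categories taken from the paragraph above.
ROUTINE_TASKS = {"boilerplate", "formatting", "routine_refactor"}
HARD_TASKS = {"root_cause_analysis", "architecture"}


def choose_model(task_type: str, high_stakes: bool = False) -> str:
    """Pick the cheapest tier that is appropriate for the task."""
    if high_stakes or task_type in HARD_TASKS:
        return FRONTIER_MODEL
    if task_type in ROUTINE_TASKS:
        return SMALL_MODEL
    # Assumption: for unrecognized task types, a wrong answer costs more
    # than the price difference, so fall back to the stronger model.
    return FRONTIER_MODEL


print(choose_model("formatting"))
print(choose_model("root_cause_analysis"))
print(choose_model("formatting", high_stakes=True))
```

The useful part is less the function itself than the fact that the routing rules are written down, reviewable, and cheap to change.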
Context minimalism
Cat Wu, Head of Product for Claude Code at Anthropic, calls herself "a context minimalist": tell the model only what it needs, then let it choose the route. Every file read, rule, hook, skill, and subagent changes what the model is carrying around, and the more it carries, the more tokens every exchange costs.
This represents a maturity shift in how teams think about working with agents. The early instinct was context engineering: curate everything you might possibly need and hand it all over. The current approach tends more toward context minimalism: trust the agent to find what it needs, and resist the urge to front-load the context window with everything that could conceivably be relevant.
Anthropic's own team confirmed this shift when they deleted around half the system prompt for Claude Code: the information wasn’t wrong, but newer models no longer needed it spelled out.
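The claim that extra context raises the cost of every exchange is easy to check with arithmetic. The sketch below assumes a simple model in which the full conversation history is re-sent as input on each turn, and it ignores prompt caching and context compaction, both of which reduce the real bill. The token counts are made up for illustration.

```python
def cumulative_input_tokens(standing_context: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens when the full history is re-sent on every turn."""
    total = 0
    history = standing_context       # rules, skills, files loaded up front
    for _ in range(turns):
        history += tokens_per_turn   # new messages and tool output this turn
        total += history             # the whole history is billed as input
    return total


# Illustrative numbers only: a lean setup versus a front-loaded one.
lean = cumulative_input_tokens(standing_context=2_000, tokens_per_turn=500, turns=20)
heavy = cumulative_input_tokens(standing_context=20_000, tokens_per_turn=500, turns=20)

print(f"lean:  {lean:,} input tokens")
print(f"heavy: {heavy:,} input tokens")
print(f"extra: {heavy - lean:,} input tokens")
```

Under these assumptions, the 18,000 extra tokens of standing context are paid for 20 times over: 360,000 additional input tokens in a single session, before the agent has done anything differently.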
Teach your agents to be less verbose
Agents left to their own devices write more code than necessary. They install (or create!) libraries for problems the standard library already solves. They create abstractions for things that don't need abstracting. They generate fifty lines when five would do… and every line costs tokens to produce and tokens to review.
Skills like Ponytail and Caveman are built around the YAGNI (You Aren't Gonna Need It) philosophy. Ponytail, for example, forces the agent to work through a hierarchy before writing a single line: does this need to exist? Does the standard library handle it? Is there a native platform feature? Is there an existing dependency? Only after exhausting those options does it write code, and then only the minimum that works.
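The hierarchy described above is essentially a decision ladder. The sketch below shows the shape of that ladder in plain Python. It is not Ponytail's implementation: the `Task` fields and the returned strings are hypothetical, and in a real skill these questions are answered by the agent's own investigation rather than by boolean flags.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    needed: bool = True                # does this need to exist at all?
    in_stdlib: bool = False            # does the standard library handle it?
    native_feature: bool = False       # is there a native platform feature?
    existing_dependency: bool = False  # is there an existing dependency?


def resolve(task: Task) -> str:
    """Walk the ladder top to bottom and stop at the first rung that applies."""
    if not task.needed:
        return "skip: not needed"
    if task.in_stdlib:
        return "use the standard library"
    if task.native_feature:
        return "use the native platform feature"
    if task.existing_dependency:
        return "use the existing dependency"
    return "write the minimum code that works"


print(resolve(Task("left-pad a string", in_stdlib=True)))
print(resolve(Task("custom billing rule")))
```

Writing new code is the last rung, not the first, which is the whole point of the ordering.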
The benchmarks are striking: 80-94% less code, 47-77% less cost, and 3-6x faster than an unconstrained agent. The project's tagline captures the philosophy better than any benchmark: "the best code is the code you never wrote."
Final thoughts
Every technology platform goes through this arc. The gold rush phase is about maximizing usage, learning what the tools can do, and pushing limits. The discipline phase is about doing more with less.
The token leaderboards are already coming down. The correction is underway. Token optimization is the natural next step in learning to work with these tools well. The teams that figure it out first will have lower AI bills but, more importantly, they'll have better software.
One copy/paste in your terminal and the debugging agent is running:
npm install -g @multiplayer-app/cli && multiplayer