LLMs Can't Count — Engineering Practices for Output Length Control
36 controlled experiments reveal: counting metric mismatch, reasoning token squeeze, and small-sample hallucination — three pitfalls stacked together cause 93% of outputs to exceed limits.
You tell an LLM to “write 2,500 words.” It can’t count.
It’s not a prompt issue. It’s not a parameter tuning issue. The autoregressive architecture predicts only the next token at each step — there’s no internal counter. The model’s “perception” of length comes from distribution patterns in training data, not precise calculation.
It took me two weeks and 36 controlled API calls to truly accept this.
I. 93% Over-Limit
In a Chinese long-text batch generation pipeline, each call requested approximately 2,500 Chinese characters (with structured outlines and contextual transitions). After running for a while, I audited 1,200+ outputs:
| Target | Samples | Over-limit Rate | Avg Overshoot |
|---|---|---|---|
| 2500 | 331 | 93% | +1,685 chars |
| 2900 | 526 | 31% | +420 chars |
| 3000 | 356 | 61% | +690 chars |
The distribution is heavily right-skewed — almost always too long, rarely too short. This isn’t random error; it’s systematic bias.
II. You and the Model Are Counting Different “Characters”
The system counted characters using len(re.sub(r'\s', '', text)) — all characters after removing whitespace. The prompt said “write 2,500 characters (字).”
The model interprets “字” as Chinese characters. The system counts all characters including punctuation, digits, and letters.
Sampling 90 outputs:
| Character Type | Percentage |
|---|---|
| CJK Characters | 84.5% |
| Punctuation | 12.1% |
| Digits + Letters + Others | 3.4% |
The model writes 2,500 Chinese characters, the system counts 2,950. The model writes 2,600 Chinese characters, the system counts 3,068 and flags it as over-limit. An 18% systematic bias — the model didn’t write too much; the measurement was wrong.
After switching to counting only CJK Unified Ideographs, the over-limit rate for target=2900 dropped from 31% to zero.
```python
def count_cjk_chars(text):
    # Count CJK Unified Ideographs plus Extensions A and B only
    return sum(
        1 for c in text
        if '\u4e00' <= c <= '\u9fff'
        or '\u3400' <= c <= '\u4dbf'
        or '\U00020000' <= c <= '\U0002a6df'
    )
```

One function change was more effective than all the prompt engineering that followed combined.
This type of bug is insidious because both sides “look correct.” The prompt says “characters,” the developer thinks “characters are characters” — each is reasonable, but combined they produce an 18% phantom bias. In multi-module systems, subtle drift in definitions of the same concept across components is a classic integration bug.
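One way to surface this kind of drift is to log both metrics side by side and watch the ratio. The sketch below is a minimal illustration: `count_all_chars` mirrors the whitespace-stripping count described above, while `metric_gap` and the sample string are hypothetical names introduced here.

```python
import re

def count_all_chars(text: str) -> int:
    """Original system metric: every non-whitespace character."""
    return len(re.sub(r"\s", "", text))

def count_cjk_chars(text: str) -> int:
    """CJK Unified Ideographs plus Extensions A and B only."""
    return sum(
        1 for c in text
        if "\u4e00" <= c <= "\u9fff"
        or "\u3400" <= c <= "\u4dbf"
        or "\U00020000" <= c <= "\U0002a6df"
    )

def metric_gap(text: str) -> float:
    """Relative inflation of the all-characters count over the CJK count."""
    cjk = count_cjk_chars(text)
    if cjk == 0:
        raise ValueError("text contains no CJK ideographs")
    return count_all_chars(text) / cjk - 1.0

# Hypothetical sample: 10 ideographs, 3 punctuation marks, 1 digit
sample = "你好，世界！这是第2次测试。"
print(count_all_chars(sample), count_cjk_chars(sample), round(metric_gap(sample), 2))
```

If the gap stays roughly constant across a batch, the disagreement is a definition problem rather than a model problem.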
III. Cutting Token Budget Makes Output Longer
The model supports thinking mode, with max_completion_tokens initially set to 10,000. Intuitively, cutting to 5,000 should shorten output.
The result was the exact opposite:
| max_tokens | thinking | Reasoning Tokens | Output Tokens | Actual CJK Chars |
|---|---|---|---|---|
| 10000 | on | 1204 | 2077 | 3041 |
| 6000 | on | 689 | 2678 | 3840 |
| 5000 | on | 502 | 2568 | 3765 |
| 10000 | off | 0 | 2988 | 4313 |
| 5000 | off | 0 | 2032 | 3005 |
When max_tokens was cut from 10,000 to 5,000 with thinking on, the character count increased from 3,041 to 3,765.
The reason: reasoning tokens and output tokens share the same budget pool. Under budget pressure, the model cuts reasoning first (1204→502), but reasoning is precisely the capability that lets the model plan overall structure and sense “when to wrap up.” With reasoning compressed, the model doesn’t get the chance to think “time to stop” and just keeps writing.
- max_tokens: 10000 → 6000 → 5000
- Reasoning tokens: 1204 → 689 → 502 (↓)
- Output tokens: 2077 → 2678 → 2568 (↑)
- Actual CJK chars: 3041 → 3840 → 3765 (↑)

The thinking=off group showed that cutting max_tokens was indeed effective (4313→3005): with no reasoning overhead, the physical cap takes direct effect.
Conclusion: On models with reasoning capability, token budget is a non-monotonic control variable. There exists a “reasoning sufficiency” threshold below which the constraint actually weakens.
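A simple guard against this surprise is to scan budget sweeps for inversions, meaning cases where a smaller budget produced a longer output. The sketch below uses the thinking=on rows from the table above; `find_inversions` is a hypothetical helper, not part of the original pipeline.

```python
from itertools import combinations

# (max_tokens, reasoning_tokens, cjk_chars), thinking=on rows from the table
RUNS = [(10000, 1204, 3041), (6000, 689, 3840), (5000, 502, 3765)]

def find_inversions(runs):
    """Return (larger_budget, smaller_budget, extra_chars) for every pair
    where the smaller budget produced the longer output."""
    ordered = sorted(runs, key=lambda r: r[0], reverse=True)
    return [
        (big[0], small[0], small[2] - big[2])
        for big, small in combinations(ordered, 2)
        if small[2] > big[2]
    ]

print(find_inversions(RUNS))
```

Each row here is a single observation, so an inversion is a prompt to collect more samples at those budgets, not proof on its own.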
IV. The Hallucination of Two Samples
After fixing the counting metric, I tried three prompt variants:
- baseline: “Target 2500, range 2000–3000”
- strict: Added “exceeding will trigger a discard and rewrite, seriously wasting compute”
- countdown: Split 2500 into four section budgets, each with a specified character count
First ran two samples per variant, and all three showed 100% compliance. Almost committed right away.
Then ran a second round of four samples per config:
| Config | Round 1 (n=2) | Round 2 (n=4) |
|---|---|---|
| strict + 10k | 100% | 25% |
| countdown + 10k | 100% | 25% |
| baseline + 10k | 100% | 0% |
Summary of 36 calls (CJK output distribution at target=2500):
- Min: 1671
- Max: 4247
- Mean: 3087
- Std Dev: 520
- Compliance (≤3000): 33%

With a standard deviation of 520, the “100%” from 2 samples is pure statistical noise.
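How little two samples say can be put in numbers with a confidence interval on the compliance rate. The sketch below computes a Wilson score interval; treating the calls as independent trials, and reading 33% of 36 calls as roughly 12 compliant outputs, are my assumptions rather than figures stated in the data.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

print(wilson_interval(2, 2))    # two out of two compliant
print(wilson_interval(12, 36))  # assumed: about 12 of 36 compliant
```

For 2 out of 2, the interval runs from about 34% up to 100%, so the result is compatible with almost any true rate. Under the same independence assumption, a true rate of 33% would still produce two compliant samples in a row about one time in nine.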
There’s an even more hidden trap: the model silently ignores the temperature parameter in thinking mode, forcing 1.0. The API doesn’t error or warn. I thought I was testing differences between temp=0.6 and temp=0.8, but both groups ran at 1.0 — all “conclusions” about temperature were void.
Iron rule of LLM experiments: verify that the parameter you changed actually took effect. An API’s silent failure is more dangerous than an explicit error.
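One cheap probe for this particular failure is a determinism check: at temperature 0, repeated calls on a short prompt should mostly agree, so many distinct outputs hint that the parameter is being overridden. This is a heuristic sketch only. `generate` is a hypothetical wrapper around your API client, the threshold is arbitrary, and some APIs are not fully deterministic even at temperature 0.

```python
from typing import Callable

def temperature_seems_applied(
    generate: Callable[[str, float], str],
    prompt: str,
    n: int = 5,
    max_distinct: int = 2,
) -> bool:
    """Call the model n times at temperature 0 and count distinct outputs.

    Returns False when the outputs vary more than max_distinct allows,
    which suggests the temperature setting is not taking effect.
    """
    outputs = {generate(prompt, 0.0) for _ in range(n)}
    return len(outputs) <= max_distinct
```

Run it once per mode (thinking on and off) before any experiment that varies temperature.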
V. 0% Success Rate for “LLM Compression”
The system had another “safety net”: when over-limit, call the LLM to “compress to under 3,000 characters.”
- Compression attempts: 174
- Successes: 0
- Success rate: 0%

Asking a model that can’t count precisely to “trim to a precise word count” can never converge.
VI. Root Cause Ranking
| Root Cause | Contribution |
|---|---|
| Counting metric mismatch (all chars vs CJK) | ~18% |
| Model’s inherent uncontrollable output rhythm | ~60% |
| LLM compression pipeline failure | ~20% |
| Post-processing escape paths (polish/review rewrites without re-validation) | ~2% |
VII. Production Solutions
1. Align Counting Metrics
Whatever the code counts, the prompt should state explicitly. If counting CJK characters, write “number of Chinese characters,” not “word count.” This step has the highest ROI.
2. Set Target in the Model’s Comfort Zone
The model naturally produces 3,000–3,500 CJK characters. Setting the target at 2,500 asks it to brake at about 70% of its natural output, which it can’t do. With a target of 2,900, the compliance range covers most of the natural interval, achieving ~95% compliance.
| Target | Natural Output | Compliance Range | Expected Rate |
|---|---|---|---|
| 2500 | 3000-3700 | 2000-3000 | ~25% |
| 2900 | 3000-3500 | 2400-3400 | ~95% |
| 3000 | 3000-3500 | 2500-3500 | ~85% |
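Because natural output shifts with the requested target, each row of such a table has to be measured on a batch generated at that target. A minimal sketch of the measurement, assuming the ±500 window implied by the ranges above; the batch values are made up for illustration and are not the article's data.

```python
def compliance_rate(lengths, target, tolerance=500):
    """Fraction of observed CJK lengths within target ± tolerance."""
    if not lengths:
        raise ValueError("need at least one observation")
    lo, hi = target - tolerance, target + tolerance
    return sum(lo <= n <= hi for n in lengths) / len(lengths)

# Hypothetical batch generated with target=2900 (illustrative values only)
batch_2900 = [2950, 3100, 3180, 3250, 3390, 3420]
print(compliance_rate(batch_2900, target=2900))
```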
3. Add Convergence Anchors to Prompts
“Reach the climax at 50%, begin wrapping up at 65%.” Not guaranteed to work perfectly, but reduces extreme deviations.
4. Over-Limit Retry with Feedback, Not Truncation
Truncation cuts logic at an arbitrary point — unacceptable quality loss. Retry with specific numbers fed back to the model — “Last attempt was 3,500, limit is 3,000, please write shorter.” More effective than blind retries.
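The retry loop can be sketched as follows. `generate` and `count` are hypothetical callables standing in for the API client and the counting function; the feedback wording follows the example above, and the attempt cap is an arbitrary choice.

```python
from typing import Callable, Optional

def generate_with_feedback(
    generate: Callable[[str], str],
    count: Callable[[str], int],
    base_prompt: str,
    limit: int,
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate with the measured overshoot fed back into the prompt.

    Returns the first text within the limit, or None if every attempt
    is over the limit, so the caller can decide what to do next.
    """
    prompt = base_prompt
    for _ in range(max_attempts):
        text = generate(prompt)
        n = count(text)
        if n <= limit:
            return text
        # Feed the concrete numbers back rather than retrying blindly
        prompt = (
            f"{base_prompt}\n\nLast attempt was {n} characters, "
            f"limit is {limit}, please write shorter."
        )
    return None
```

Passing the same counting function the validator uses keeps the feedback numbers and the acceptance check on one metric.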
5. Log Token Details
Reasoning tokens, output tokens, finish_reason, CJK character count. Provides a baseline for regression testing after model upgrades.
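A minimal sketch of such a log record, written as one JSON line per call. The class and field names are hypothetical, and the `"stop"` value for finish_reason is an assumed example; the numeric values reuse the first row of the token budget table.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class GenerationRecord:
    model: str
    max_tokens: int
    reasoning_tokens: int
    output_tokens: int
    finish_reason: str
    cjk_chars: int

def to_log_line(record: GenerationRecord) -> str:
    """Serialize one record as a JSON line for later regression comparison."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)

# finish_reason="stop" is an assumed value for illustration
record = GenerationRecord("mimo-v2.5-pro", 10000, 1204, 2077, "stop", 3041)
print(to_log_line(record))
```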
VIII. Looking Back
The essence of this problem is using a probabilistic system to enforce deterministic constraints.
LLM applications carry a common implicit assumption — “the model can follow instructions precisely.” This largely holds for classification and extraction tasks where the output space is small. But in long-text generation, the output space is exponential; requiring precise length control is like drawing a narrow band in high-dimensional space — the generation process has no such constraint mechanism.
From a control theory perspective, most LLM pipelines are open-loop systems — generate once and done. Adding retry + feedback converts open-loop to closed-loop: generate → count → evaluate → feedback → regenerate. When precision isn’t enough, iterate to compensate.
On experimental methodology: LLM output varies far more than the output of traditional software. For distributions with a std dev of 500+, you need dozens of samples to distinguish 10%-level differences. Most people run two or three cases and draw conclusions, not because they don’t understand statistics, but because API calls are expensive. But the money saved on API calls comes back doubled as production bugs.
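The "dozens of samples" figure can be checked with a standard power calculation. The sketch below uses the normal approximation for comparing two means; reading a "10%-level difference" as a shift of about 300 characters on a mean near 3,087 is my assumption, and the right-skewed distribution means the result is a rough guide only.

```python
import math
from statistics import NormalDist

def samples_per_group(
    sigma: float, delta: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate n per group for a two-sided, two-sample z-test on means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

print(samples_per_group(sigma=520, delta=300))
```

Halving the difference you want to detect roughly quadruples the required sample size, which is why small prompt tweaks are so expensive to validate.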
All data above is based on mimo-v2.5-pro. Specific numbers (over-limit rates, reasoning token allocation, natural output ranges, etc.) do not apply to other models. Core conclusions — counting metric alignment, non-monotonic reasoning budgets, iterative convergence over truncation — are methodologically universal, but thresholds need to be re-measured on target models.