The Inverted Efficiency Curve: When More Work = More Speed
In computer science, we are used to "Linear Time" ($O(n)$). If you double the input, you double the work, and usually, you double the time taken.
In the world of LLM inference, we intuitively expect the same. Processing a 32,000 token prompt should feel heavier and slower than processing a 4,000 token prompt.
But when we fired up The Neural Lab benchmarks this week, two of our rigs defied physics.
The Anomaly
We tested Google's Gemma 3 (4B) across our fleet. We measured Prompt Evaluation Speed—how fast the chip digests the text you send it before it starts speaking.
Here is the data that made us double-check our sensors:
| Rig | Prompt eval @ 4k ctx (t/s) | Prompt eval @ 32k ctx (t/s) | Change |
|---|---|---|---|
| M4 Max | 1,540 | 1,601 | 🟢 +4% |
| RTX 4060 | 2,990 | 3,366 | 🟢 +12% |
Read that again. The Compact Striker (RTX 4060) actually got 12% faster when we increased the workload by 8x.
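Want to sanity-check numbers like these on your own rig? Here is a minimal sketch using the llama-cpp-python bindings (not our exact harness, just an illustration). The model filename is a placeholder, and each timing includes one generated token, which is negligible next to whole seconds of prompt eval:

```python
import time
from llama_cpp import Llama

# Placeholder path: point this at your own GGUF file.
llm = Llama(
    model_path="gemma-3-4b-it-Q4_K_M.gguf",
    n_ctx=32768 + 128,   # headroom beyond the largest prompt
    n_gpu_layers=-1,     # offload every layer to the GPU
    n_batch=2048,        # feed the GPU big chunks of prompt at a time
    verbose=False,
)

# Build synthetic prompts of the desired token counts.
filler = llm.tokenize(b"The quick brown fox jumps over the lazy dog. ")

for n in (4096, 32768):
    prompt = (filler * (n // len(filler) + 1))[:n]
    llm.reset()  # drop cached KV state so the full prompt is re-evaluated
    t0 = time.perf_counter()
    llm.create_completion(prompt, max_tokens=1)  # forces a full prompt eval
    dt = time.perf_counter() - t0
    print(f"{n:>6}-token prompt: {n / dt:,.0f} t/s")
```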
The Explanation: Compute Saturation
Why does this happen? It comes down to Fixed Overhead vs. Parallel Saturation.
Modern GPUs (and Apple's M-series chips) are massive parallel calculators: thousands of cores waiting to do math. When you send a "small" batch of data (like 4,000 tokens), the time spent preparing the data, dispatching instructions, and managing memory (the "Overhead") is a significant share of the total wall-clock time. The cores spend much of it sitting idle, waiting for orders.
When you send a massive 32,000-token batch, you finally give those cores enough work to chew on. The overhead stays the same, but it is now amortized over a much larger volume of useful math, and chip utilization climbs from roughly 70% to nearly 99%.
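You can see the amortization in a toy model. If an $n$-token prompt costs a fixed setup time $c$ plus $n / r$ seconds of saturated compute, the effective throughput is $n / (c + n/r)$. Here is a sketch with purely illustrative constants, hand-picked to land near the RTX 4060's 4k figure (they are not measurements):

```python
# Toy model: total_time(n) = c + n / r
#   c = fixed per-batch overhead (illustrative guess, not a measurement)
#   r = saturated compute rate in tokens/sec (also illustrative)
C_OVERHEAD_S = 0.2
R_PEAK_TPS = 3500

def throughput(n_tokens: int) -> float:
    """Effective prompt-eval speed once fixed overhead is included."""
    return n_tokens / (C_OVERHEAD_S + n_tokens / R_PEAK_TPS)

print(f"{throughput(4096):,.0f} t/s at 4k")    # ~2,990 t/s
print(f"{throughput(32768):,.0f} t/s at 32k")  # ~3,427 t/s
```

The bigger batch amortizes the same 0.2 s of overhead over eight times as much math, so measured throughput rises even though total work went up. (The real 32k number lands a bit lower than the toy model predicts because attention cost also grows with context; more on that below.)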
The Counter-Example: The Veteran
To prove this, look at our legacy hardware, The Veteran (GTX 1080):
- 4k Context: 1,259 t/s
- 32k Context: 984 t/s (🔴 -22%)
The older Pascal architecture lacks the massive parallel throughput of modern Ada Lovelace or M4 chips, so it saturates early (likely somewhere around 1k-2k tokens). Past that point there is no overhead left to amortize, while the attention cost of every new token keeps climbing with context length. Pile on more work and it simply chokes.
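The same toy model shows the choke if you add an attention term, whose cost grows roughly quadratically with context length: $n / (c + n/r + k n^2)$. The constants below are hand-fitted to the Veteran's two data points, purely to show the shape of the curve, not to measure anything:

```python
# Extended toy model: total_time(n) = c + n / r + k * n**2
# Constants hand-fitted to the GTX 1080's two data points (illustrative only).
C_OVERHEAD_S = 0.0   # saturates so early that overhead barely matters
R_PEAK_TPS = 1310    # modest saturated compute rate
K_ATTN = 7.7e-9      # quadratic attention cost, seconds per token^2

def throughput(n_tokens: int) -> float:
    """Prompt-eval speed once the quadratic attention term bites."""
    total = C_OVERHEAD_S + n_tokens / R_PEAK_TPS + K_ATTN * n_tokens**2
    return n_tokens / total

print(f"{throughput(4096):,.0f} t/s at 4k")    # ~1,258 t/s
print(f"{throughput(32768):,.0f} t/s at 32k")  # ~985 t/s
```

The modern chips pay the same quadratic tax, but at these sizes the overhead savings still outweigh it; on hardware that saturates early there are no savings left, so throughput can only fall.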
Conclusion
If you own modern hardware (RTX 40-series, M3/M4), do not be afraid of large contexts.
Your hardware was built to move mountains. If you only ask it to move a pebble, it spends half its time wondering why you woke it up.