harnesslog.dev

Claude Code, AI, and development stories

hwangjungmin

1M context sounds great. Just throw everything in and let the model sort it out, right?

Not really. MRCR, a benchmark that tests whether models can actually retrieve specific information from long context, tells a different story. At 1M tokens, the previous model generation scored around 17%, and even today's best barely hits 76%. The model loses track of things far more often than you'd expect.

It’s kind of like searching for one email in an inbox with a million messages.

I keep my context around 20k. At that size, retrieval is nearly perfect and responses stay sharp. Bigger isn’t better here — tighter is.
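Keeping a running conversation under a budget like that can be done with a simple trimmer that drops the oldest messages first. This is a minimal sketch, not how any particular tool does it: the function names are hypothetical, and token counts are approximated as roughly 4 characters per token, where a real setup would use the model's actual tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[str], budget: int = 20_000) -> list[str]:
    """Keep the most recent messages that fit within the token budget.

    Walks the history newest-first, accumulating estimated token cost,
    and stops as soon as the next message would exceed the budget.
    """
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):  # newest first
        cost = approx_tokens(msg)
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

The design choice here is recency bias: when something has to go, it's the oldest turns, since they're the least likely to matter for the current request. A summarization pass over the dropped messages is a common refinement.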