
Feeding large JSON to an LLM without blowing the context window.

Engineering · Marek Holub · May 12, 2026 · 9 min read

Pasting a 200MB API response into ChatGPT is a fast way to burn dollars and earn a context-window error. Here's how to feed real-world JSON to an LLM without either.

Why JSON destroys LLM context windows

JSON was designed for machines reading machines. It was not designed to be cheap in tokens. Every {, every ", every duplicated key is paid for in the same currency as your actual signal. A pretty-printed 100KB JSON file lands in the 30k–40k token range under cl100k-style tokenizers — and that's before anything interesting happens.

Three things make it worse:

Repeated keys. An array of 5,000 objects with the same 12 keys repeats 60,000 key names, each costing at least one token plus its quotes and colon, before you read a single value.
Long string values. A single base64-encoded image, a stack trace, a 50KB log body — one field can eat a quarter of your context.
Whitespace. A pretty-printer adds 20–40% to your token count for the visual benefit of a machine that doesn't have eyes.

Then you hit the wall. Claude Sonnet and Opus cap at 200k tokens. GPT-4o caps at 128k. Gemini 1.5 and 2.5 will take a million, but you're paying per token the whole way down. At Sonnet's input rate, a single 200k-token query is roughly $0.60. Run that 100 times a day in a debugging loop and you've spent $60 on JSON you didn't read.
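The cost arithmetic above is easy to redo for your own numbers. A minimal sketch, assuming an input price of $3 per million tokens (the rate implied by the $0.60 figure, not a quoted price list) and ignoring output tokens:

```python
# Assumed price: $3 per million input tokens, the rate implied by the
# post's "$0.60 for 200k tokens" figure. Check your provider's price list.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-side cost of one request, ignoring output tokens."""
    return tokens / 1_000_000 * price_per_million


per_query = input_cost_usd(200_000)
per_day = per_query * 100  # 100 queries in a debugging loop

print(f"per query: ${per_query:.2f}")
print(f"per day:   ${per_day:.2f}")
```

Swap in your own token count and price to see where a debugging loop starts to hurt.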

Five strategies for shrinking JSON for an LLM

None of these are clever. All five compound.

1. Strip whitespace. Compact-print the JSON before you do anything else. On pretty-printed input, this is a free 20–40% win. Anything that reads JSON can produce compact JSON; the trick is remembering to do it before you paste.

2. Drop noisy keys. The model does not need created_at_microseconds, the SHA of an internal request ID, an auth header echo, or a base64 thumbnail. Use schema inference to find which keys carry data and which carry plumbing. Plumbing keys are usually the longest values.

3. Sample, don't include. If you have an array of 50,000 homogeneous objects, the model learns nothing extra from object 49,997 that it didn't already learn from objects 1–5. Send the schema and a random sample of 3–5 records. State that it is a sample. The model is fine with that.

4. Filter by predicate. When you're debugging, you don't want the whole dataset. You want the records where status == "error", or where retry_count > 3. Filter server-side or pipeline-side before the JSON ever reaches a chat box.

5. Truncate long values. A 50KB log body is not 50KB of signal. The first 256 bytes — the error class, the start of the message — is enough for the model to reason about it. Truncate with a marker like "…(truncated)" so the model knows it's not getting the full string and won't confabulate the rest.
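The five strategies can be sketched in plain Python with only the standard library. This is an illustration under stated assumptions, not how any particular tool implements them: the record shape, the key names, and the helper names `drop_keys`, `truncate_strings`, and `shrink` are all hypothetical.

```python
import json
import random

MARKER = "…(truncated)"


def drop_keys(node, noisy):
    """Strategy 2: remove the named keys at any depth."""
    if isinstance(node, dict):
        return {k: drop_keys(v, noisy) for k, v in node.items() if k not in noisy}
    if isinstance(node, list):
        return [drop_keys(v, noisy) for v in node]
    return node


def truncate_strings(node, max_bytes=256):
    """Strategy 5: cap string values at max_bytes of UTF-8; other types pass through."""
    if isinstance(node, str):
        raw = node.encode("utf-8")
        if len(raw) <= max_bytes:
            return node
        # errors="ignore" drops a multi-byte character split by the cut.
        return raw[:max_bytes].decode("utf-8", errors="ignore") + MARKER
    if isinstance(node, dict):
        return {k: truncate_strings(v, max_bytes) for k, v in node.items()}
    if isinstance(node, list):
        return [truncate_strings(v, max_bytes) for v in node]
    return node  # numbers, booleans, and null are never chopped


def shrink(records, predicate, noisy, k=5, seed=0):
    """Filter, drop keys, sample, truncate, then compact-print."""
    matched = [r for r in records if predicate(r)]        # strategy 4
    cleaned = [drop_keys(r, noisy) for r in matched]      # strategy 2
    rng = random.Random(seed)                             # fixed seed: reproducible sample
    sample = rng.sample(cleaned, min(k, len(cleaned)))    # strategy 3
    payload = {
        "note": f"random sample of {len(sample)} out of {len(cleaned)} matching records",
        "sample": [truncate_strings(r) for r in sample],  # strategy 5
    }
    # Strategy 1: no whitespace between tokens.
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)


# Hypothetical records standing in for a real API dump.
records = [
    {
        "id": i,
        "status": "error" if i % 10 == 0 else "ok",
        "retry_count": i % 5,
        "trace_id": f"{i:064x}",
        "body": "TimeoutError: upstream did not respond " + "x" * 5000,
    }
    for i in range(1000)
]

pretty = json.dumps(records, indent=2)
small = shrink(records, lambda r: r["status"] == "error", noisy={"trace_id"})
print(f"{len(pretty)} chars -> {len(small)} chars")
```

The sketch loads everything into memory, which is fine for a few megabytes and wrong for a 200MB dump. The `note` field is there so the model is told it is looking at a sample.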

One number to remember. 1 KB of typical JSON ≈ 300–400 tokens. If your file shows 487 KB on disk, assume ~180k tokens of context, then start cutting.
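That rule of thumb fits in a few lines. A rough sketch, assuming decimal kilobytes and treating 300 and 400 tokens per KB as the low and high bounds; the function name is hypothetical, and a real tokenizer will give a different exact count:

```python
def estimate_tokens(size_bytes: int, low_per_kb: int = 300, high_per_kb: int = 400):
    """Return a (low, high) token range from the 300-400 tokens/KB rule of thumb."""
    kb = size_bytes / 1000  # decimal KB; the estimate is too coarse for 1000 vs 1024 to matter
    return round(kb * low_per_kb), round(kb * high_per_kb)


low, high = estimate_tokens(487_000)
print(f"487 KB is roughly {low:,} to {high:,} tokens")
# prints: 487 KB is roughly 146,100 to 194,800 tokens
```

The ~180k figure sits inside that range, toward the upper end.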

A concrete pipeline with the jb CLI

The five strategies are tool-agnostic. The friction is in stitching them together. Most people end up with a fragile chain of jq filters, python -c one-liners, and tabs-of-shame. The jb CLI was built to do the whole pipeline in one pass — see jq alone won't get you there for why we stopped using it on large files.

Start by looking at what you actually have:

# What's in this thing? Tree shape with types, one pass over the file.
$ jb schema response.json

# Every distinct key name, deduped. Spot the plumbing keys.
$ jb keys response.json
# trace_id
# request_hash
# status
# body
# ...

# Every path in the file, deduped. Spot which arrays carry the payload.
$ jb paths response.json | head

Now you know what's bloating the file. Filter to the records you care about — jb search with --where applies a predicate and --emit object returns the full matching object:

# Only the errors. The predicate evaluates against each iterated property.
$ jb search --where '.status == "error"' --emit object response.json > errors.json

# Or get just one path across the whole file, in JSONL form.
$ jb extract '.results[*].body' --format jsonl response.json > bodies.jsonl

Then run it through the model-shaped formatter. The --ai flag on jb search is a shortcut for --format jsonl --envelope --max-output 1M --max-value-bytes 256: every match becomes a JSONL line of the form {path, preview, value}, total output is capped at 1 MB, and every string value is truncated to a 256-byte preview. Long bodies stop costing you context tokens.

# --ai bundles four flags: jsonl + envelope + max-output 1M + max-value-bytes 256
$ jb search --where '.status == "error"' --ai response.json > for_llm.jsonl

# Or pipe straight to clipboard
$ jb search --where '.status == "error"' --ai response.json | clip       # Windows
$ jb search --where '.status == "error"' --ai response.json | pbcopy     # macOS

# If you need to drop specific keys from the object, pipe through jq.
# jb is for finding and slicing; transforms are jq's job.
$ jb search --where '.status == "error"' --emit object response.json \
    | jq -c 'del(.trace_id, .request_hash, .auth_echo)' > clean.jsonl

Illustrative numbers from a representative 487 MB API-dump payload — your file's mix of long-string keys vs. small-value keys will shift the bytes column, but the order-of-magnitude arc is what matters:

Stage	Bytes	Tokens (est)	Cost @ Sonnet
Raw response	487 MB	~180M	doesn't fit
Filter to `status == "error"`	14 MB	~5.2M	still doesn't fit
Drop noisy keys	3.1 MB	~1.1M	$3.30 on paper, still over a 200k window
`jb search --ai` (envelope + 256B preview cap)	54 KB	~18k	$0.05

From "won't fit" to "five cents and three seconds". The model still gets every error type, every status value, the start of every log body, and enough of the surrounding keys to reason about the shape. It just doesn't get the full stack trace from every record.

When to stream instead of paste

This whole post is about ad-hoc work: debugging an incident, exploring a new API, running a one-off eval. If you're in production — an agent loop, a RAG pipeline, a customer-facing app — you should not be hand-pasting JSON. You should be using function calling, embeddings, or a retrieval layer, all of which sidestep the context-window problem instead of fighting it.

Paste-the-JSON is a developer tool. Treat it as one.

A worked example: 200MB API response, $0.07 query

Concrete scenario. The mobile team says a personalization endpoint started returning wrong recommendations at around 02:00 UTC. They send you a 200MB dump of the response payload from a four-hour window. The dump has 73,000 user sessions, each with a recommendations array, telemetry, a feature-vector blob, and a debug log body.

You don't want to scroll through that. You want to ask Claude one question: "Look at these error sessions, what's the common pattern in the feature vectors?"

$ jb schema dump.json
# → 73000 session objects, 14 keys each, feature_vector is the largest by bytes

# Pull the bad sessions as full objects, drop the heavy keys with jq,
# then feed the result back through jb with --ai for envelope + caps.
$ jb search --where '.recommendation_quality < 0.3' --emit object dump.json \
    | jq -c 'del(.debug_log_body, .feature_vector_raw)' \
    | jb search --ai - > bad_sessions.jsonl

$ wc -c bad_sessions.jsonl
# 71 KB. About 24k tokens.

Paste bad_sessions.jsonl into a chat with a one-line prompt: "These are the personalization sessions that scored badly between 02:00 and 06:00 UTC. What's the most common feature-vector cluster among them?"

The model answers in seconds, not minutes, and at Sonnet's input rate the roughly 24k-token prompt costs about seven cents. The raw 200 MB dump wouldn't have fit in any frontier model's context window at all, and even on a hypothetical model that took it, the input cost alone would have run into the low three figures per query before the response started typing. The arithmetic is unkind, but the fix is mechanical.
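To sanity-check the before-and-after for this scenario, here is a back-of-envelope sketch. It assumes the 300 to 400 tokens/KB rule of thumb and a $3 per million input-token price; both are estimates rather than measured values, and the `estimate` helper is hypothetical.

```python
def estimate(size_bytes, tokens_per_kb, usd_per_million_tokens=3.00):
    """Return (estimated tokens, estimated input cost in USD)."""
    tokens = size_bytes / 1000 * tokens_per_kb
    return tokens, tokens / 1_000_000 * usd_per_million_tokens


cases = [
    ("raw dump, 200 MB", 200_000_000),
    ("bad_sessions.jsonl, 71 KB", 71_000),
]

for label, size in cases:
    lo_tokens, lo_cost = estimate(size, 300)
    hi_tokens, hi_cost = estimate(size, 400)
    print(
        f"{label}: {lo_tokens:,.0f} to {hi_tokens:,.0f} tokens, "
        f"${lo_cost:,.2f} to ${hi_cost:,.2f}"
    )
```

Under those assumptions the raw dump lands between $180 and $240 of input per query, while the compacted file costs single-digit cents.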

For more on dealing with the raw file before you compact it, see opening large JSON files without crashing your editor.

Pitfalls to avoid

Don't truncate numerics. A model cannot reason about 3.14159…(truncated). Numbers should be passed through whole or rounded, not chopped. --max-value-bytes (the truncation knob behind --ai) applies to string scalars.
Don't strip keys the model needs as anchors. The keys are how the model knows what each value means. Keep schema-shape keys even when you delete schema-noise keys.
Don't sample only the first N records. Early records are usually setup, not steady state. Take a random sample, or stratify by the key you care about.
Don't trust a single tokenizer estimate. Claude and GPT-4o use different tokenizers (Anthropic's BPE vs OpenAI's o200k_base), so the same JSON can differ in token count by around 15% between them. Use the model's own counter when it matters.
Don't forget output tokens. Input is only part of the bill: output tokens are billed too, typically at a higher per-token rate than input. Compacting input is only worth it if you're not asking the model to dump the whole thing back at you.
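For the sampling pitfall, stratifying by the key you care about keeps rare values from vanishing. A minimal sketch using only the standard library; the record shape and the `stratified_sample` helper are hypothetical:

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group=2, seed=0):
    """Take up to per_group random records from each distinct value of key."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key)].append(record)
    picked = []
    for value in sorted(groups, key=str):  # stable group order
        members = groups[value]
        picked.extend(rng.sample(members, min(per_group, len(members))))
    return picked


# Hypothetical data: errors are 1% of records and easy to miss in a plain sample.
records = [{"id": i, "status": "error" if i % 100 == 0 else "ok"} for i in range(1, 1001)]
sample = stratified_sample(records, "status")
print(sample)
```

A plain random sample of four records from this data would usually contain no errors at all; the stratified version always returns two of each status.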

The whole stack

Trick	Token reduction	Cost to apply
Compact-print	20–40%	one flag
Drop noisy keys	30–80%	one schema pass
Predicate filter	50–99%	one expression
Truncate long values	40–95%	built into `--ai`
All of the above, piped	typically 99%+	one command
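The reason the last row can exceed any single row is that the reductions multiply. A small sketch of that arithmetic, assuming each step removes its fraction of whatever the previous step left; in practice the steps overlap, so treat the result as a rough bound:

```python
from math import prod


def combined_reduction(reductions):
    """Overall reduction when each step removes a fraction of what is left."""
    remaining = prod(1 - r for r in reductions)
    return 1 - remaining


# Low and high ends of the four per-trick ranges in the table above.
low_end = combined_reduction([0.20, 0.30, 0.50, 0.40])
high_end = combined_reduction([0.40, 0.80, 0.99, 0.95])

print(f"low end:  {low_end:.1%}")
print(f"high end: {high_end:.3%}")
# prints: low end:  83.2%
# prints: high end: 99.994%
```

Even the pessimistic end of every range removes more than four fifths of the tokens once the steps are chained.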

JSON is verbose. LLMs charge by the token. The arithmetic is unkind, but the fix is mechanical: see the shape, drop the noise, truncate the bodies, ship the rest.

Download jsonbolt if you want the GUI + the jb CLI in one install. Free for personal use up to 50MB; Pro is $80/year if you're doing this in anger. The --ai flag is in every tier.
