— Blog / Engineering

JSON to CSV from a large file.

Engineering Marek Holub May 12, 2026 9 min read

Marketing wants the user export as a CSV. The file is 8 GB. Pandas refuses to open it. Here's how to convert that huge JSON to CSV without trying to fit a large file into your laptop's RAM.

When JSON-to-CSV is the right tool

Before you start writing parsers, decide whether CSV is even the right destination. JSON to CSV works when the data inside your large file is a homogeneous array of flat-ish records: events, orders, users, log lines, telemetry pings. One row per record, one column per field, repeat ten million times.

It stops working when the records are deeply nested, polymorphic, or schema-drifty. If half your objects have address.geo.lat and half don't, you'll either end up with a CSV that's half empty cells or a flattening rule that quietly loses information. Worse: if a field is sometimes a string and sometimes an array, CSV will pick a fight with you.

For analytical loads — anything you'll later JOIN or aggregate at scale — Parquet or Arrow is almost always a better answer than CSV. Columnar, typed, compressed. But CSV is still the lingua franca for Excel users, ad-hoc psql \copy, BigQuery imports, and the marketing team. So let's get the bytes out cleanly.

Why Pandas, Excel, and `json2csv` break on huge files

The standard JSON-to-CSV toolkit assumes the file fits in memory. Once you cross a few gigabytes, every default tool starts losing:

pandas.read_json('file.json') loads the entire file into Python's heap, then materializes a DataFrame on top of it. An 8 GB JSON expands to ~30 GB of CPython objects before to_csv even runs. The OOM killer wins.
json2csv (npm) is single-threaded JavaScript with Node's default 1.5 GB heap. It chokes well before 4 GB. --max-old-space-size=16000 delays the funeral, not the death.
Excel caps at 1,048,576 rows per sheet, won't open a 2 GB anything, and silently truncates without a warning your boss will believe.
jq -r '... | @csv' works in principle. In practice jq reads the whole input into its own value model before evaluating, and on multi-gigabyte files you're staring at minutes-to-hours of CPU. There's --stream, which we'll get to.

The pattern is the same in every case: the tool wants a value, not a stream. JSON is hostile to streaming because the syntax has no record terminator — only structural depth — so naive line-readers can't help you. You need a parser that knows where the array boundaries are.

The streaming approaches

Four ways out, in increasing order of how much code you have to write.

1. Python's `ijson` + the `csv` module

ijson is an iterative parser. You hand it a file and a prefix; it yields events as it walks the tree. Combined with csv.DictWriter, you get bounded memory.

import ijson, csv
with open('orders.json', 'rb') as f, \
     open('out.csv', 'w', newline='') as g:
    w = csv.DictWriter(g, fieldnames=['id','amount','status'])
    w.writeheader()
    for row in ijson.items(f, 'data.orders.item'):
        w.writerow({k: row.get(k) for k in w.fieldnames})

Works for any file size. The cost: you write the prefix path by hand, you write the column projection by hand, and you eat Python interpreter overhead on every record. Expect roughly 20–80 MB/s, depending on Python version and record shape. Fine for a one-off; painful at multi-GB scale.

2. `jq --stream`

jq's streaming mode emits [path, value] pairs instead of building values. You can rebuild rows from the pair stream, but the queries get hostile fast — the canonical "stream then re-aggregate" idiom is six lines of fromstream(truncate_stream(...)) that nobody enjoys writing. Possible. Rarely pleasant.

3. Custom Rust / Go with a streaming parser

serde_json::StreamDeserializer, encoding/json.Decoder, simdjson's iterate API — all of these let you walk records without buffering. Fast and tight. The downside is obvious: you wrote a tool. For one ad-hoc extraction, that's a bad trade.

4. A purpose-built CLI for the slicing, GUI for the CSV

This is the niche jb fills. jsonbolt's CLI mmaps the file, runs the SIMD parser over the structural bytes only, walks a path pattern, and streams the matched subtree as JSON or JSONL. CSV/XML conversion lives in the desktop app — "Convert JSON to CSV and XML" is a built-in step on every tier, including free Personal. The split is deliberate: the CLI fits in pipelines (one binary, plain/jsonl/json out, predicate-aware), and the GUI handles the format conversion with a preview before you commit.

# pull a subtree (path + predicate) into JSONL, then convert to CSV
jb search --where '.status == "paid"' --path '.data.orders[*]' --emit object orders.json > paid.jsonl

# open paid.jsonl in the desktop app and use File → Convert to CSV,
# or pipe through jq -r '@csv' / csvkit for a fully-CLI route

The jb binary is the same parser that powers the desktop viewer, so you get the ~2 GB/s structural throughput from the parsing post on the command line. mmap means RAM tracks 1:1 with the working set, not with file size; streaming output means downstream tools start working before the file is fully scanned.

A worked example: 8 GB `orders.json` → `orders.csv`

Marketing dropped orders.json into S3. It's 8 GB. They want a CSV of paid orders with four columns. Here's the boring sequence.

Step 1. See the shape. jb schema walks the file and prints a tree of keys with types:

$ jb schema orders.json
# Tree shape with type info. `jb shape` is an alias.

Step 2. List every distinct path in the file so you can pick the right one. jb paths is an alias for jb schema --paths-only; output paths come back in jq style, ready to paste back into the next command:

$ jb paths orders.json | head
.
.data
.data.orders
.data.orders[]
.data.orders[].id
.data.orders[].customer_id
.data.orders[].amount
.data.orders[].status
.data.orders[].paid_at
.data.orders[].items[]

Step 3. Filter and emit the matched objects as JSONL. --where applies the predicate at the streaming layer, so filtered-out records never reach the writer. --path scopes the search to a subtree pattern, and --emit object returns whole records rather than just paths:

$ jb search \
    --path '.data.orders[*]' \
    --where '.status == "paid"' \
    --emit object \
    --format jsonl \
    orders.json > paid.jsonl

Step 4. Convert to CSV. Two clean routes — keep the GUI on the laptop and the CLI on the bastion:

# Route A: GUI. Open paid.jsonl in jsonbolt, "Convert JSON to CSV and XML"
#         is a built-in step (free, included in Personal tier).

# Route B: fully CLI. jq emits CSV rows from JSONL straight to stdout.
$ jq -r '[.id, .customer_id, .amount, .paid_at] | @csv' paid.jsonl > paid.csv

# Sanity-check, always.
$ head -n 3 paid.csv
$ wc -l paid.csv

Tool / route	8 GB orders.json → CSV	Peak RAM
`pandas.read_json`	OOM at ~22 GB heap	—
`json2csv` (npm)	OOM at ~3 GB heap	—
`jq` non-stream	~14 min	17 GB
`ijson` + csv	~6 min	~80 MB
`jb search` → `jq @csv`	~45 s end-to-end	~210 MB

Numbers above are illustrative ranges, not a repeatable benchmark — disk speed and warm/cold cache state shift absolutes. The reason the SIMD-parser route is faster isn't magic: the structural pass runs once, the parser doesn't allocate a Python or jq object per record, and the only work that scales with record count is the projection step.

Schema gotchas (and how to survive them)

CSV is treacherous. JSON is more treacherous. Their intersection is where bugs live. A short field guide:

Missing keys. A record without paid_at becomes an empty cell when you list it in your jq projection. Sensible default. Confirm it matches what your downstream consumer expects.
Nested objects. Project them explicitly in the jq step: .address.city, .address.geo.lat. Inside the GUI's CSV converter, nested objects are flattened with dotted keys.
Arrays inside records. Two options. Join into a single cell ((.tags | join("|")) in jq), or explode each element into its own row by iterating in the path (.orders[].items[] instead of .orders[]). Pick deliberately; mixing the two confuses people downstream.
Mixed types per key. If a field is sometimes a number and sometimes a string, coerce to string ((.amount | tostring)) and log warnings. Silent coercion costs you a Monday.
Escaping. Commas, quotes, newlines inside string values. Follow RFC 4180 — quote the field, double the internal quotes. jq -r @csv handles this correctly. Read the output back into your target tool before you ship it; Excel and BigQuery disagree about edge cases, and they both think they're right.

The general rule. Run a head -n 5 on the CSV, then run a full wc -l, then load the first 10,000 rows into the actual destination tool. Three checks. Sixty seconds. Saves a re-export.

Live JSON → CSV for ETL pipelines

The same machinery works on NDJSON streams. If your producer is a Kafka consumer or a process emitting line-delimited JSON, pipe it through jb search for predicate filtering and then through jq for the CSV row projection:

kafka-console-consumer ... \
  | jb search --where '.event == "purchase"' --emit object - \
  | jq -r '[.ts, .user_id, .event] | @csv' >> events.csv

Same path syntax, same predicate engine, unbounded input — and CSV at the end. The full streaming-tail playbook lives in the NDJSON tail post; for batch tradeoffs against jq, see jq alternatives for large files, and for opening these things in a viewer first, opening large JSON files.

One CSV, two minutes, no Python heap

That's the whole shape of it: pick the right JSONPath, filter at the streaming layer, project the columns you want, sanity-check the output. If your toolkit is Pandas and the file is 8 GB, you need a different toolkit. Download jsonbolt — the CLI is bundled — or read the pricing page if you've already burned the afternoon waiting for jq.

← All posts jsonbolt · v1.4.2