— Blog / Engineering

JSON to CSV from a large file.

Engineering Marek Holub May 12, 2026 9 min read

Marketing wants the user export as a CSV. The file is 8 GB. Pandas refuses to open it. Here's how to convert that huge JSON to CSV without trying to fit a large file into your laptop's RAM.

When JSON-to-CSV is the right tool

Before you start writing parsers, decide whether CSV is even the right destination. JSON to CSV works when the data inside your large file is a homogeneous array of flat-ish records: events, orders, users, log lines, telemetry pings. One row per record, one column per field, repeat ten million times.

It stops working when the records are deeply nested, polymorphic, or schema-drifty. If half your objects have address.geo.lat and half don't, you'll either end up with a CSV that's half empty cells or a flattening rule that quietly loses information. Worse: if a field is sometimes a string and sometimes an array, CSV will pick a fight with you.

For analytical loads — anything you'll later JOIN or aggregate at scale — Parquet or Arrow is almost always a better answer than CSV. Columnar, typed, compressed. But CSV is still the lingua franca for Excel users, ad-hoc psql \copy, BigQuery imports, and the marketing team. So let's get the bytes out cleanly.

Why Pandas, Excel, and json2csv break on huge files

The standard JSON-to-CSV toolkit assumes the file fits in memory. Once you cross a few gigabytes, every default tool starts losing:

The pattern is the same in every case: the tool wants a value, not a stream. JSON is hostile to streaming because the syntax has no record terminator — only structural depth — so naive line-readers can't help you. You need a parser that knows where the array boundaries are.

The streaming approaches

Four ways out, in increasing order of how much code you have to write.

1. Python's ijson + the csv module

ijson is an iterative parser. You hand it a file and a prefix; it yields events as it walks the tree. Combined with csv.DictWriter, you get bounded memory.

import ijson, csv
with open('orders.json', 'rb') as f, \
     open('out.csv', 'w', newline='') as g:
    w = csv.DictWriter(g, fieldnames=['id','amount','status'])
    w.writeheader()
    for row in ijson.items(f, 'data.orders.item'):
        w.writerow({k: row.get(k) for k in w.fieldnames})

Works for any file size. The cost: you write the prefix path by hand, you write the column projection by hand, and you eat Python interpreter overhead on every record. Expect roughly 20–80 MB/s, depending on Python version and record shape. Fine for a one-off; painful at multi-GB scale.

2. jq --stream

jq's streaming mode emits [path, value] pairs instead of building values. You can rebuild rows from the pair stream, but the queries get hostile fast — the canonical "stream then re-aggregate" idiom is six lines of fromstream(truncate_stream(...)) that nobody enjoys writing. Possible. Rarely pleasant.

3. Custom Rust / Go with a streaming parser

serde_json::StreamDeserializer, encoding/json.Decoder, simdjson's iterate API — all of these let you walk records without buffering. Fast and tight. The downside is obvious: you wrote a tool. For one ad-hoc extraction, that's a bad trade.

4. A purpose-built CLI for the slicing, GUI for the CSV

This is the niche jb fills. jsonbolt's CLI mmaps the file, runs the SIMD parser over the structural bytes only, walks a path pattern, and streams the matched subtree as JSON or JSONL. CSV/XML conversion lives in the desktop app — "Convert JSON to CSV and XML" is a built-in step on every tier, including free Personal. The split is deliberate: the CLI fits in pipelines (one binary, plain/jsonl/json out, predicate-aware), and the GUI handles the format conversion with a preview before you commit.

# pull a subtree (path + predicate) into JSONL, then convert to CSV
jb search --where '.status == "paid"' --path '.data.orders[*]' --emit object orders.json > paid.jsonl

# open paid.jsonl in the desktop app and use File → Convert to CSV,
# or pipe through jq -r '@csv' / csvkit for a fully-CLI route

The jb binary is the same parser that powers the desktop viewer, so you get the ~2 GB/s structural throughput from the parsing post on the command line. mmap means RAM tracks 1:1 with the working set, not with file size; streaming output means downstream tools start working before the file is fully scanned.

A worked example: 8 GB orders.jsonorders.csv

Marketing dropped orders.json into S3. It's 8 GB. They want a CSV of paid orders with four columns. Here's the boring sequence.

Step 1. See the shape. jb schema walks the file and prints a tree of keys with types:

$ jb schema orders.json
# Tree shape with type info. `jb shape` is an alias.

Step 2. List every distinct path in the file so you can pick the right one. jb paths is an alias for jb schema --paths-only; output paths come back in jq style, ready to paste back into the next command:

$ jb paths orders.json | head
.
.data
.data.orders
.data.orders[]
.data.orders[].id
.data.orders[].customer_id
.data.orders[].amount
.data.orders[].status
.data.orders[].paid_at
.data.orders[].items[]

Step 3. Filter and emit the matched objects as JSONL. --where applies the predicate at the streaming layer, so filtered-out records never reach the writer. --path scopes the search to a subtree pattern, and --emit object returns whole records rather than just paths:

$ jb search \
    --path '.data.orders[*]' \
    --where '.status == "paid"' \
    --emit object \
    --format jsonl \
    orders.json > paid.jsonl

Step 4. Convert to CSV. Two clean routes — keep the GUI on the laptop and the CLI on the bastion:

# Route A: GUI. Open paid.jsonl in jsonbolt, "Convert JSON to CSV and XML"
#         is a built-in step (free, included in Personal tier).

# Route B: fully CLI. jq emits CSV rows from JSONL straight to stdout.
$ jq -r '[.id, .customer_id, .amount, .paid_at] | @csv' paid.jsonl > paid.csv

# Sanity-check, always.
$ head -n 3 paid.csv
$ wc -l paid.csv
Tool / route8 GB orders.json → CSVPeak RAM
pandas.read_jsonOOM at ~22 GB heap
json2csv (npm)OOM at ~3 GB heap
jq non-stream~14 min17 GB
ijson + csv~6 min~80 MB
jb searchjq @csv~45 s end-to-end~210 MB

Numbers above are illustrative ranges, not a repeatable benchmark — disk speed and warm/cold cache state shift absolutes. The reason the SIMD-parser route is faster isn't magic: the structural pass runs once, the parser doesn't allocate a Python or jq object per record, and the only work that scales with record count is the projection step.

Schema gotchas (and how to survive them)

CSV is treacherous. JSON is more treacherous. Their intersection is where bugs live. A short field guide:

The general rule. Run a head -n 5 on the CSV, then run a full wc -l, then load the first 10,000 rows into the actual destination tool. Three checks. Sixty seconds. Saves a re-export.

Live JSON → CSV for ETL pipelines

The same machinery works on NDJSON streams. If your producer is a Kafka consumer or a process emitting line-delimited JSON, pipe it through jb search for predicate filtering and then through jq for the CSV row projection:

kafka-console-consumer ... \
  | jb search --where '.event == "purchase"' --emit object - \
  | jq -r '[.ts, .user_id, .event] | @csv' >> events.csv

Same path syntax, same predicate engine, unbounded input — and CSV at the end. The full streaming-tail playbook lives in the NDJSON tail post; for batch tradeoffs against jq, see jq alternatives for large files, and for opening these things in a viewer first, opening large JSON files.

One CSV, two minutes, no Python heap

That's the whole shape of it: pick the right JSONPath, filter at the streaming layer, project the columns you want, sanity-check the output. If your toolkit is Pandas and the file is 8 GB, you need a different toolkit. Download jsonbolt — the CLI is bundled — or read the pricing page if you've already burned the afternoon waiting for jq.

← All posts jsonbolt · v1.4.2