December 15, 2024

ClickHouse's Native JSON: 6.5x Faster and I Wish I'd Had This at Baselime

For the past four years, teams building observability tools for unstructured logs have been using one of two hacks with ClickHouse to query JSON data without defining schemas upfront.

The hacks:

JSON as pairwise arrays — Uber's approach
JSON as a string — PostHog's approach

I've lived with both. At Baselime we spent a year with Boris trying to build the perfect observability tool, and let me tell you — watching queries crawl across millions of log lines while customers waited was not the vibe.

ClickHouse 24.8 shipped a native JSON type. They published a benchmark against Mongo, Elastic, and friends but didn't compare it to the old hacks. So I had to do it myself.

The Setup (Skip This)

Downloaded 100M rows from the Bluesky dataset. RIP ClickHouse Cloud's AWS bandwidth bill — did nobody tell them R2 has free egress?

wget --timestamping --directory-prefix ~/data/bluesky \
  --input-file <(seq --format "https://clickhouse-public-datasets.s3.amazonaws.com/bluesky/file_%04g.json.gz" 1 100)

Loading as a string? Easy. Loading as pairwise arrays? Had to get Claude to write me a janky Python script because I couldn't figure out how to do it in pure SQL. Loading as native JSON? One parseJSON() call. Already winning.

The Numbers

Analytics Query (Group By)

Approach	Time	Data Read
Pairwise arrays	13.19s	66.58 GB
String + JSON_VALUE	11.76s	48.61 GB
Native JSON	2.00s	6.59 GB

6.5x faster. And look at that data read difference — 6GB vs 66GB. The native type only touches the columns it needs.

Needle in Haystack (Find One Record)

Approach	Time	Data Read
Pairwise arrays	12.63s	66.58 GB
String + JSON_VALUE	10.76s	48.61 GB
Native JSON	1.65s	8.04 GB

7.7x faster. This is the "why is this customer's request failing" query. The one you run at 2am. 1.6 seconds vs 12 seconds matters.

Storage

Table	Compressed
String	19.28 GiB
Pairwise arrays	18.65 GiB
Native JSON	18.53 GiB

Basically identical. No tradeoff here.

The Syntax Is Also Just... Better?

Pairwise arrays:

SELECT string_values[indexOf(string_names, 'commit.collection')] as collection
FROM bluesky_kv

String:

SELECT JSON_VALUE(json, '$.commit.collection')
FROM bluesky_string

Native JSON:

SELECT json.commit.collection
FROM bluesky_json

That last one looks like how you'd write it if ClickHouse just understood JSON. Because now it does.

Should You Migrate?

Yes.

6-7x faster queries. Same storage. Cleaner syntax. Easier ingestion. The native JSON type is still marked "experimental" but honestly, at this point, so is running observability on ClickHouse at all — and that hasn't stopped any of us.

If you're building logging or observability tools on ClickHouse and still using the string or pairwise hacks, you're leaving massive performance on the table. I wish I'd had this at Baselime. Would've saved some late nights.

Go migrate your stuff.

100M rows from the Bluesky JSON benchmark. Queries run on ClickHouse Cloud.