gzip ate my byte ranges

A Parquet viewer worked in curl and on github.io, then served zero rows on the custom domain. The culprit: the CDN gzipped the file, breaking Range requests.

  • #parquet
  • #http
  • #github-pages
  • #debugging
  • #war-story

I shipped a small read-only site this morning — a static job-data dashboard on GitHub Pages. No backend, no database, no server I have to keep alive: just a Parquet file sitting next to the HTML, and a browser-side viewer that reads it directly with hyparquet. The clever part is that the viewer never downloads the whole file to list a few rows. Parquet keeps its schema and row-group metadata in a footer at the end of the file, so a reader can issue a HEAD to learn the length, then a couple of HTTP Range requests to pull just the footer and the columns it needs. Fetch kilobytes, not megabytes. Lovely.
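
As a rough sketch of the offset math behind that dance: HTTP byte ranges are inclusive on both ends, so the tail of a file is a small subtraction away once the length is known. The helper name tailRangeHeader is made up for illustration and is not part of hyparquet; the 8-byte figure is the Parquet trailer (a 4-byte metadata length followed by the 4-byte magic).

```javascript
// Sketch only: tailRangeHeader is a hypothetical helper, not a hyparquet API.
// HTTP byte ranges are inclusive on both ends, so the last `tailBytes` bytes
// of a `fileLength`-byte file run from fileLength - tailBytes to fileLength - 1.
function tailRangeHeader(fileLength, tailBytes) {
  if (!Number.isInteger(fileLength) || fileLength <= 0) {
    throw new RangeError("fileLength must be a positive integer");
  }
  if (!Number.isInteger(tailBytes) || tailBytes <= 0) {
    throw new RangeError("tailBytes must be a positive integer");
  }
  // Clamp at 0 so a file shorter than tailBytes is requested whole.
  const first = Math.max(0, fileLength - tailBytes);
  const last = fileLength - 1;
  return `bytes=${first}-${last}`;
}

// The 8-byte trailer of an 18,929-byte file:
const header = tailRangeHeader(18929, 8); // "bytes=18921-18928"
```

The value would go in the Range header of a GET. Everything here assumes the length came from a response describing the same bytes the ranged GET will be cut from.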

It worked perfectly in curl. It worked perfectly on the raw *.github.io URL. Then I pointed the custom domain at it, opened it in an actual browser, and got zero rows. Not an error page — a clean, confident, empty table. The header stats at the top of the dashboard still rendered fine. Everything looked healthy except for the part where there was no data.

This is the story of why, because the answer is a genuinely sharp corner of HTTP that I’d never been bitten by before.

The symptom that lied

The first thing that threw me was that the page wasn’t broken. It rendered. The summary cards across the top — total rows, last-updated, a couple of counts — all showed correct numbers. Only the table was empty. When a page is half-right, your brain wants to believe the wrong half is a small bug, not a different failure entirely.

So I opened the console and found the real message, thrown from deep inside the Parquet reader:

Error: parquet file invalid (footer != PAR1)

Every valid Parquet file ends with the four magic bytes PAR1. The reader seeks to the end, reads the last chunk, checks the magic, and uses it to locate the footer metadata. This error means: I went to where the footer should be, and the bytes there weren’t the footer. The file was fine — I could download it whole and parquet-tools was perfectly happy with it. But the reader, fetching it by range over this particular host, was landing on garbage.
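
For a feel of what that check amounts to, here is a minimal sketch of validating the trailer. It is not hyparquet’s actual code: readTrailer is a hypothetical name, and the only facts it leans on are the file layout (a 4-byte little-endian metadata length, then the magic) and the error text quoted above.

```javascript
// Sketch only: readTrailer is a hypothetical helper, not hyparquet's code.
// `tail` is a Uint8Array holding at least the last 8 bytes of the file.
function readTrailer(tail) {
  if (tail.length < 8) {
    throw new Error("need at least the last 8 bytes of the file");
  }
  const end = tail.length;
  // The last 4 bytes must be the ASCII magic "PAR1".
  const magic = String.fromCharCode(...tail.subarray(end - 4, end));
  if (magic !== "PAR1") {
    throw new Error("parquet file invalid (footer != PAR1)");
  }
  // The 4 bytes before the magic hold the metadata length, little-endian.
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
  const metadataLength = view.getUint32(end - 8, true);
  return { metadataLength };
}
```

Feed it eight bytes from anywhere other than the true end of the file and the magic comparison fails, which is exactly the failure mode in the console.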

Why curl was a false friend

Here’s the detail that cost me the most time: curl could not reproduce it. I’d run the range requests through curl by hand, byte offsets and all, and got back exactly the right bytes every time, on both the github.io URL and the custom domain. As far as my terminal was concerned, there was no bug.

The browser disagreed, loudly. So the difference had to be something the browser sends that curl doesn’t. There’s really only one big one:

Accept-Encoding: gzip, deflate, br

Browsers always offer compression. curl, by default, does not — unless you pass --compressed, it sends no Accept-Encoding at all, so the origin hands back the raw, identity-encoded bytes. That single missing header was the whole reason my command line and my browser were looking at two different responses to the “same” request. The moment I added --compressed to curl, it broke there too. Reproduction in hand.

The actual bug: ranges over a compressed transport

GitHub Pages sits behind a CDN, and that CDN will happily gzip assets on the fly — including, it turns out, a .parquet file. When the browser sends Accept-Encoding: gzip, the response comes back with:

Content-Encoding: gzip
Content-Length: 14087

But the uncompressed file is 18,929 bytes. Two different sizes for the same resource, and that gap is the entire bug.

Walk through what the Parquet reader does. Its mental model of the file is the uncompressed layout — that’s where PAR1 lives, at the very end, around offset 18,929. So it computes a byte range against 18,929 and asks for, say, bytes [18,888 – 18,929) to grab the footer.

Now the CDN receives that Range request and applies it to the representation it’s actually transferring — the gzip stream — not to the original file. Per the HTTP spec, a byte range names bytes of the selected representation (the thing being sent over the wire), and when Content-Encoding: gzip is in play, that representation is the compressed bytes. So a request for “bytes 18,888 onward” against a 14,087-byte compressed body is nonsense: at best you get the tail of a gzip member, at worst an unsatisfiable range.

The reader gets back some bytes (compressed gzip bytes, from the wrong place), checks for PAR1, finds it isn’t there, and throws footer != PAR1. Zero rows. The offset was computed against the uncompressed size; the bytes were served from the compressed stream. Footer and request never lived in the same coordinate system. Range requests and transparent compression don’t compose for a reader that computes offsets against the original file, and not because of anyone’s bug: it follows from the layered definition of what a byte range means. The range applies to the representation after content coding, and transparent gzip changes that representation out from under a reader that’s reasoning about the original file.
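
The coordinate mismatch is easy to see with the two sizes plugged into the range rules. This is a sketch of the resolution logic as the HTTP spec describes it, not of any particular CDN’s code; resolveRange and resolveSuffix are hypothetical names, and the first form assumes first is not greater than last.

```javascript
// Sketch of resolving "bytes=first-last" against a representation that is
// `length` bytes long. Hypothetical helper; assumes first <= last.
function resolveRange(first, last, length) {
  // A range whose first byte is at or past the end cannot be satisfied.
  if (first >= length) {
    return { status: 416 };
  }
  // A last position past the end is clamped to the final byte.
  const end = Math.min(last, length - 1);
  return { status: 206, first, last: end, bytes: end - first + 1 };
}

// Sketch of resolving a suffix range "bytes=-suffixLength".
function resolveSuffix(suffixLength, length) {
  if (suffixLength === 0 || length === 0) {
    return { status: 416 };
  }
  const first = Math.max(0, length - suffixLength);
  return { status: 206, first, last: length - 1, bytes: length - first };
}

// The half-open span [18,888, 18,929) is "bytes=18888-18928" on the wire.
const againstOriginal = resolveRange(18888, 18928, 18929);
// { status: 206, first: 18888, last: 18928, bytes: 41 }
const againstGzip = resolveRange(18888, 18928, 14087);
// { status: 416 }
const suffixAgainstGzip = resolveSuffix(41, 14087);
// { status: 206, first: 14046, last: 14086, bytes: 41 }
```

Same request, two different answers, depending only on which byte stream the server counts in. The suffix form is the "tail of a gzip member" case: it succeeds, but the 41 bytes that come back are compressed data, not the footer.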

(And the header stats that looked fine the whole time? They came from a separate little meta.json I’d written alongside the Parquet file. JSON, small, no ranges, no problem. The dashboard “said everything was fine” because the part that was fine was reading from a different file than the part that was broken. A summary sourced from somewhere else is not a health check on the thing it’s summarizing.)

The fix is boring, and that’s the point

I briefly went looking for a way to tell GitHub Pages “don’t gzip this one file” or “honor ranges on the identity representation.” You don’t get that knob on Pages. The CDN’s content-negotiation is not mine to configure.

So I stopped fighting the transport and changed the client. The file is 18 KB. The whole reason for the range dance is to avoid pulling a large file when you only need a slice — but at 18 KB there is no large file to avoid. A HEAD plus two ranged GETs is three round trips; a single GET of the entire thing is one, and it sidesteps the compression problem completely because there’s no offset math to get wrong. I fetch the whole file into an in-memory ArrayBuffer and hand hyparquet a buffer it can read with no network in the loop:

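import { parquetReadObjects } from 'hyparquet';
const res = await fetch(url); // one plain GET, no Range header
if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
const buf = await res.arrayBuffer();
// hand the reader an in-memory buffer; no Range requests, no offset math
const rows = await parquetReadObjects({ file: buf });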

Gzip on a full-body fetch is not just harmless here, it’s a help — the browser transparently inflates the complete stream and you get the correct 18,929 bytes, magic and all. Compression only bites when you slice.

Range reads earn their keep when the file is genuinely big and you control the headers. Move that Parquet to object storage where you can serve it with no Content-Encoding at all (the identity representation, with transparent compression switched off for byte-servable types), and HEAD-plus-Range is the right call again. On a static-hosting CDN that gzips behind your back, it’s a trap. Right tool, wrong host.
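
If a viewer had to make that call at runtime, the decision could be sketched roughly like this. Everything named here is hypothetical: chooseStrategy is not a hyparquet function, the 1 MiB cutoff is an arbitrary number to tune, and the sketch assumes the caller can actually see the response headers (same-origin, as on this site).

```javascript
// Sketch only: chooseStrategy and WHOLE_FILE_LIMIT are hypothetical, and the
// 1 MiB cutoff is an arbitrary assumption, not a recommendation.
const WHOLE_FILE_LIMIT = 1024 * 1024;

// `head` is a plain object of lower-cased header names from a HEAD response.
function chooseStrategy(head) {
  const encoding = (head["content-encoding"] || "identity").toLowerCase();
  const length = Number(head["content-length"]);
  const acceptsRanges = (head["accept-ranges"] || "").toLowerCase() === "bytes";

  // Any content coding means offsets would address compressed bytes.
  if (encoding !== "identity") return "whole-file";
  // Small or unknown size: one GET beats a HEAD plus ranged GETs.
  if (!Number.isFinite(length) || length <= WHOLE_FILE_LIMIT) return "whole-file";
  // Large and uncompressed: use ranges only if the server advertises them.
  return acceptsRanges ? "ranges" : "whole-file";
}
```

With the headers from this story (Content-Encoding: gzip, Content-Length: 14087) it returns "whole-file" on the first check, before size even comes into it.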

What I’m keeping

Ranges and transparent compression don’t mix, and it’s a spec subtlety, not a bug. A byte range addresses the representation as served, content coding included. The instant a proxy gzips your file, the bytes on the wire stop matching the file your reader is reasoning about, and any offset computed against the original lands in the wrong place. If you must do ranged reads, you must control Content-Encoding.

Test in a real browser, not just curl. curl’s defaults are quieter than a browser’s — no Accept-Encoding, different redirect and cookie behavior. A clean curl is not proof; it’s one client’s opinion. The bug lived entirely in the header a browser sends and curl omits, and I’d have found it an hour sooner by opening devtools first. (curl --compressed is the honest comparison.)

A green dashboard can be reading from somewhere else. My summary cards were correct and reassuring and completely irrelevant to the broken table, because they came from a different file. “The stats look fine” only means something if the stats are computed from the same source as the thing you’re trusting them to vouch for.
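
The first of those lessons can also be enforced instead of remembered. Below is a minimal sketch of a guard a ranged reader could run on each partial response before trusting the bytes. checkRangeResponse is a hypothetical name, and the sketch assumes the response headers are visible to the caller, which holds for same-origin requests but not necessarily across origins.

```javascript
// Sketch only: checkRangeResponse is a hypothetical guard.
// `headers` is a plain object of lower-cased header names.
function checkRangeResponse(status, headers, first, last) {
  if (status !== 206) {
    return { ok: false, reason: `expected 206, got ${status}` };
  }
  const encoding = (headers["content-encoding"] || "identity").toLowerCase();
  if (encoding !== "identity") {
    return { ok: false, reason: `range served with content-encoding ${encoding}` };
  }
  // Content-Range looks like "bytes 18888-18928/18929" (total may be "*").
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(headers["content-range"] || "");
  if (!match) {
    return { ok: false, reason: "missing or unparseable content-range" };
  }
  if (Number(match[1]) !== first || Number(match[2]) !== last) {
    return { ok: false, reason: `asked for ${first}-${last}, got ${match[1]}-${match[2]}` };
  }
  return { ok: true, total: match[3] === "*" ? null : Number(match[3]) };
}
```

The returned total is worth comparing against the length the offsets were computed from; a mismatch there is the two-coordinate-systems problem announcing itself before the magic check ever runs.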

The site serves its rows now. It does it with the dumbest possible fetch — one request, whole file, let the browser inflate it — and that is exactly the right amount of cleverness for 18 kilobytes.