Skip to content

writing

The 160× index: a 4.18-second dashboard and the COUNT(*) that ate it

My fleet dashboard quietly degraded to 4.18s. The cause: one COUNT(*) full-scanning 258k rows on every load. One index later: ~18ms, flat forever.

· 5 min read

  • #sqlite
  • #performance
  • #go
  • #war-story

Dashboards rarely die loudly. Mine got a little slower every day for a week, and because I was the only user, nobody filed a ticket. By the time I actually measured it, the fleet board was taking 4.18 seconds to load.

This is a short war story about the query that did it, the one-line fix, and the three lessons I’m keeping.

The setup

The dashboard is the front of a small status hub I wrote for my self-hosted fleet: a single-file Go service with SQLite underneath (WAL mode, pure-Go driver, distroless image). Every machine phones home once a minute with host vitals and per-container state. The hub’s /api/hosts endpoint builds the home board — one card per machine: online/offline, CPU, memory, uptime, and a live container count.

The schema is simple. Each 60-second report inserts one parent row into reports, then one child row per container into containers, keyed by report_id:

CREATE TABLE containers (
  report_id INTEGER NOT NULL, host TEXT NOT NULL, ts INTEGER NOT NULL,
  name TEXT NOT NULL, image TEXT, state TEXT, health TEXT,
  cpu_pct REAL, mem_used INTEGER, mem_total INTEGER
);
CREATE INDEX idx_containers_host_ts ON containers(host, ts);

Note the index I did create: (host, ts), for the time-series queries I knew I’d need. Hold that thought.

The creep

When the hub shipped, /api/hosts was effectively instant. Then it did its job for a week: 60-second beats, around 19 containers across the fleet, every beat appending rows. By 2026-06-08 the containers table held roughly 258,000 rows — and the board took 4.18 seconds.

That’s the insidious part. Nothing failed. No error, no log line, no alert — just latency growing in proportion to accumulated history, a few hundred milliseconds at a time. A dashboard that degrades with data volume doesn’t look broken on any single day. It just gets worse forever.

The hunt

I’ll be honest about my guesses, because they were all wrong. I suspected the latest-report-per-host self-join. I suspected the chart downsampling. I even suspected I/O overhead on the database bind-mount. The actual culprit looked too dumb to be slow.

For each host card, the handler runs one tiny query to get the live container count:

SELECT COUNT(*) FROM containers WHERE report_id = ?;

There was an index on (host, ts). There was no index on report_id. SQLite makes the consequence brutally explicit:

sqlite> EXPLAIN QUERY PLAN
   ...> SELECT COUNT(*) FROM containers WHERE report_id = 41218;
SCAN containers

A full scan of 258k rows — to count the ~19 rows belonging to one report. Multiply by one COUNT per host card, on every page load, with the table growing by tens of thousands of rows a day. That’s the whole regression: N full scans of an ever-growing table, hiding inside the most innocent-looking query in the file.

The fix

One line:

CREATE INDEX IF NOT EXISTS idx_containers_report_id ON containers(report_id);

The plan flips from a scan to a B-tree lookup:

SEARCH containers USING COVERING INDEX idx_containers_report_id (report_id=?)

/api/hosts went from 4.18 s to ~18 ms — call it 160×. But the speedup isn’t the headline. The headline is that latency is now decoupled from data volume: the COUNT costs the same at 258k rows as it will at 25 million. More retention, more machines, more history — the board stays flat. Before the index, every quiet day of normal operation made the dashboard slower; after it, time stopped being a performance input.

The price: one extra B-tree (a few megabytes, pruned along with the table) and a sliver of write amplification per container insert. On a write-light, read-heavy workload, that’s free.

Shipping it twice

Deployment was pleasantly boring. Under WAL, building the index on a 258k-row table is near-instant and readers don’t block writers, so I applied it to the live database with zero downtime — the dashboard was slow on one request and fast on the next.

Then I did the part that’s easy to skip: I added the index to the embedded schema in the Go binary — the CREATE ... IF NOT EXISTS block that self-applies on boot — with a comment explaining the regression it prevents. The live CREATE INDEX fixed today. The schema change fixed every future rebuild, fresh deploy, and disaster-recovery restore. If the fix only exists in the production database file, it isn’t a fix; it’s an oral tradition.

What I’m keeping

1. Indexes are schema, not ops folklore. An index applied by hand to a live database is a haunted artifact — the next rebuild silently regresses, and whoever debugs it gets to rediscover this post. The source-of-truth schema is the only durable home for a performance fix. (The comment now sitting in mine: any WHERE col = ? against a high-churn table needs an index on that column.)

2. Dashboards degrade silently, because latency that grows with history throws no errors. Uptime checks won’t catch it; they confirm the endpoint answers, not that it answers like it did last month. Watch latency over weeks, and treat “this query’s cost scales with table size” as a bug class of its own — the goal isn’t fast today, it’s flat as data accumulates.

3. Measure before guessing. My intuitions pointed at the clever parts of the code — joins, downsampling, I/O. The profiler pointed at a one-line COUNT. Five minutes of EXPLAIN QUERY PLAN beat a day of speculative refactoring, and it always will.

The board answers in 18 milliseconds now, with months of history behind it. Total cost of the fix: one line of SQL, a few megabytes of B-tree, and a small bruise on my confidence in my own guesses. Two of those three were worth it.