ScaleGuild · Swapnil Surdi

The problem

“We need to scale” almost always gets answered with hardware before anyone measures. I wanted the opposite: pin the hardware to a fixed budget — one vCPU — and find out how far a Python API actually goes, and which knobs move the number. Not a hello-world benchmark; a service with a database, a slow external dependency, and a mixed workload, measured the same way at every step.

So I built one FastAPI service and walked it through six staged configurations, changing exactly one thing at a time, with the identical k6 load profile replayed against each.

The method

The workload is a k6 script driving six weighted endpoint types — the mix you’d see in a real API rather than one hot path: light reads, DB-bound reads, writes, and endpoints that call an external API. The external dependency is a Go mock API I wrote that serves responses with a lognormal latency distribution, because real third-party latency has a long tail and a constant sleep(200ms) teaches you nothing about it.
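To make the lognormal point concrete, here is a minimal Python sketch of how a mock could derive its latency distribution from a target median and 99th percentile. It is not the Go mock itself. The function names and the 200 ms / 2 s targets are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def lognormal_params(p50_s: float, p99_s: float) -> tuple[float, float]:
    """Return (mu, sigma) of a lognormal whose median is p50_s and whose
    99th percentile is p99_s, both in seconds."""
    z99 = NormalDist().inv_cdf(0.99)      # standard-normal 99th percentile, about 2.326
    mu = math.log(p50_s)                  # the median of a lognormal is exp(mu)
    sigma = (math.log(p99_s) - mu) / z99  # spread needed to reach p99_s at z99
    return mu, sigma


def sample_latency(rng: random.Random, mu: float, sigma: float) -> float:
    """Draw one simulated response latency in seconds."""
    return rng.lognormvariate(mu, sigma)


if __name__ == "__main__":
    # Hypothetical targets for illustration only: P50 200 ms, P99 2 s.
    mu, sigma = lognormal_params(0.2, 2.0)
    rng = random.Random(42)
    samples = sorted(sample_latency(rng, mu, sigma) for _ in range(100_000))
    print(f"p50 ~ {samples[len(samples) // 2]:.3f}s")
    print(f"p99 ~ {samples[int(len(samples) * 0.99)]:.3f}s")
```

A constant sleep collapses this whole distribution to a single point, which is why it hides the tail behavior the mock is meant to expose.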

Each configuration runs to completion, k6’s handleSummary writes a per-config JSON file, and a merge script folds them into one comparison table. A Go + bubbletea TUI watches runs live, so I can see a config collapse in real time instead of discovering it in the summary. The storage layer is SQLite, which turned out to be a finding in itself.
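A merge step along these lines can be very small. The sketch below assumes each per-config file is the raw summary object that k6 passes to handleSummary, serialized as-is, with throughput under metrics.http_reqs.values.rate and latency under metrics.http_req_duration.values["p(95)"]. The results/ directory, the file naming, and the chosen columns are hypothetical. The actual merge script may look quite different.

```python
import json
from pathlib import Path


def merge_summaries(paths: list[Path]) -> list[dict]:
    """Read one k6 summary JSON per config and return one table row per file."""
    rows = []
    for path in sorted(paths):
        metrics = json.loads(path.read_text())["metrics"]
        rows.append({
            "config": path.stem,  # e.g. "v1" from "v1.json" (assumed naming)
            "rps": metrics["http_reqs"]["values"]["rate"],
            "p95_ms": metrics["http_req_duration"]["values"]["p(95)"],
        })
    return rows


def render_table(rows: list[dict]) -> str:
    """Format the merged rows as a fixed-width text table."""
    lines = [f"{'config':<12}{'rps':>10}{'p95_ms':>12}"]
    for row in rows:
        lines.append(f"{row['config']:<12}{row['rps']:>10.2f}{row['p95_ms']:>12.1f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed layout: results/v1.json, results/v2.json, ...
    print(render_table(merge_summaries(list(Path("results").glob("*.json")))))
```

Because every row is rebuilt from the files on disk, rerunning one config only means overwriting one JSON file and merging again.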

The ladder

Same machine, same load, six configs:

v1 — sync handlers: 1.68 RPS. The baseline everyone deploys by accident. Sync handlers with blocking I/O serialize the whole service; with a slow dependency in the mix, throughput is effectively one request at a time.
v2 — async handlers: 69.6 RPS. A 41× improvement from zero new hardware. Once handlers await their I/O, the single event loop overlaps every wait, and the service is suddenly limited by work, not by blocking.
v3 — two uvicorn workers: 81 RPS. Two processes on one vCPU buy some overlap of Python CPU work with I/O — worth 16%, not another multiple.
v4 — gunicorn + UvicornWorker: 90 RPS. The standard production topology; supervision plus a small additional gain. This is the ceiling I found on one vCPU.
v5 — SQLite WAL + pragma tuning: 85 RPS. A tuned config that lost to v4 — and the honest lesson of the study. Connection-pool sizing, the knob I wanted to measure, doesn’t exist on this stack: aiosqlite runs through SQLAlchemy’s NullPool, so pool parameters are silently a no-op. SQLite simply can’t express pool-size effects; that experiment needs Postgres.
v6 — realistic external latency (designed, run pending). The mock API reconfigured to P50 4s / P99 22s — actual numbers from a slow upstream I’ve lived with. The scenario is built to demonstrate the bottleneck inversion: at those latencies the CPU goes idle and capacity becomes concurrency slots — how many requests can be parked mid-await — so the limits that matter are semaphores, timeouts, and connection caps, not RPS-per-core.
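For the v5 entry in the list above, WAL and pragma tuning amounts to a few statements issued when a connection opens. This is a minimal sketch using the standard-library sqlite3 module rather than the aiosqlite and SQLAlchemy stack the study ran on. The pragma values are illustrative assumptions, not the settings that produced 85 RPS.

```python
import sqlite3


def open_tuned(path: str) -> sqlite3.Connection:
    """Open a SQLite file database with WAL and a few common pragmas applied."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode != "wal":
        raise RuntimeError(f"WAL not enabled, journal_mode is {mode!r}")
    # Illustrative values only, not the study's configuration.
    conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs than the FULL setting
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5 s on a locked database
    return conn
```

WAL mode is stored in the database file, while synchronous and busy_timeout apply per connection, so a helper like this would need to run for every new connection.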

Decisions that mattered

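Change one variable per stage. The 41× claim is only credible because v1 and v2 differ in nothing but the handlers: blocking calls in v1, awaited I/O in v2. Many published “we scaled X” posts change five things at once, which leaves no way to attribute the gain to any one of them.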
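The size of the v1 to v2 jump is easier to believe after seeing the mechanism in isolation. The sketch below is not the service's handler code. It models the blocking case as a blocking call made on the event loop, which is an assumption about where the serialization comes from, and it uses a 50 ms sleep as a stand-in for the slow dependency.

```python
import asyncio
import time

DELAY_S = 0.05  # stand-in for one slow dependency call
REQUESTS = 10   # number of concurrent "requests" to simulate


async def blocking_handler() -> None:
    time.sleep(DELAY_S)           # blocks the event loop; nothing else can run


async def awaiting_handler() -> None:
    await asyncio.sleep(DELAY_S)  # yields to the loop while waiting


async def timed(handler) -> float:
    """Run REQUESTS handlers concurrently and return the wall-clock time taken."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(REQUESTS)))
    return time.perf_counter() - start


def run_comparison() -> tuple[float, float]:
    blocking = asyncio.run(timed(blocking_handler))
    awaiting = asyncio.run(timed(awaiting_handler))
    return blocking, awaiting


if __name__ == "__main__":
    blocking, awaiting = run_comparison()
    print(f"blocking: {blocking:.2f}s  awaiting: {awaiting:.2f}s")
```

The blocking version takes roughly REQUESTS × DELAY_S because the waits run back to back, while the awaiting version takes roughly one DELAY_S because the waits overlap.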

Make the dependency realistic. The lognormal mock means tail latency interacts with worker and connection limits the way a real vendor API does. It’s also why v6 exists as a designed scenario: the study’s end state is “the bottleneck moved,” not “bigger number.”

Treat the harness as production code. Two gunicorn workers raced through init-and-seed on startup and corrupted the seed data; the fix was an fcntl.flock around initialization so exactly one worker seeds while the other waits. Load-test harnesses have concurrency bugs too — mine just surfaced earlier than most.
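The flock pattern described here can be sketched in a few lines. This is an illustration under assumptions, not the harness code: the lock file path, the marker file used to record that seeding already happened, and the init_once name are all hypothetical, and fcntl is available only on Unix-like systems.

```python
import fcntl
import os


def init_once(lock_path: str, marker_path: str, seed) -> bool:
    """Run seed() in exactly one process; return True if this call did the seeding."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            if os.path.exists(marker_path):
                return False                   # another worker already seeded
            seed()
            with open(marker_path, "w") as marker:
                marker.write("seeded\n")       # record completion for later workers
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The second worker blocks on LOCK_EX until the first one finishes, then sees the marker and skips seeding. Checking the marker only after acquiring the lock is what closes the race.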

Keep results mergeable. Per-config handleSummary files plus a merge script meant every rerun slotted into the same comparison table. No spreadsheet archaeology.

Lessons

The headline is the shape of the curve: 1.68 → 69.6 RPS came from fixing the concurrency model; 69.6 → 90 came from process topology; hardware contributed nothing because I never added any. If your Python API is slow, the order of operations is async correctness first, worker topology second, and only then talk about machines.

The second lesson is about measurement honesty. v5 regressing below v4 is a more useful data point than another increment would have been — it exposed that my storage layer couldn’t answer the question I was asking. And the pending v6 run reframes scaling entirely: once your upstream’s P50 is 4 seconds, you’re not scaling throughput anymore, you’re managing in-flight concurrency. That inversion is where most real systems actually live.
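Managing in-flight concurrency usually comes down to two limits: how many upstream calls may be parked at once, and how long each is allowed to wait. Little’s law gives a rough sizing, with in-flight requests equal to arrival rate times mean latency, so a hypothetical 10 requests per second against a 5-second mean parks about 50 requests at a time. The sketch below is one way to express both limits with asyncio. The UpstreamGate name and its parameters are assumptions for illustration, not part of the study’s code.

```python
import asyncio


class UpstreamGate:
    """Cap in-flight upstream calls and bound how long each may take."""

    def __init__(self, max_in_flight: int, timeout_s: float) -> None:
        self._slots = asyncio.Semaphore(max_in_flight)
        self._timeout_s = timeout_s
        self.in_flight = 0  # calls currently holding a slot
        self.peak = 0       # highest in_flight value observed

    async def call(self, make_request):
        """Await make_request() inside a slot; raises asyncio.TimeoutError on timeout."""
        async with self._slots:  # wait here if all slots are taken
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await asyncio.wait_for(make_request(), self._timeout_s)
            finally:
                self.in_flight -= 1


async def demo() -> int:
    gate = UpstreamGate(max_in_flight=3, timeout_s=1.0)

    async def fake_upstream() -> str:
        await asyncio.sleep(0.01)  # stand-in for a slow upstream call
        return "ok"

    await asyncio.gather(*(gate.call(fake_upstream) for _ in range(10)))
    return gate.peak


if __name__ == "__main__":
    print("peak in-flight:", asyncio.run(demo()))
```

The gate is created inside the running coroutine so the semaphore is bound to the active event loop on older Python versions as well. Note that the timeout covers only the upstream call itself, not the time spent waiting for a slot.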

The code isn’t public yet; the harness, mock API, and TUI are being cleaned up for release.