41× from one keyword · Swapnil Surdi

“We need to scale” almost always gets answered with hardware. Add a core, add a replica, add a bigger box — anything but measuring first. I wanted to know what happens if you refuse to add hardware at all: pin a Python API to one vCPU, leave it there, and find out how far software alone takes you and which knob each multiple comes from.

The headline number is a little absurd. On a fixed 1 vCPU budget, the same FastAPI service went from 1.68 requests per second to 69.6 — a 41× improvement — by changing one keyword. No new cores, no extra workers, no database tuning. Just def to async def. This is the study that produced that number, the diminishing ladder after it, and the scenario where the whole framing flips.

A caveat up front, because honesty is the point of a benchmark: this is a personal study, run on my own machine to understand the shape of the curve. The harness, the mock API, and the live TUI aren’t public yet — they’re being cleaned up for release. Treat the numbers as a careful, reproducible-by-me experiment, not a vendor datasheet.

The method, because the method is the result

The fastest way to lie with a benchmark is to change five things and credit the win to your favorite one. So the entire study is built around a single rule: change exactly one variable per stage, and replay the identical load against each.

The workload is a k6 script driving six weighted endpoint types — the mix you’d actually see in production rather than one hot path: light reads, DB-bound reads, writes, and endpoints that call an external API. A single hammered route would have flattered the async numbers; a realistic mix makes them earn it.

The external dependency is a Go mock API I wrote that serves responses on a lognormal latency distribution. This detail matters more than it looks. Real third-party latency has a long, ugly tail — a fast P50 and a P99 that’s multiples worse — and that tail is exactly what interacts with your concurrency limits. A constant sleep(200ms) teaches you nothing about it; a lognormal one behaves like a vendor you’ve actually had to live with.
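To make the lognormal point concrete, here is a minimal Python sketch (not the Go mock itself) that derives lognormal parameters from a target P50 and P99 and samples delays from them. The helper names are hypothetical, and the 4 s / 22 s targets are the ones quoted for the v6 scenario later in this post. It relies on two facts: the median of a lognormal is exp(mu), and the 99th percentile sits about 2.326 standard deviations above mu in log space.

```python
import math
import random
from statistics import NormalDist


def lognormal_params(p50_s: float, p99_s: float) -> tuple[float, float]:
    """Return (mu, sigma) of the underlying normal for the given P50 and P99."""
    z99 = NormalDist().inv_cdf(0.99)      # about 2.326
    mu = math.log(p50_s)                  # median of a lognormal is exp(mu)
    sigma = (math.log(p99_s) - mu) / z99  # spread needed to reach the P99 target
    return mu, sigma


def sample_latencies(p50_s: float, p99_s: float, n: int, seed: int = 0) -> list[float]:
    """Draw n simulated upstream delays, in seconds."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(p50_s, p99_s)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def percentile(sorted_values: list[float], q: float) -> float:
    """Nearest-rank percentile on an already-sorted list."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]


if __name__ == "__main__":
    delays = sorted(sample_latencies(4.0, 22.0, 20_000))
    print(f"P50 ~ {percentile(delays, 0.50):.2f}s, P99 ~ {percentile(delays, 0.99):.2f}s")
```

A constant sleep would collapse both percentiles onto one value, which is exactly the tail behaviour the mock is meant to preserve.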

Each configuration runs to completion. k6’s handleSummary writes a per-config JSON file, and a merge script folds them all into one comparison table, so every rerun slots into the same place and there’s no spreadsheet archaeology. A small Go + bubbletea TUI watches runs live, so I could see a config collapse in real time instead of discovering it in a summary afterward. The storage layer is SQLite — which, it turns out, became a finding all by itself.
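The merge step could look roughly like the sketch below. It is an assumption-heavy illustration, not the actual script: it assumes each per-config file is named after its configuration (for example a hypothetical v1.json) and contains k6's default summary object, where the request rate sits under metrics.http_reqs.values.rate and the 95th-percentile duration under metrics.http_req_duration.values["p(95)"]. The function names are made up for the example.

```python
import json
from pathlib import Path


def load_summary(path: Path) -> dict:
    """Pull the headline numbers out of one per-config summary file."""
    metrics = json.loads(path.read_text())["metrics"]
    return {
        "config": path.stem,  # file name without .json, e.g. "v1"
        "rps": metrics["http_reqs"]["values"]["rate"],
        "p95_ms": metrics["http_req_duration"]["values"]["p(95)"],
    }


def merge_summaries(results_dir: Path) -> list[dict]:
    """Fold every *.json file in the directory into one list of rows."""
    return [load_summary(p) for p in sorted(results_dir.glob("*.json"))]


def render_table(rows: list[dict]) -> str:
    """Format the rows as a fixed-width comparison table."""
    lines = [f"{'config':<10}{'rps':>10}{'p95_ms':>12}"]
    for row in rows:
        lines.append(f"{row['config']:<10}{row['rps']:>10.2f}{row['p95_ms']:>12.1f}")
    return "\n".join(lines)
```

Calling render_table(merge_summaries(Path("results"))) on a hypothetical results directory prints one row per configuration. Because rows are keyed by file name, a rerun of one config overwrites its own row and nothing else.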

The ladder

Same machine, same vCPU, same six-endpoint load. Six configurations:

v1 — sync handlers: 1.68 RPS. The baseline everyone ships by accident. Synchronous handlers doing blocking I/O serialize the entire service: while one request waits on the database or the slow upstream, nothing else moves. With a long-tailed dependency in the mix, you’re effectively processing one request at a time, and the throughput shows it.

v2 — async handlers: 69.6 RPS. Change def to async def, await the I/O, and the single event loop overlaps every wait. While one request is parked waiting on the upstream, dozens of others make progress. The service stops being limited by blocking and starts being limited by work — and on this workload that’s a 41× jump from zero new hardware. This is the entire thesis of the study in one step: the first and largest multiple was never a hardware problem. It was a concurrency-model problem wearing a hardware costume.

v3 — two uvicorn workers: 81 RPS. Now I start turning the hardware-shaped knobs. Two worker processes on one vCPU buy some overlap of Python’s CPU work with I/O waits, worth about 16% more throughput than v2 (69.6 to 81 RPS). Worth having. Not another multiple. The async event loop had already captured most of the available concurrency; a second process just trims the edges.

v4 — gunicorn + UvicornWorker: 90 RPS. The standard production topology, gunicorn supervising Uvicorn workers, adds process management and a further gain of about 11% over v3. This is the ceiling I found on one vCPU: 90 RPS, up from 1.68. Everything from here is single-digit percentages or regressions.

v5 — SQLite WAL + pragma tuning: 85 RPS. A tuned configuration that came in below v4 — and it’s the most useful data point in the study, precisely because it went the wrong way. I’d wanted to measure connection-pool sizing here. It turns out that knob doesn’t exist on this stack: aiosqlite runs through SQLAlchemy’s NullPool, so every pool parameter I set was silently a no-op. SQLite simply can’t express pool-size effects — there’s no connection pool to size. The experiment I wanted to run needs Postgres; SQLite was structurally incapable of answering the question, and the regression is what exposed that.

v6 — realistic external latency (designed, run pending). The mock API reconfigured to P50 4s / P99 22s — real numbers from a slow upstream I’ve actually shipped against. This scenario exists to demonstrate the punchline, which deserves its own section.
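The mechanism behind the v1 to v2 step in the ladder above can be reproduced in miniature with nothing but asyncio. This is a toy model, not the benchmarked service: it imitates the serialized case by blocking the event loop with time.sleep rather than reproducing how FastAPI dispatches plain def handlers, and the 50 ms delay is an arbitrary stand-in for one slow I/O call.

```python
import asyncio
import time

UPSTREAM_DELAY_S = 0.05  # hypothetical stand-in for one slow I/O call


async def blocking_handler() -> None:
    # time.sleep holds the thread, so the event loop cannot run anything else.
    time.sleep(UPSTREAM_DELAY_S)


async def awaiting_handler() -> None:
    # Awaiting yields to the event loop, so other requests progress meanwhile.
    await asyncio.sleep(UPSTREAM_DELAY_S)


async def time_burst(handler, n: int) -> float:
    """Run n simulated requests concurrently and return the wall time in seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start


def compare(n: int = 20) -> tuple[float, float]:
    """Return (blocking_seconds, awaiting_seconds) for a burst of n requests."""
    blocking = asyncio.run(time_burst(blocking_handler, n))
    awaiting = asyncio.run(time_burst(awaiting_handler, n))
    return blocking, awaiting


if __name__ == "__main__":
    blocking, awaiting = compare()
    print(f"blocking: {blocking:.2f}s  awaiting: {awaiting:.2f}s")
```

The blocking burst takes roughly n times the delay because the waits run back to back. The awaiting burst takes roughly one delay because the waits overlap. The gap grows with both the burst size and the length of each wait.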

The punchline: at 4-second upstreams, the bottleneck inverts

Here’s the thing the ladder is secretly building toward. Every number above was taken against a fast mock. Crank the upstream’s P50 to 4 seconds — a completely ordinary number for a third-party API — and the entire bottleneck moves.

When your dependency answers in milliseconds, the constraint is roughly CPU and event-loop throughput: how fast can this core actually do work. When your dependency takes 4 seconds, the CPU goes idle. It’s not doing work; it’s waiting. And capacity stops being “RPS per core” and becomes how many requests you can park mid-await at once — how many in-flight connections, how many concurrency slots, before something — a semaphore, a connection cap, a timeout — says no.
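The shift from "RPS per core" to "requests parked at once" is Little's law: requests in flight equal throughput times time in the system. A small sketch of that arithmetic follows. It uses the 4 s figure from this post as the per-request latency and the v2 throughput of 69.6 RPS as a target, while the cap of 100 concurrency slots is purely a hypothetical value for illustration.

```python
def required_in_flight(target_rps: float, latency_s: float) -> float:
    """Little's law: in-flight requests = throughput * time each request is held."""
    return target_rps * latency_s


def max_rps(concurrency_limit: int, latency_s: float) -> float:
    """Throughput ceiling imposed by a fixed number of concurrency slots."""
    return concurrency_limit / latency_s


if __name__ == "__main__":
    # Holding v2's 69.6 RPS while each request waits 4 s on the upstream.
    print(f"in flight needed: {required_in_flight(69.6, 4.0):.1f}")
    # A hypothetical cap of 100 slots (semaphore, connection limit, and so on).
    print(f"ceiling at 100 slots: {max_rps(100, 4.0):.1f} RPS")
```

Treating the P50 as the latency is itself a simplification. With a long-tailed distribution the mean is higher than the median, so the real in-flight requirement is larger than this estimate.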

At that point the concurrency model isn’t a factor, it’s the factor. A sync service melts: every 4-second wait blocks a worker completely, and you run out of workers almost immediately. An async service parks thousands of requests on the event loop for the cost of a coroutine each, and your real limits become semaphores, timeouts, and connection ceilings — not cores. RPS-per-core stops being the right unit entirely.
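To show what "semaphores, timeouts, and connection ceilings" look like as the limits that actually bind, here is a minimal asyncio sketch. The upstream call is simulated with asyncio.sleep, and the function names, slot count, and timeout values are all assumptions for the example, not settings from the study.

```python
import asyncio


async def call_upstream(delay_s: float) -> str:
    """Simulated slow dependency; a real service would await an HTTP client here."""
    await asyncio.sleep(delay_s)
    return "ok"


async def guarded_call(sem: asyncio.Semaphore, delay_s: float, timeout_s: float) -> str:
    # The semaphore caps how many requests are parked on the upstream at once.
    async with sem:
        try:
            # The timeout clock starts only after a slot has been acquired.
            return await asyncio.wait_for(call_upstream(delay_s), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "timeout"


async def run_burst(delays: list[float], max_in_flight: int, timeout_s: float) -> list[str]:
    """Fire one guarded call per delay and return the outcomes in order."""
    sem = asyncio.Semaphore(max_in_flight)
    return await asyncio.gather(*(guarded_call(sem, d, timeout_s) for d in delays))


if __name__ == "__main__":
    outcomes = asyncio.run(run_burst([0.01, 0.01, 0.2], max_in_flight=2, timeout_s=0.05))
    print(outcomes)
```

Each parked request costs one coroutine and one semaphore slot, not a worker. What decides capacity in this sketch is the slot count and the timeout, which is the sense in which cores stop being the unit that matters.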

This is the inversion most “we scaled X” posts never reach, because their benchmark used a fast or constant-latency dependency and never left the regime where CPU is king. The regime where most real systems actually live — behind a slow upstream — is the one where the concurrency model is everything and the hardware barely matters. That’s why v6 is a designed scenario and not an afterthought: the study’s end state isn’t “bigger number,” it’s “the bottleneck moved, and here’s where to.”

What I’m keeping

Async correctness first, topology second, hardware approximately never. The shape of the whole curve is the lesson: 1.68 → 69.6 was the concurrency model; 69.6 → 90 was process topology; hardware contributed nothing because I never added any. If a Python API is slow, that’s the order of operations — fix the concurrency model, then the worker topology, and only then start a conversation about machines. Most teams do it exactly backwards and buy a bigger box to paper over a def that should have been async def.
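The shape of that curve is easy to recompute from the five measured rungs reported above. The sketch below only does arithmetic on those published RPS figures. v6 is left out because its run is still pending.

```python
# Measured throughput per configuration, as reported in the ladder (requests/second).
LADDER = [
    ("v1 sync handlers", 1.68),
    ("v2 async handlers", 69.6),
    ("v3 two uvicorn workers", 81.0),
    ("v4 gunicorn + UvicornWorker", 90.0),
    ("v5 SQLite WAL + pragma tuning", 85.0),
]


def step_changes(ladder):
    """Return (name, rps, ratio_vs_previous, ratio_vs_baseline) for each rung."""
    baseline = ladder[0][1]
    rows = []
    previous = None
    for name, rps in ladder:
        step = None if previous is None else rps / previous
        rows.append((name, rps, step, rps / baseline))
        previous = rps
    return rows


if __name__ == "__main__":
    for name, rps, step, total in step_changes(LADDER):
        step_text = "baseline" if step is None else f"x{step:.2f} vs previous"
        print(f"{name:<32}{rps:>7.2f} RPS  {step_text:<22} x{total:.1f} vs v1")
```

Laid out this way, the first step is a multiple of roughly 41, the next two are gains in the 10 to 20 percent range, and the last is a small regression.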

One variable per stage, or your numbers are fiction. The 41× claim is only credible because v1 and v2 differ in nothing but async. The moment you change the code and the worker count and the database config together and report one number, you’ve measured your own enthusiasm.

A regression that exposes a wrong question beats an increment. v5 dropping below v4 taught me more than another small gain would have: it proved my storage layer couldn’t answer the question I was asking, and pointed at exactly what to swap to ask it properly. Benchmarks are supposed to surprise you; when one does, that’s the data, not a setback.

The number that started this was one keyword and a 41× jump. The thing I actually walked away with is subtler: throughput on a fixed budget is a property of your concurrency model far more than your hardware — and once there’s a slow dependency in the picture, it’s a property of almost nothing else.