Why We Chose Python and FastAPI for Our AI Backend
When we started building Harbor Software’s core inference platform in late 2021, the backend framework decision felt unusually consequential. We were building something that needed to serve ML model predictions with sub-200ms latency, handle concurrent long-running inference jobs, and remain approachable enough that a team of four engineers could maintain it without a dedicated DevOps hire. After evaluating Flask, Django, Express.js, and Go’s standard library, we landed on Python with FastAPI. Here is why, and what we learned in the first six months of production.
The Constraint That Shaped Everything
Our primary constraint was not performance. It was ecosystem compatibility. Every major ML library we needed—PyTorch, Hugging Face Transformers, scikit-learn, NumPy, pandas—is Python-first. When you pick a non-Python backend language, you introduce a serialization boundary between your API layer and your model inference layer. That boundary costs you latency, debugging complexity, and deployment headaches.
We briefly considered a polyglot architecture: Go or Rust for the API gateway, calling into Python microservices for inference. On paper this looks elegant. In practice, for a team of four, it means maintaining two build systems, two dependency management strategies, two sets of Docker images, and debugging serialization bugs at 2 AM when a NumPy array doesn’t round-trip cleanly through gRPC. We had seen this pattern fail at a previous company where the Go-to-Python boundary became the source of 40% of production incidents—not because either language was at fault, but because the serialization layer between them was fragile and under-tested. Edge cases in NumPy dtype handling, timezone-naive datetime objects, and NaN values in float arrays all caused silent data corruption that only surfaced hours later in downstream analytics.
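Two of those failure modes are easy to reproduce with nothing but the Python standard library. The sketch below is illustrative, not code from that earlier system: it assumes JSON as the wire format and shows how NaN values and timezone-naive datetimes behave when they cross a serialization boundary.

```python
import json
from datetime import datetime, timezone

# NaN: Python's json module emits the bare token NaN, which is not valid
# JSON, so a strict parser on the other side of the boundary may reject it.
encoded = json.dumps({"score": float("nan")})
print(encoded)  # {"score": NaN}

# Asking for strict output turns the silent problem into a loud one.
try:
    json.dumps({"score": float("nan")}, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# Naive datetimes: the ISO string carries no UTC offset, so the receiver
# has to guess which timezone was meant.
naive = datetime(2021, 11, 5, 2, 0, 0)
aware = naive.replace(tzinfo=timezone.utc)
print(naive.isoformat())  # 2021-11-05T02:00:00
print(aware.isoformat())  # 2021-11-05T02:00:00+00:00
```

Keeping the API layer and the inference layer in one process avoids this class of problem because the values never leave Python.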
The calculation was straightforward: keep the entire stack in Python, pick the fastest Python framework available, and optimize at the infrastructure level (caching, async I/O, worker pools) rather than the language level. Python is not the fastest language, but it is the language where our entire ML toolchain lives natively, and that compatibility advantage outweighs raw throughput for our use case. When your inference pipeline involves torch tensors, Hugging Face tokenizers, and scikit-learn preprocessors, running all of that in the same process as your HTTP handler eliminates an entire category of integration bugs.
Why FastAPI Over Flask and Django
Flask was the obvious default. We had used it on three previous projects. But Flask has two fundamental limitations for ML serving:
- No ASGI-native async. Flask 2.0 accepts async def views, but its WSGI foundation means every request still occupies a worker for its full duration. When you are running inference jobs that take 500ms–2s, you exhaust your worker pool fast. Yes, you can bolt on gevent or use Quart, but at that point you are fighting the framework rather than using it. We tested gevent-patched Flask and found it introduced subtle bugs with PyTorch’s threading model: model loading would occasionally deadlock under concurrent requests because gevent’s monkey-patching conflicted with PyTorch’s internal thread pool.
- No built-in request validation. Flask leaves input validation entirely to you. For an ML API where payloads contain nested objects with specific tensor shapes, numeric ranges, and enum constraints, this means writing hundreds of lines of validation boilerplate or pulling in marshmallow/cerberus and wiring it up manually. On a previous project, we spent three weeks building and maintaining a custom validation layer for Flask that FastAPI provides out of the box.
Django REST Framework was overkill. We did not need an ORM, admin panel, or template engine. DRF brings all of Django with it, along with a mental model designed for CRUD applications, not inference pipelines. The Django ORM is excellent for relational data but irrelevant when your primary data path is “receive JSON, run tensor through model, return JSON.” We would be paying the complexity cost of a full web framework without using its primary features.
FastAPI solved both problems natively:
from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import List

app = FastAPI()

class InferenceRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=10000)
    model_id: str = Field(..., pattern="^[a-z0-9-]+$")
    top_k: int = Field(default=5, ge=1, le=100)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)

class PredictionResult(BaseModel):
    label: str
    score: float
    metadata: dict

class InferenceResponse(BaseModel):
    predictions: List[PredictionResult]
    latency_ms: float
    model_version: str

@app.post("/predict", response_model=InferenceResponse)
async def predict(request: InferenceRequest):
    # Pydantic validates everything before this line executes
    result = await run_inference(request)
    return result
That Pydantic integration is not just syntactic sugar. It generates OpenAPI documentation automatically, validates every incoming request against the schema, and serializes responses with correct types. Our API documentation was production-ready from day one without writing a single line of Swagger YAML. When we onboarded our first enterprise customer, they had a working integration within two hours because the auto-generated documentation was complete and accurate. With Flask, producing equivalent documentation would have required maintaining a separate OpenAPI spec file manually.
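To make the validation behavior concrete, here is a minimal sketch that exercises the request schema directly, outside of any HTTP handling. It assumes Pydantic v2 (the pattern keyword is v2 syntax); the helper failing_fields and both payloads are hypothetical, written only for this illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class InferenceRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=10000)
    model_id: str = Field(..., pattern="^[a-z0-9-]+$")
    top_k: int = Field(default=5, ge=1, le=100)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)


def failing_fields(payload: dict) -> list:
    """Return the names of fields that fail validation (empty if valid)."""
    try:
        InferenceRequest(**payload)
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


# Hypothetical payloads, not taken from production traffic.
print(failing_fields({"text": "hello", "model_id": "sentiment-v2"}))  # []
print(failing_fields({"text": "", "model_id": "Bad ID", "top_k": 0}))
```

The second payload breaks three constraints at once (empty text, a model_id with an uppercase letter and a space, and top_k below 1), and all three are reported together. Inside a FastAPI route, the same failures come back to the client as a 422 response without the handler body ever running.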
Async Performance in Practice
FastAPI runs on Starlette and ASGI (Asynchronous Server Gateway Interface), which means it handles I/O-bound operations without blocking workers. This matters enormously for ML serving because a typical request involves:
1. Receiving the HTTP request (I/O)
2. Loading or retrieving a cached model (I/O or CPU)
3. Running inference (CPU-bound)
4. Logging the result to a database (I/O)
5. Returning the response (I/O)
Steps 1, 2, 4, and 5 are I/O-bound and benefit directly from async. Step 3—the actual inference—is CPU-bound and needs to be offloaded. We use a combination of asyncio.to_thread() for lightweight models and a dedicated process pool for GPU inference:
import asyncio
from concurrent.futures import ProcessPoolExecutor

gpu_pool = ProcessPoolExecutor(max_workers=2)

async def run_inference(request: InferenceRequest):
    # get_running_loop() is the recommended call inside a coroutine
    loop = asyncio.get_running_loop()
    # Offload GPU work to process pool; _sync_inference must be a
    # module-level function so it can be pickled for the worker process
    result = await loop.run_in_executor(
        gpu_pool,
        _sync_inference,
        request.text,
        request.model_id,
    )
    return result
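For the lightweight-model path, asyncio.to_thread() follows the same idea with less ceremony. The sketch below is a stand-in, not our production code: _sync_light_inference is a hypothetical blocking function that sleeps instead of running a model, and heartbeat exists only to show that the event loop keeps running while the blocking call is in flight.

```python
import asyncio
import time


def _sync_light_inference(text: str) -> dict:
    """Stand-in for a small blocking model call (hypothetical)."""
    time.sleep(0.2)  # simulates blocking work that releases the GIL
    return {"label": "positive", "score": 0.9, "chars": len(text)}


async def heartbeat(ticks: list) -> None:
    """Records a tick every 50 ms to show the event loop is still running."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main() -> tuple:
    ticks: list = []
    # The blocking call runs in a worker thread; heartbeat runs on the loop.
    result, _ = await asyncio.gather(
        asyncio.to_thread(_sync_light_inference, "great product"),
        heartbeat(ticks),
    )
    return result, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Note that asyncio.to_thread() requires Python 3.9 or later, and that it only helps when the blocking call releases the GIL for most of its duration.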
In our benchmarks using wrk with 100 concurrent connections, FastAPI with uvicorn handled 1,847 requests/second on a mixed workload (50% cache hits, 50% inference). The equivalent Flask setup with gunicorn and gevent managed 612 requests/second. That is a 3x difference with zero algorithmic optimization—purely from the framework’s async foundation. Under sustained load over 10-minute test runs, FastAPI maintained consistent p99 latency of 180ms while Flask’s p99 climbed to 450ms as gevent’s scheduling overhead accumulated.
The process pool approach deserves additional explanation. We initially tried asyncio.to_thread() for all inference, but contention on Python’s GIL under concurrent inference meant that two simultaneous predictions took nearly as long as running them sequentially. The process pool gives true parallelism (separate Python interpreters, no GIL contention) at the cost of inter-process serialization overhead. For our typical payload sizes (1–10 KB), that serialization overhead is under 1ms, which is negligible compared to inference time.
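ProcessPoolExecutor pickles arguments and return values to move them between processes, so the pickling cost is easy to estimate on your own hardware. The sketch below uses a hypothetical 10 KB text payload and measures only the pickle round trip, not the transfer between processes, so treat the number it prints as a lower bound.

```python
import pickle
import time

# Hypothetical 10 KB text payload, the upper end of the range quoted above.
payload = {"text": "x" * 10_000, "model_id": "sentiment-v2"}

rounds = 1_000
start = time.perf_counter()
for _ in range(rounds):
    # Serialize and deserialize, as happens on each side of the pool.
    restored = pickle.loads(pickle.dumps(payload))
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"mean pickle round trip: {elapsed_ms / rounds:.4f} ms")
```

Payloads that carry large arrays or tensors rather than short strings will behave differently, so it is worth re-running a measurement like this with realistic inputs.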
Dependency Injection That Actually Works
FastAPI’s dependency injection system turned out to be more valuable than we initially expected. In an ML serving context, you frequently need to inject configuration, database connections, model registries, and feature stores into your route handlers. FastAPI’s Depends() system handles this cleanly:
from fastapi import Depends

# ModelRegistry and FeatureStore are application classes defined elsewhere.

async def get_model_registry():
    """Singleton model registry with lazy initialization."""
    if not hasattr(get_model_registry, "_instance"):
        get_model_registry._instance = ModelRegistry(
            cache_dir="/models",
            max_cached_models=10,
        )
    return get_model_registry._instance

async def get_feature_store(registry=Depends(get_model_registry)):
    """Feature store that depends on model registry for schema info."""
    return FeatureStore(schema_source=registry)

@app.post("/predict")
async def predict(
    request: InferenceRequest,
    registry: ModelRegistry = Depends(get_model_registry),
    features: FeatureStore = Depends(get_feature_store),
):
    # FastAPI resolves both dependencies before calling the handler
    model = await registry.get(request.model_id)
    enriched_input = await features.enrich(request.text)
    return await model.predict(enriched_input)
This is not a toy example. Our production code follows exactly this pattern. Dependencies compose, they can be async, and they are trivially mockable in tests. Compare this with Flask’s g object or Django’s middleware stack, and the ergonomic advantage is significant. In our test suite, we override dependencies with mocks using FastAPI’s app.dependency_overrides dictionary, which means integration tests can swap out the real model registry for a fake one in a single line. This made our test suite 8x faster because tests do not need to load actual ML models.
The Pydantic Tax and How We Optimized It
Pydantic is not free. Validation adds latency, and for high-throughput endpoints it is measurable. We profiled our hot path and found that Pydantic validation consumed 2.3ms per request for our most complex schema (a nested object with 47 fields across three levels). For our p99 latency target of 200ms, that is roughly 1% of the budget.
We made three optimizations over time:
- Used model_config = ConfigDict(validate_assignment=False) on response models. We trust our own inference output; validating it on the way out is wasteful. This alone saved 0.8ms per request on our response-heavy endpoints.
- Switched to Pydantic v2 as soon as it stabilized. The Rust-backed validation core reduced our per-request validation overhead from 2.3ms to 0.4ms, an 82% reduction. The migration took about a day because Pydantic v2 has some breaking changes in validator syntax, but the performance gain was substantial.
- Pre-compiled validators for hot paths. For our highest-throughput endpoint (embedding generation, which receives simple text payloads), we replaced the Pydantic model with a hand-written validator that skips schema introspection. This shaved another 0.2ms off the hot path. We only did this for one endpoint; the rest use Pydantic because the ergonomic benefits outweigh the microsecond-level overhead.
For most applications, the Pydantic overhead is negligible. If you are serving tens of thousands of requests per second and every microsecond matters, consider validating only on ingress and using plain dataclasses or TypedDicts internally.
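As a rough illustration of validating once at ingress and passing a plain dataclass around afterwards, consider the sketch below. It is not our production validator: EmbeddingJob, parse_embedding_request, and the field limits (borrowed from the InferenceRequest schema earlier in this post) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingJob:
    """Internal representation; constructed only from validated input."""
    text: str
    model_id: str


def parse_embedding_request(payload: dict) -> EmbeddingJob:
    """Hand-written ingress check for a simple text payload.

    The field names and the 10,000-character limit are assumptions
    borrowed from the InferenceRequest schema earlier in the post.
    """
    text = payload.get("text")
    model_id = payload.get("model_id")
    if not isinstance(text, str) or not 1 <= len(text) <= 10_000:
        raise ValueError("text must be a string of 1 to 10,000 characters")
    if not isinstance(model_id, str) or not model_id:
        raise ValueError("model_id must be a non-empty string")
    return EmbeddingJob(text=text, model_id=model_id)


job = parse_embedding_request({"text": "hello", "model_id": "embed-small"})
print(job)
```

The trade-off is that a hand-written check like this produces no OpenAPI schema and no structured error list, which is why it only makes sense on a hot path with a very simple payload.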
Deployment: Uvicorn, Gunicorn, and Docker
Our production deployment runs uvicorn behind gunicorn with uvloop:
# Dockerfile excerpt
# A multi-line CMD needs a trailing backslash on every continued line
CMD ["gunicorn", "app.main:app", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--workers", "4", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "120", \
     "--graceful-timeout", "30", \
     "--access-logfile", "-"]
We run 4 gunicorn workers, each with its own uvicorn event loop. This gives us process-level isolation (one worker crashing does not take down the others) while maintaining async concurrency within each worker. The --timeout 120 is deliberately high because some inference jobs (batch processing, large document analysis) can legitimately take 60–90 seconds.
One lesson learned: do not use uvicorn’s --reload flag in production containers. We saw intermittent 502s in staging that traced back to file watchers triggering spurious reloads inside Docker when volume mounts updated. This took us two days to diagnose because the reload happened silently—the worker restarted, dropped in-flight requests, and the load balancer retried them, making it look like random latency spikes rather than a configuration issue.
We also learned to set --graceful-timeout 30 to give in-flight inference requests time to complete during deployments. Without this, a deployment kills workers immediately, and any inference request in progress returns a 502. With a 30-second graceful timeout, the old worker finishes serving current requests before shutting down. Rolling deployments (restarting one worker at a time) combined with graceful shutdown give us zero-downtime deployments.
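The same settings can also live in a gunicorn configuration file, which is itself a Python module. This is an alternative layout rather than what our Dockerfile does; the file name and location are assumptions, while the setting names are gunicorn's own.

```python
# gunicorn.conf.py (hypothetical file name and location)
# Mirrors the command-line flags from the Dockerfile excerpt above.
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4
bind = "0.0.0.0:8000"
timeout = 120           # seconds before an unresponsive worker is killed
graceful_timeout = 30   # seconds workers get to finish after a restart signal
accesslog = "-"         # "-" sends the access log to stdout
```

With this file in place the container command shrinks to gunicorn -c gunicorn.conf.py app.main:app, and the timeouts become easier to review in a diff.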
Middleware and Cross-Cutting Concerns
FastAPI’s middleware system (inherited from Starlette) handles cross-cutting concerns cleanly. Our production middleware stack includes:
from starlette.middleware.cors import CORSMiddleware
from fastapi import FastAPI
import structlog
import time
import uuid

app = FastAPI()
logger = structlog.get_logger()

# CORS for browser clients
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.harborsoftware.com"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.middleware("http")
async def add_request_id(request, call_next):
    # Attach a unique ID to the request and echo it in the response headers
    request_id = str(uuid.uuid4())
    request.state.request_id = request_id
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    return response

@app.middleware("http")
async def log_requests(request, call_next):
    # Time the full request and emit one structured log line per request
    start = time.perf_counter()
    response = await call_next(request)
    duration = (time.perf_counter() - start) * 1000
    logger.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        duration_ms=round(duration, 2),
        request_id=request.state.request_id,
    )
    return response
The request ID middleware is invaluable for debugging. Every log line includes the request ID, so you can trace a single request through the entire system—from ingress, through model loading, inference, and response serialization—by searching for that ID. When a customer reports a slow response, they give us the request ID from the response header, and we can reconstruct exactly what happened.
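One way to get the request ID onto every log line, including lines emitted deep inside model-loading or inference code, is to hold it in a context variable. The sketch below uses only the standard library for self-containment, so it is an approximation of the idea and not our structlog setup; request_id_var, RequestIdFilter, and handle_request are hypothetical names.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled in this task.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


stream = io.StringIO()  # in-memory sink so the example is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())

logger = logging.getLogger("inference-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request() -> str:
    """Simulates one request: bind an ID, log from nested code, unbind."""
    request_id = str(uuid.uuid4())
    token = request_id_var.set(request_id)
    try:
        logger.info("model loaded")
        logger.info("inference finished")
    finally:
        request_id_var.reset(token)
    return request_id


rid = handle_request()
print(stream.getvalue())
```

In a real service the set and reset calls would sit in the request ID middleware, so that code further down the call stack never has to pass the ID around explicitly.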
What We Would Do Differently
Six months in, there are four things we would change:
- Start with structured logging from day one. We used Python’s standard logging module initially and spent two weeks migrating to structlog when we needed JSON-formatted logs for our observability stack. FastAPI does not prescribe a logging solution, and we should have been more deliberate about this upfront. Structured logging is not a nice-to-have; it is essential for any service that handles more than a few requests per minute.
- Define error response schemas explicitly. FastAPI’s default error responses (422 for validation, 500 for unhandled exceptions) are fine for development but inadequate for production. We now have a custom exception handler that returns consistent error envelopes with error codes, human-readable messages, and trace IDs. This took a day to implement and should have been done in week one.
- Use lifespan events instead of startup/shutdown hooks. FastAPI’s newer lifespan context manager pattern is cleaner than the deprecated @app.on_event("startup") decorator, especially for managing model loading and connection pool lifecycle. The context manager approach makes resource cleanup more reliable because it uses Python’s native context management protocol.
- Set up health check endpoints immediately. We added /health and /ready endpoints in week three after our load balancer was routing traffic to containers that had not finished loading models. A /ready endpoint that checks whether models are loaded would have prevented several early production incidents.
Conclusion
FastAPI was the right choice for Harbor Software’s AI backend. It gave us async performance, automatic API documentation, rigorous request validation, and full compatibility with the Python ML ecosystem—all without the overhead of a full-stack framework we did not need. The framework has matured considerably since we adopted it, and with Pydantic v2’s performance improvements, the remaining trade-offs (slight validation overhead, younger ecosystem compared to Flask) are increasingly marginal. Our API serves 50,000 inference requests per day with p99 latency under 200ms on modest hardware (4 vCPUs, 16 GB RAM), which would not be achievable with a synchronous framework without significantly more infrastructure investment.
If you are building an ML-serving API and your team already knows Python, FastAPI should be at the top of your shortlist. The productivity gains from Pydantic integration alone will save you weeks of validation boilerplate, and the async foundation will save you from painful scaling rewrites down the road. Start simple—a single file with one endpoint is fine—and let the framework’s dependency injection and middleware systems scale with you as the application grows.
Appendix: Benchmark Methodology
For teams considering their own evaluation, here is how we ran our benchmarks. We used wrk version 4.2.0 on an Ubuntu 22.04 machine with 8 vCPUs and 32 GB RAM. The test machine was separate from the server to avoid resource contention. Each benchmark ran for 60 seconds with 4 threads and 100 concurrent connections. We tested three workload profiles: pure health-check (baseline, no model work), cached inference (model result retrieved from Redis), and live inference (actual model computation). The numbers we quote throughout this post are from the cached inference profile, which represents the most common production workload where approximately 50% of requests hit the model cache.
To reproduce: install wrk, deploy FastAPI with uvicorn behind gunicorn (4 workers), deploy Flask with gunicorn and gevent (4 workers), and run wrk -t4 -c100 -d60s http://localhost:8000/predict with a fixed request body, replacing localhost with the server’s address when the load generator runs on a separate machine, as ours did. Because wrk issues GET requests by default, the POST method and body have to be supplied through a Lua script passed with -s. We repeated each test 5 times and report the median. The raw benchmark data and scripts are available in our internal wiki, and we plan to publish a more detailed comparison in a future post that includes Go and Rust baselines for context. The key finding that held across all test configurations was that FastAPI’s async architecture handled concurrent mixed workloads fundamentally better than Flask’s synchronous-with-patches approach, with the gap widening as concurrency increased.
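For the "repeat 5 times and report the median" step, a small helper along these lines can aggregate saved wrk output. It is a sketch, not the script from our wiki: it relies on the Requests/sec line that wrk prints in its summary, and the sample strings below are made up purely to exercise the parser.

```python
import re
import statistics

# wrk prints a summary line of the form "Requests/sec:   1234.56".
_RPS_LINE = re.compile(r"^Requests/sec:\s+([\d.]+)", re.MULTILINE)


def parse_requests_per_sec(wrk_output: str) -> float:
    """Extract the Requests/sec figure from one wrk run's stdout."""
    match = _RPS_LINE.search(wrk_output)
    if match is None:
        raise ValueError("no Requests/sec line found in wrk output")
    return float(match.group(1))


def median_rps(outputs: list) -> float:
    """Median throughput across repeated runs."""
    return statistics.median(parse_requests_per_sec(o) for o in outputs)


# Made-up outputs purely to exercise the parser; not measured results.
fake_runs = [
    f"Requests/sec:   {rps:.2f}\n" for rps in (90.0, 110.0, 100.0, 95.0, 105.0)
]
print(median_rps(fake_runs))  # 100.0
```

Reporting the median instead of the mean keeps a single noisy run, such as one that coincides with a cache warm-up, from skewing the headline figure.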