From Monolith to Microservices: A Pragmatic Migration Path

Everyone has an opinion about microservices. The internet is full of conference talks about how Netflix split their monolith and blog posts about how you should not split yours. Both sides are right, depending on context. At Harbor Software, we completed a 14-month migration of a 180,000-line Python monolith into 11 services, and the result is genuinely better — but only because we followed a path that most microservices migration guides skip entirely. Here is the actual playbook, including the parts that felt wrong at the time but turned out to be correct.

HARBOR SOFTWARE · Engineering Insights

Why We Migrated (and Why Most Reasons Are Wrong)

The typical justification for microservices is “scalability.” This is almost always a premature optimization. Our monolith handled 12,000 requests per second on four application servers. Scaling was not the problem. The actual problems were deployment coupling and team contention.

Deployment coupling meant that a one-line change to the billing module required deploying the entire application, which included the ML inference pipeline, the admin dashboard, the webhook processor, and the API gateway. Our deployment took 18 minutes, and a failed canary rollback took another 12. When three teams are shipping multiple times per day, that 30-minute deployment window becomes a bottleneck. We tracked this: in our worst month, teams waited an average of 47 minutes per deploy due to queue contention on the deployment pipeline.

Team contention was worse. Seven engineers working in the same codebase meant constant merge conflicts in shared modules. The models/ directory alone generated 23 merge conflicts in a single sprint. Code review became a bottleneck because reviewers needed context about the entire application to evaluate changes safely. A change to the user model’s serialization format broke the billing module’s invoice generator because they shared a serializer that neither team owned.

These are the right reasons to migrate: when your deployment frequency is limited by coupling, and when your teams cannot work independently. “Our service needs to scale” is almost never the right reason — you can scale a monolith vertically and horizontally with load balancers, read replicas, and caching layers for a fraction of the operational cost of microservices.

There is a third reason that is less discussed but equally valid: regulatory compliance. Our billing service needed to meet PCI-DSS requirements, which mandated specific access controls, audit logging, and encryption at rest. Applying PCI-DSS controls to the entire monolith was impractical and expensive. Extracting the billing module into its own service with its own infrastructure allowed us to scope PCI-DSS compliance to a much smaller surface area. This alone saved us $40,000/year in compliance audit costs.

Step 1: Draw the Boundaries Before Writing Any Code

The most critical phase of a microservices migration happens before you write a single line of code. You need to identify service boundaries, and getting them wrong is catastrophically expensive to fix later. A misdrawn boundary creates a distributed monolith — all the operational complexity of microservices with none of the independence benefits.

We used a technique called domain event storming. The entire engineering team (plus product managers) spent two days in a room with orange sticky notes, mapping every business event in the system: “user registered,” “invoice generated,” “model training started,” “webhook received.” We grouped events by the data they needed access to and the team that owned the business logic. The groups that emerged became our service boundaries.

The rule we followed: if two domains need to read and write the same database table in real time, they belong in the same service. If they only need to react to changes in the other domain’s data, they can be separate services connected by events. This rule prevented us from creating chatty services that make synchronous calls to each other for every request. We violated this rule once (the auth service and the billing service both needed to write to a shared user_entitlements table) and spent 6 weeks untangling the resulting data consistency bugs.
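The rule above can be checked mechanically once you have an inventory of which domain touches which table. The following is a minimal sketch, not the tooling used during the migration: the `TableAccess` structure, the domain names, and the sample inventory are hypothetical, and it simplifies the rule to "two domains that both write the same table belong together."

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TableAccess:
    domain: str   # candidate service domain (hypothetical names)
    table: str    # database table the domain touches
    writes: bool  # True if the domain writes the table, False if read-only


def must_share_service(accesses):
    """Return (domain, domain, table) triples where both domains write the table.

    Per the rule, such pairs belong in one service. Domains that only read
    another domain's data can be split and connected by events instead.
    """
    writers = defaultdict(set)
    for access in accesses:
        if access.writes:
            writers[access.table].add(access.domain)
    pairs = set()
    for table, domains in writers.items():
        for left, right in combinations(sorted(domains), 2):
            pairs.add((left, right, table))
    return sorted(pairs)


# Hypothetical inventory modeled on the user_entitlements case.
accesses = [
    TableAccess("auth", "users", writes=True),
    TableAccess("billing", "users", writes=False),
    TableAccess("auth", "user_entitlements", writes=True),
    TableAccess("billing", "user_entitlements", writes=True),
]
conflicts = must_share_service(accesses)
print(conflicts)
```

Running a check like this against the real table inventory before drawing boundaries would have flagged the shared user_entitlements table up front.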

The boundaries we identified:

  1. Auth Service — user identity, sessions, API keys, OAuth
  2. Billing Service — subscriptions, invoices, payment processing, usage metering
  3. Inference Service — model serving, request routing, result caching
  4. Training Service — model training jobs, hyperparameter management, experiment tracking
  5. Data Pipeline Service — ingestion, transformation, feature store
  6. Webhook Service — outbound webhooks, retry logic, delivery tracking
  7. Notification Service — email, Slack, in-app notifications
  8. Admin Service — internal dashboards, content management
  9. API Gateway — rate limiting, request routing, API versioning
  10. Job Scheduler — cron jobs, background task orchestration
  11. Analytics Service — usage analytics, reporting, data exports

Step 2: The Strangler Fig Pattern (With a Twist)

The strangler fig pattern is well-documented: you build new functionality in a new service, route traffic to it, and gradually move existing functionality out of the monolith until the monolith is empty. What most guides omit is the mechanics of the routing layer and the critical importance of running both systems in parallel.

We used nginx as a request router that sat in front of both the monolith and the new services. The routing rules lived in a configuration file that we could update without deploying any application code:

# nginx routing: gradual migration
# (upstream and server blocks live inside the http context)
upstream monolith {
    server monolith-1:8000;
    server monolith-2:8000;
}

upstream billing_service {
    server billing-1:8001;
    server billing-2:8001;
}

upstream inference_service {
    server inference-1:8002;
    server inference-2:8002;
}

server {
    listen 80;  # assumed listener; adjust to your environment

    # Phase 1: Only billing routes go to the new service
    location /api/v1/billing/ {
        proxy_pass http://billing_service;
    }

    location /api/v1/invoices/ {
        proxy_pass http://billing_service;
    }

    # Phase 1: Everything else stays on the monolith
    # (inference_service is defined above but not routed to until its phase)
    location /api/ {
        proxy_pass http://monolith;
    }
}

The twist: we ran the monolith and the new service simultaneously for each migrated domain, with a shadow traffic system that sent copies of requests to both and compared responses. This caught 14 behavioral differences that our test suite missed, including a subtle timezone handling bug where the monolith used the server’s local timezone for date comparisons but the new billing service used UTC. That bug would have caused incorrect invoice amounts for customers in time zones behind UTC.
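To make the timezone discrepancy concrete, here is a small illustration of how the same instant lands on different calendar dates depending on which clock you ask. The UTC-8 server offset and the sample timestamp are assumptions chosen for the example, not values from the incident.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a server whose local clock is 8 hours behind UTC.
SERVER_LOCAL = timezone(timedelta(hours=-8))

# A usage event recorded two hours after midnight UTC on March 1.
event_at = datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc)

utc_date = event_at.astimezone(timezone.utc).date()
local_date = event_at.astimezone(SERVER_LOCAL).date()

# The same event falls into different billing months.
print("UTC date:  ", utc_date)
print("local date:", local_date)
print("same billing month:", utc_date.month == local_date.month)
```

Any date comparison near a period boundary gives a different answer in the two systems, which is exactly the kind of difference a response-level comparison surfaces and a unit test with fixed midday timestamps does not.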

The shadow traffic system was not complex — roughly 200 lines of Python that intercepted responses from both the monolith and the new service, normalized the JSON (sorted keys, rounded floats to 2 decimal places, ignored non-deterministic fields like timestamps), and logged any differences. We ran shadow traffic for 2 weeks per service before switching production traffic. Of the 14 behavioral differences we caught, 9 were bugs in the new service, 3 were bugs in the monolith that had gone unnoticed for years, and 2 were intentional improvements that we had not documented.
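The core of the comparison step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual 200-line tool: the ignored field names in `IGNORED_FIELDS` are hypothetical, and request interception and logging are left out.

```python
import json

# Assumption: names of non-deterministic fields to skip when comparing.
IGNORED_FIELDS = {"timestamp", "created_at", "request_id"}


def normalize(value):
    """Recursively sort keys, drop ignored fields, and round floats to 2 places."""
    if isinstance(value, dict):
        return {
            key: normalize(item)
            for key, item in sorted(value.items())
            if key not in IGNORED_FIELDS
        }
    if isinstance(value, list):
        return [normalize(item) for item in value]
    if isinstance(value, float):
        return round(value, 2)
    return value


def diff_responses(monolith_body: str, service_body: str):
    """Return None when the normalized bodies match, else both normalized forms."""
    left = normalize(json.loads(monolith_body))
    right = normalize(json.loads(service_body))
    if left == right:
        return None
    return {"monolith": left, "service": right}


same = diff_responses(
    '{"amount": 10.004, "id": 1, "timestamp": "2024-01-01T00:00:00Z"}',
    '{"timestamp": "2024-01-01T00:00:01Z", "id": 1, "amount": 10.0}',
)
different = diff_responses('{"id": 1, "amount": 10.5}', '{"id": 1, "amount": 10.0}')
print(same, different)
```

Rounding before comparing keeps harmless float formatting differences out of the diff log, at the cost of hiding sub-cent discrepancies, so the rounding precision is worth choosing deliberately for billing data.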

Step 3: Database Decomposition (The Hardest Part)

Splitting the application code is straightforward. Splitting the database is where migrations fail. Our monolith used a single PostgreSQL database with 67 tables and 142 foreign key constraints. You cannot simply point two services at the same database and call it microservices — that creates a distributed monolith with a shared data dependency that couples every deployment.

We decomposed the database in three phases:

Phase A: Logical separation. We created PostgreSQL schemas within the same database instance, one per service domain. Each service’s database user had permissions only on its own schema. This gave us access control and visibility into which service touched which tables, without requiring data migration.

-- Phase A: Logical separation within same database
CREATE SCHEMA billing;
CREATE SCHEMA auth;
CREATE SCHEMA inference;

-- Move tables to their schemas
ALTER TABLE invoices SET SCHEMA billing;
ALTER TABLE subscriptions SET SCHEMA billing;
ALTER TABLE users SET SCHEMA auth;
ALTER TABLE sessions SET SCHEMA auth;

-- Restrict service users to their schemas
GRANT USAGE ON SCHEMA billing TO billing_service;
GRANT ALL ON ALL TABLES IN SCHEMA billing TO billing_service;
REVOKE ALL ON SCHEMA auth FROM billing_service;

Phase B: Cross-schema joins to API calls. The monolith had 31 queries that joined tables across what were now different schemas. We replaced each join with either an API call or event-driven data replication. For example, the billing service needed user email addresses to send invoices. Instead of joining billing.invoices with auth.users, the billing service now calls the auth service’s /users/{id} endpoint and caches the result for 5 minutes. This phase took 8 weeks, longer than we expected, because every cross-schema query needed individual analysis to determine whether a synchronous API call or an eventual-consistency event was the right replacement.
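The "call the auth service and cache for 5 minutes" replacement can be sketched as a small time-to-live cache around a fetch function. This is an illustrative sketch: the `TTLCache` class, the `fake_fetch_user` stand-in for the HTTP call to /users/{id}, and the injectable clock are assumptions made so the example runs without a network.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache fetch results per key for a fixed number of seconds."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no remote call
        value = self._fetch(key)  # stale or missing: call the owning service
        self._entries[key] = (now, value)
        return value


# Hypothetical stand-in for an HTTP GET to the auth service.
calls = []

def fake_fetch_user(user_id: str) -> dict:
    calls.append(user_id)
    return {"id": user_id, "email": f"{user_id}@example.com"}


fake_now = [0.0]
users = TTLCache(fake_fetch_user, ttl_seconds=300.0, clock=lambda: fake_now[0])
users.get("u1")
users.get("u1")      # served from cache
fake_now[0] = 301.0
users.get("u1")      # expired, fetched again
print(len(calls))
```

The trade-off is that billing may use an email address that is up to 5 minutes stale, which is acceptable for invoices but would not be for something like an entitlement check.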

Phase C: Physical separation. After all cross-schema dependencies were eliminated, we migrated each schema to its own database instance using PostgreSQL logical replication. The billing service got a dedicated RDS instance, the auth service got another, and so on. This phase took 3 months and was the most operationally risky — we had a 4-hour window where we ran dual-writes to both the old and new database instances to verify consistency before cutting over.

The dual-write verification was essential. We compared every row written to the new database against the corresponding row in the old database. In the first hour, we found 3 rows that differed because of a race condition in our replication setup where a write to the old database and a near-simultaneous write to the new database arrived in different order. We fixed the race condition, reset the verification window, and the remaining 3 hours were clean.
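The row-by-row comparison amounts to a keyed diff between two result sets. The sketch below assumes rows have already been fetched as dictionaries with an `id` primary key. The sample rows are made up, and a real check would stream rows in batches rather than hold them in memory.

```python
def compare_rows(old_rows, new_rows, key="id"):
    """Return (key, old_row, new_row) for every row that differs or is missing."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    mismatches = []
    for row_key in sorted(old_by_key.keys() | new_by_key.keys()):
        old, new = old_by_key.get(row_key), new_by_key.get(row_key)
        if old != new:
            mismatches.append((row_key, old, new))
    return mismatches


# Hypothetical rows: id 2 has diverged and id 3 never reached the new database.
old_db = [
    {"id": 1, "status": "paid"},
    {"id": 2, "status": "paid"},
    {"id": 3, "status": "open"},
]
new_db = [
    {"id": 1, "status": "paid"},
    {"id": 2, "status": "open"},
]
mismatches = compare_rows(old_db, new_db)
for row_key, old, new in mismatches:
    print(row_key, old, new)
```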

Step 4: Event-Driven Communication (Not HTTP for Everything)

The naive approach to microservices communication is synchronous HTTP calls between services. This creates a distributed system with the worst properties of both monoliths and microservices: a failure in any downstream service cascades to every upstream caller, latency accumulates with every hop in the call chain while availability multiplies down, and you need to handle retries, timeouts, and circuit breaking in every service.
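A quick back-of-the-envelope calculation shows why long synchronous chains hurt. In a sequential chain, per-hop latencies sum and per-hop availabilities multiply. The numbers below are illustrative assumptions, not measurements from our system.

```python
import math


def chain_latency_ms(hop_latencies_ms):
    """End-to-end latency of sequential calls is the sum of the hops."""
    return sum(hop_latencies_ms)


def chain_availability(hop_availabilities):
    """A request succeeds only if every hop succeeds, so availabilities multiply."""
    return math.prod(hop_availabilities)


# Assumed figures: three hops, and four services at 99.9% availability each.
latency = chain_latency_ms([2, 15, 180])
availability = chain_availability([0.999] * 4)
print(f"latency: {latency} ms, availability: {availability:.4%}")
```

Four services that are each "three nines" give a chain that is closer to 99.6%, which is the arithmetic behind keeping synchronous chains short.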

We use a hybrid approach: synchronous HTTP for queries that need real-time responses (“get user profile,” “check subscription status”) and asynchronous events via Amazon SQS for state changes (“user signed up,” “invoice paid,” “model training completed”).

// Event publishing (billing service)
// getCurrentCorrelationId() and createInvoice() are defined elsewhere in the service.
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

async function publishEvent(eventType: string, payload: Record<string, unknown>) {
  // Common envelope shared by every event the service emits
  const event = {
    eventType,
    payload,
    timestamp: new Date().toISOString(),
    source: 'billing-service',
    correlationId: getCurrentCorrelationId(),
    version: 1
  };

  await sqs.send(new SendMessageCommand({
    QueueUrl: process.env.EVENT_QUEUE_URL,
    MessageBody: JSON.stringify(event),
    // Exposed as a message attribute so consumers can filter without parsing the body
    MessageAttributes: {
      eventType: { DataType: 'String', StringValue: eventType }
    }
  }));
}

// Usage: after creating an invoice
// (simplified; the transactional outbox discussion below covers making this atomic)
const invoice = await createInvoice(data);
await publishEvent('invoice.created', {
  invoiceId: invoice.id,
  userId: invoice.userId,
  amount: invoice.amount,
  currency: invoice.currency
});

The version field in events is important. When you change the shape of an event payload, old consumers that have not been updated will break. We use event versioning so that consumers can specify which versions they understand, and the event router delivers the appropriate version. This decouples service deployment order — the billing service can ship a new event format without requiring the notification service to deploy simultaneously.
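One way to picture version-aware delivery: the router holds the renderings of an event that the producer can emit, and each consumer declares the versions it understands. The sketch below is a hypothetical illustration of that selection step only. The `select_version` function and the v1/v2 payload shapes are assumptions, not our router's actual interface.

```python
def select_version(available, supported):
    """Pick the highest event version that the consumer declares it understands.

    available: dict mapping version number -> payload rendering
    supported: set of version numbers the consumer can handle
    Returns (version, payload), or None if there is no overlap.
    """
    usable = sorted(version for version in available if version in supported)
    if not usable:
        return None
    best = usable[-1]
    return best, available[best]


# Hypothetical renderings of one invoice.created event.
invoice_created = {
    1: {"invoiceId": "inv_1", "amount": 49.0, "currency": "USD"},
    2: {"invoiceId": "inv_1", "amountCents": 4900, "currency": "USD"},
}

legacy_consumer = select_version(invoice_created, supported={1})
updated_consumer = select_version(invoice_created, supported={1, 2})
print(legacy_consumer[0], updated_consumer[0])
```

A consumer with no overlapping version gets `None`, which should be surfaced as an alert rather than silently dropped.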

We also learned to never publish events directly from inside a database transaction. If the transaction commits but the event publish fails, the system is in an inconsistent state (data changed but no one was notified). If the event publishes but the transaction rolls back, consumers react to a change that did not happen. We use the transactional outbox pattern: events are written to an outbox table within the same database transaction as the data change, and a separate poller reads from the outbox table and publishes to SQS. This guarantees that an event is published only if the data change committed, and that every committed change is eventually published. Delivery is at-least-once, because the poller can crash after publishing but before marking a row as sent, so consumers must be idempotent.
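The outbox pattern is easier to see in code than in prose. The sketch below uses an in-memory SQLite database so it runs anywhere. The table layout, the `create_invoice` and `poll_outbox` names, and the `publish` callback (standing in for the SQS send) are assumptions for illustration, not our production schema.

```python
import json
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, user_id TEXT, amount REAL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def create_invoice(conn, user_id, amount, fail=False):
    """Write the invoice and its event in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO invoices (user_id, amount) VALUES (?, ?)", (user_id, amount))
        invoice_id = cur.lastrowid
        payload = {"invoiceId": invoice_id, "userId": user_id, "amount": amount}
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("invoice.created", json.dumps(payload)))
        if fail:
            raise RuntimeError("simulated failure before commit")
    return invoice_id


def poll_outbox(conn, publish):
    """Publish unpublished rows in order; a row is marked only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # if this raises, the row is retried later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
create_schema(conn)
create_invoice(conn, "u1", 49.0)
try:
    create_invoice(conn, "u2", 10.0, fail=True)  # rolled back: no invoice, no event
except RuntimeError:
    pass

sent = []
poll_outbox(conn, lambda event_type, payload: sent.append((event_type, payload)))
print(sent)
```

Because a row is marked as published only after the publish call returns, a crash between the two steps results in a duplicate event on the next poll rather than a lost one.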

Step 5: Observability Is Not Optional

A monolith has one log stream, one process to debug, one set of metrics. After splitting into 11 services, you have 11 log streams, 11 sets of metrics, and request flows that span multiple services. Without distributed tracing, debugging a production issue in a microservices architecture is like solving a murder mystery where each witness only saw one room.

We standardized on OpenTelemetry for tracing, Prometheus for metrics, and Grafana Loki for logs. Every service propagates trace context via the traceparent header, and every log line includes the trace ID. When a user reports a slow API call, we search for the trace ID and see the entire request flow: API gateway (2ms) -> auth service (15ms) -> inference service (180ms) -> model cache miss -> model loading (3,200ms). That last number immediately tells us the problem is a cold model cache, not a code bug.
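In practice the OpenTelemetry SDK handles propagation for you, but it helps to know what is on the wire. The sketch below hand-parses the W3C `traceparent` header (version 00: a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags) and stamps the trace ID onto a log line. The function names and the log format are assumptions for illustration.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's parts, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def outgoing_traceparent(incoming=None):
    """Keep the caller's trace ID (or start a new trace) and mint a new span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    flags = ctx["flags"] if ctx else "01"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def format_log(message, traceparent):
    """Prefix a log line with the trace ID so logs can be joined to traces."""
    ctx = parse_traceparent(traceparent)
    trace_id = ctx["trace_id"] if ctx else "unknown"
    return f"trace_id={trace_id} {message}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
downstream = outgoing_traceparent(incoming)
print(format_log("model cache miss", downstream))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what makes the single-search workflow described above possible.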

The most valuable dashboard we built was not a metrics dashboard — it was a service dependency map that shows which services call which other services, with latency percentiles and error rates on each edge. This map exposed a dependency we had not intended: the notification service was making synchronous calls to the billing service to fetch invoice details for email templates. That synchronous dependency meant a billing service deploy could break notification delivery. We replaced it with an event that carries the necessary invoice details, eliminating the runtime coupling.
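The data behind such a map is a per-edge aggregation over call records. Here is a minimal sketch assuming each record is a `(caller, callee, latency_ms, ok)` tuple extracted from trace spans. The record shape, the nearest-rank percentile, and the sample data are assumptions, and a real dashboard would compute this in the metrics backend.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def build_edges(calls):
    """Aggregate call records into per-edge latency percentiles and error rates."""
    grouped = defaultdict(list)
    for caller, callee, latency_ms, ok in calls:
        grouped[(caller, callee)].append((latency_ms, ok))
    edges = {}
    for edge, samples in grouped.items():
        latencies = [latency for latency, _ in samples]
        errors = sum(1 for _, ok in samples if not ok)
        edges[edge] = {
            "calls": len(samples),
            "p50_ms": percentile(latencies, 50),
            "p95_ms": percentile(latencies, 95),
            "error_rate": errors / len(samples),
        }
    return edges


# Hypothetical records showing an unintended notification -> billing edge.
calls = [
    ("notification", "billing", 10, True),
    ("notification", "billing", 20, True),
    ("notification", "billing", 30, False),
    ("notification", "billing", 40, True),
    ("gateway", "auth", 15, True),
]
edges = build_edges(calls)
for edge, stats in sorted(edges.items()):
    print(edge, stats)
```

An edge that nobody drew on the architecture diagram, such as notification to billing here, is the signal to look for.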

We also built an alert for “fan-out depth”: how many services a single incoming request touches. Our rule is that a request should touch at most 3 services (gateway + 2 backends), so requests that touch more than 3 are flagged for architectural review. This caught several cases where well-intentioned developers added service-to-service calls that created long dependency chains. If a request needs more than 3 services, the service boundaries are probably wrong.
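The fan-out check reduces to counting distinct services among the spans of one trace. The sketch below assumes spans are dictionaries with a `service` field and uses the at-most-3 rule as the default limit. Both the span shape and the function names are illustrative.

```python
MAX_SERVICES_PER_REQUEST = 3  # gateway + 2 backends, per the rule above


def services_touched(spans):
    """Distinct services that produced at least one span in the trace."""
    return {span["service"] for span in spans}


def needs_review(spans, limit=MAX_SERVICES_PER_REQUEST):
    """Flag a trace whose request touched more services than the limit allows."""
    return len(services_touched(spans)) > limit


# Hypothetical traces: the first stays within the rule, the second does not.
ok_trace = [
    {"service": "api-gateway", "duration_ms": 2},
    {"service": "auth", "duration_ms": 15},
    {"service": "inference", "duration_ms": 180},
    {"service": "inference", "duration_ms": 3200},  # second span, same service
]
deep_trace = ok_trace + [
    {"service": "billing", "duration_ms": 40},
    {"service": "notification", "duration_ms": 25},
]
print(needs_review(ok_trace), needs_review(deep_trace))
```

Counting distinct services rather than spans matters, because a single service often emits several spans for one request.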

What We Got Wrong

Three mistakes we made that you should avoid:

1. We extracted too many services too fast. Our initial plan was to extract all 11 services in parallel. By month 3, we had 6 partially-extracted services, none of which were fully independent, and the monolith was in a state of disrepair that made it hard to reason about. We should have extracted one service completely, learned from the process, and then started the next. After the painful month 3, we switched to sequential extraction and finished the remaining 5 services faster than the first 6.

2. We underestimated the testing burden. Integration tests that previously ran against one application now needed to spin up multiple services with their databases, event queues, and configuration. Our CI pipeline went from 8 minutes to 34 minutes. We eventually invested in contract testing with Pact, which lets each service test its API contracts independently without spinning up downstream services. This brought CI back to 12 minutes while providing stronger guarantees than our old integration tests.

3. We did not account for local development ergonomics. When every engineer needs to run 11 services locally to test a change, developer experience degrades sharply. We invested in two solutions: a docker-compose.dev.yml that runs all services with hot reloading, and a “service mesh lite” setup using Telepresence that lets a developer run one service locally while connecting to shared development instances of the others. The docker-compose approach works for small changes; Telepresence is better for deep debugging sessions.

The migration took 14 months, cost roughly 2,400 engineering hours, and was worth it. Our deployment frequency went from 3 deploys per day (shared) to 8-12 deploys per day (per service). Mean time to recovery for production incidents dropped from 45 minutes to 12 minutes because failures are isolated. And the teams can now work independently without stepping on each other’s code. But I would not do it again for a team smaller than 6 engineers or a codebase under 100,000 lines. Below that threshold, the operational overhead of microservices exceeds the organizational benefits.
