
How We Handle Incident Response as a Small Team

Harbor Software has eight engineers. We do not have a dedicated SRE team, a 24/7 NOC, or a VP of Incident Management. What we do have is a structured incident response process that has successfully handled 23 production incidents over the past 18 months, with a mean time to resolution of 47 minutes and zero incidents where we lost customer data. This post describes our process in detail, including the parts that are messy and imperfect, because most incident response literature is written for companies with hundreds of engineers and does not translate to small teams.

Article Overview

  1. The On-Call Rotation: Simple and Sustainable
  2. The Incident Process: Four Phases
  3. Severity Levels That Actually Mean Something
  4. Tooling That Works for Eight People
  5. The Postmortem Culture: Blameless Is Not Optional
  6. Communication During Incidents
  7. What We Still Get Wrong

HARBOR SOFTWARE · Engineering Insights

The On-Call Rotation: Simple and Sustainable

We run a weekly on-call rotation across four engineers (half the team). The other four engineers are not in the rotation because they work on client-facing projects that cannot tolerate unpredictable interruptions. On-call engineers rotate every Monday at 10 AM, and we use PagerDuty for alerting with a 5-minute escalation to a secondary on-call.

The critical design decision was what not to alert on. Our first iteration of alerting fired PagerDuty for every 500-level error, every queue depth spike, and every latency breach. This produced 15-20 pages per week, most of which were transient issues that resolved automatically. The on-call engineer was exhausted by Wednesday and started ignoring alerts by Friday. We were paying for an alerting system that was training our engineers to ignore alerts. That is worse than having no alerting at all, because it creates a false sense of security.

We rebuilt our alerting around the principle of alerting on customer impact, not on system behavior. The specific alerts that fire PagerDuty are:

  • Error rate exceeds 5% of requests over a 5-minute window. A single 500 error is noise; a sustained 5% error rate means customers are being affected. We chose 5% because our historical data showed that error rates below 3% always resolved within one polling interval (60 seconds), while rates above 5% consistently required human intervention.
  • P99 latency exceeds 3x the trailing 7-day P99 for more than 10 minutes. This catches real slowdowns while ignoring transient spikes from garbage collection, cold starts, or one-off expensive queries. The 3x multiplier and 10-minute duration were tuned over several months to eliminate false positives while catching genuine degradation.
  • Queue depth exceeds the processing capacity for more than 15 minutes. This means messages are accumulating faster than we can process them, and the backlog will grow until intervention. We calculate processing capacity as the average processing rate over the trailing hour, and alert when the queue depth exceeds 15 minutes of that rate.
  • Health check failures from any service for more than 3 consecutive checks (90 seconds). A single failed health check is a network blip. Three consecutive failures means the service is genuinely down.
  • Certificate expiration within 7 days. We automate cert renewal with cert-manager, but the alert catches cases where automation fails.
  • Disk usage exceeds 85% on any persistent volume. This gives us enough runway to address the issue before it becomes an outage.

This reduced our alert volume from 15-20 per week to 2-3 per week, every one of which requires action. On-call is now sustainable. The engineers in the rotation report that on-call weeks are mildly inconvenient rather than dreaded.
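As a sketch, the paging conditions above reduce to a single predicate over metrics that are assumed to be collected elsewhere. The metric names and dictionary shape here are invented for illustration; only the thresholds come from the rules listed.

```python
def should_page(metrics: dict) -> list[str]:
    """Return the list of paging conditions that are currently firing.

    Illustrative sketch of the rules described above; metric names are
    assumptions, thresholds match the post.
    """
    firing = []

    # Error rate exceeds 5% of requests over a 5-minute window.
    if metrics["error_rate_5m"] > 0.05:
        firing.append("error-rate")

    # P99 latency exceeds 3x the trailing 7-day P99 for more than 10 minutes.
    if (metrics["p99_breach_minutes"] > 10
            and metrics["p99_latency"] > 3 * metrics["p99_trailing_7d"]):
        firing.append("latency")

    # Queue depth exceeds 15 minutes of processing capacity
    # (average processing rate over the trailing hour).
    backlog_minutes = metrics["queue_depth"] / max(metrics["processing_rate_per_min"], 1)
    if backlog_minutes > 15 and metrics["backlog_breach_minutes"] > 15:
        firing.append("queue-backlog")

    # Three consecutive failed health checks (90 seconds at 30s intervals).
    if metrics["consecutive_health_failures"] >= 3:
        firing.append("health-check")

    # Certificate expiring within 7 days.
    if metrics["cert_days_remaining"] <= 7:
        firing.append("cert-expiry")

    # Disk usage above 85% on any persistent volume.
    if metrics["max_disk_usage"] > 0.85:
        firing.append("disk-usage")

    return firing
```

In a real deployment these conditions live in the alerting system's rule language rather than application code; the sketch is just to make the thresholds concrete.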

The Incident Process: Four Phases

When an alert fires, the on-call engineer follows a documented four-phase process. The process is written down in a Notion runbook that every on-call engineer has bookmarked. We do not rely on tribal knowledge, because tribal knowledge fails at 3 AM when the on-call engineer is half-asleep and cannot remember which database server the auth service uses.

Phase 1: Acknowledge and Assess (0-5 minutes). Acknowledge the PagerDuty alert. Open the relevant Grafana dashboard (we have pre-built dashboards for each service). Determine the blast radius: which services are affected, which customers are impacted, and whether the issue is getting worse. Post a one-line summary in the #incidents Slack channel. This initial assessment determines the severity level and whether to wake up additional engineers.

// Incident assessment template (posted to #incidents)
[SEV-X] Incident: [Brief description]
Impact: [Which customers/services are affected]
Trend: [Getting worse / Stable / Improving]
Status: Investigating
On-call: [@name]
Dashboard: [link]
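The template is simple enough to fill in programmatically. A minimal sketch (the function and any Slack-posting machinery around it are hypothetical, not Harbor's actual tooling):

```python
def format_assessment(sev: int, description: str, impact: str,
                      trend: str, on_call: str, dashboard: str) -> str:
    """Render the Phase 1 assessment message posted to #incidents.

    Illustrative helper; field names follow the template above.
    """
    return "\n".join([
        f"[SEV-{sev}] Incident: {description}",
        f"Impact: {impact}",
        f"Trend: {trend}",
        "Status: Investigating",
        f"On-call: @{on_call}",
        f"Dashboard: {dashboard}",
    ])
```

The rendered string can then be sent with whatever Slack client or webhook the team already uses.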

Phase 2: Mitigate (5-30 minutes). The goal of mitigation is to stop the customer impact, not to fix the root cause. This distinction is critical and frequently misunderstood. An engineer’s instinct is to debug the problem and deploy a fix. But debugging takes time, and customers are suffering while you debug. Mitigation means: stop the bleeding first, understand the cause later.

Common mitigations include: rolling back the last deployment, scaling up a service that is overloaded, failing over to a standby database, toggling a feature flag to disable a broken feature, or adding a rate limit to protect an overwhelmed dependency.

We maintain a list of pre-approved mitigations that the on-call engineer can execute without asking for approval:

  • Roll back any deployment made in the last 24 hours
  • Scale any service up to 3x its current replica count
  • Toggle any feature flag in the incident-mitigation category
  • Enable maintenance mode for non-critical services
  • Block a specific IP or IP range if there is an attack in progress
  • Restart any service (full stop and start, not rolling restart)

Anything outside this list requires a second engineer to approve. The approval is a Slack message, not a meeting. Speed matters during mitigation.
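The pre-approval list is mechanical enough to express as a check. This is an illustrative sketch under assumed action names and context fields, not our actual tooling:

```python
from datetime import datetime, timedelta

def is_preapproved(action: str, **ctx) -> bool:
    """Return True if the on-call engineer may act without a second approver.

    Action names and context fields are invented for illustration;
    the rules mirror the pre-approved mitigation list above.
    """
    if action == "rollback":
        # Only deployments made in the last 24 hours.
        return datetime.now() - ctx["deployed_at"] < timedelta(hours=24)
    if action == "scale":
        # Up to 3x the current replica count.
        return ctx["target_replicas"] <= 3 * ctx["current_replicas"]
    if action == "feature_flag":
        # Only flags in the incident-mitigation category.
        return ctx["flag_category"] == "incident-mitigation"
    if action == "maintenance_mode":
        # Only for non-critical services.
        return not ctx["service_critical"]
    if action in ("block_ip", "restart"):
        return True
    # Anything else (e.g. a database failover) needs a second
    # engineer's approval via Slack.
    return False
```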

Phase 3: Resolve (30 minutes-4 hours). Once customer impact is mitigated, we find and fix the root cause. This is the phase where we deploy a proper fix, not just a rollback or workaround. The on-call engineer can pull in other team members as needed. If resolution requires more than 2 hours, we schedule a handoff to ensure the on-call engineer is not working a full day plus an incident.

During this phase, the on-call engineer maintains a timeline in the Slack thread. Every significant action (“deployed fix to staging”, “confirmed fix in staging”, “deployed to production”, “verified in production”) is timestamped. This timeline becomes the raw material for the postmortem.
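The timeline itself is just an ordered list of timestamped entries. A minimal sketch of that record (in practice Slack's own message timestamps serve this purpose):

```python
from datetime import datetime, timezone

class IncidentTimeline:
    """Illustrative container for the timestamped timeline kept in the
    Slack thread; the postmortem's Timeline section is built from it."""

    def __init__(self):
        self.entries: list[tuple[str, str]] = []

    def log(self, event: str) -> None:
        # Record the event with a UTC timestamp.
        stamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
        self.entries.append((stamp, event))

    def render(self) -> str:
        # One "HH:MM:SS - event" line per entry.
        return "\n".join(f"{t} - {e}" for t, e in self.entries)
```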

Phase 4: Postmortem (within 48 hours). Every incident with customer impact gets a postmortem. The postmortem is written by the on-call engineer and reviewed by at least one other engineer. We use a structured template:

## Incident Postmortem: [Title]

**Date:** [Date]
**Duration:** [Time from alert to resolution]
**Severity:** [SEV1/SEV2/SEV3]
**Author:** [On-call engineer]

### Summary
[2-3 sentences describing what happened and the customer impact]

### Timeline
[Timestamped sequence of events from first alert to resolution]

### Root Cause
[Technical explanation of what went wrong]

### Contributing Factors
[What made the incident worse or harder to diagnose]

### What Went Well
[Parts of the response that worked effectively]

### Action Items
| Action | Owner | Priority | Due Date |
|--------|-------|----------|----------|
| [Specific, measurable action] | [Name] | [P0/P1/P2] | [Date] |

### Lessons Learned
[What we learned that applies beyond this specific incident]

Severity Levels That Actually Mean Something

We use three severity levels, and each one has a concrete definition tied to customer impact, not engineering judgment:

SEV1: Service is completely unavailable, OR data loss has occurred or is occurring, OR a security breach is in progress. Response time: immediate (PagerDuty page). All hands if needed. External communication to affected customers within 1 hour. In 18 months, we have had two SEV1 incidents.

SEV2: Service is degraded (slow, partial functionality lost) affecting more than 10% of customers, OR a critical business function is impaired. Response time: 15 minutes (PagerDuty page). On-call engineer plus one backup. Customer communication if impact exceeds 30 minutes. We have had 8 SEV2 incidents.

SEV3: Non-critical functionality is impaired, OR a small subset of customers is affected, OR a monitoring alert indicates a developing issue that has not yet reached customer impact. Response time: next business day if outside hours, 1 hour during business hours. On-call engineer only. No external communication. We have had 13 SEV3 incidents.

The bright line between SEV2 and SEV3 is “does this affect more than 10% of customers.” This is measurable (we track active user counts in real-time) and removes subjective judgment from severity classification during high-stress incidents.
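Because every input is measurable, classification reduces to a small function. A sketch using the definitions above (the parameter names are illustrative):

```python
def classify_severity(unavailable: bool, data_loss: bool, breach: bool,
                      pct_customers_affected: float,
                      critical_function_impaired: bool) -> int:
    """Map measurable incident facts to a severity level (1, 2, or 3),
    following the definitions in this section."""
    # SEV1: total unavailability, data loss, or an active security breach.
    if unavailable or data_loss or breach:
        return 1
    # SEV2: degradation affecting more than 10% of customers, or a
    # critical business function impaired.
    if pct_customers_affected > 0.10 or critical_function_impaired:
        return 2
    # SEV3: everything else that still warrants a response.
    return 3
```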

Tooling That Works for Eight People

Enterprise incident management tools (ServiceNow, Jira Service Management, FireHydrant) are designed for organizations with dedicated incident managers and formal ITSM processes. For a team of eight, they add overhead without proportional benefit. Our tooling stack is intentionally simple:

  • PagerDuty for alerting and on-call scheduling. This is the one tool we pay for because reliable alerting is non-negotiable. PagerDuty’s escalation policies, schedule management, and mobile app are well worth the cost.
  • Slack #incidents channel for real-time communication during incidents. We use a pinned post with the current incident status that gets updated as the situation evolves. All incident communication happens in a thread attached to the initial assessment post.
  • Grafana + Loki for dashboards and log aggregation. Pre-built dashboards for each service mean the on-call engineer can assess the situation in under 60 seconds. We invested two full days building comprehensive dashboards for each service, and it has paid for itself many times over in faster incident assessment.
  • Notion for runbooks and postmortems. Runbooks are living documents that get updated after every incident. Postmortems are permanent records that we review quarterly for patterns.
  • GitHub for tracking action items from postmortems as issues. Action items that are not tracked do not get done. We label them postmortem-action and review them in our weekly planning meeting.

Total cost: approximately $120/month ($29/user for PagerDuty x 4 on-call engineers, plus the Grafana Cloud free tier and our existing Notion and GitHub subscriptions). This is roughly the cost of one engineer-hour, and it has saved us hundreds of engineer-hours in faster incident resolution.

The Postmortem Culture: Blameless Is Not Optional

Blameless postmortems are the most important cultural element of incident response. If the engineer who caused an incident fears punishment, they will hide information, delay escalation, and avoid writing honest postmortems. This makes every future incident worse because the team loses the opportunity to learn.

Our postmortem rule is explicit: postmortems identify system failures, not personal failures. If an engineer deployed a bug that caused an outage, the postmortem asks “why did our deployment pipeline allow this bug to reach production?” not “why did this engineer write buggy code?” The answer might be: our test suite does not cover this code path, our staging environment does not replicate the production data pattern that triggered the bug, our code review process did not catch the issue because the reviewer lacked context on this subsystem.

Each of these system-level findings produces a concrete action item: add a test case, improve staging data fidelity, assign domain experts as required reviewers for specific code paths. These actions prevent recurrence. Blaming the engineer prevents nothing because you cannot prevent humans from making mistakes; you can only prevent mistakes from reaching customers.

We have had 23 incidents in 18 months, and none of them had the same root cause as a previous incident. We take that as strong evidence that blameless postmortems and tracked action items work. The system gets better after every incident because every incident produces specific improvements, and those improvements are tracked to completion.

Communication During Incidents

Technical incident response is only half the challenge. The other half is communication: keeping stakeholders informed without diverting engineering attention from the fix. For a small team, this is particularly challenging because the same person who is debugging the issue is often the person who needs to communicate about it.

We solved this by establishing a clear communication protocol that minimizes overhead. The on-call engineer posts updates to the #incidents Slack channel at three specific triggers: when the incident is acknowledged (Phase 1), when mitigation is in place (Phase 2), and when the incident is resolved (Phase 3). Between these triggers, no communication is expected. Stakeholders who want real-time updates can follow the Slack thread, but the on-call engineer is not obligated to respond to questions until the incident is mitigated.

For SEV1 incidents, we designate a communication lead (a different engineer from the one doing the debugging). The communication lead handles all stakeholder questions, drafts customer-facing status page updates, and posts internal updates. This separation of concerns is critical: an engineer who is context-switching between debugging and answering questions will do both poorly. The communication lead does not need deep technical expertise; they need the ability to translate technical status into business impact language.

For external communication, we use a status page hosted on Betterstack. The status page has three states: operational, degraded, and outage. We update it within 15 minutes of confirming customer impact for SEV1 and SEV2 incidents. The update includes: what is affected, what we are doing about it, and when we expect resolution. We do not include technical details on the status page because they confuse non-technical stakeholders and create support ticket volume.

One practice that has worked well for us: we send a brief email to affected customers after every SEV1 and SEV2 incident with a summary of what happened, how long it lasted, and what we are doing to prevent recurrence. This proactive communication builds trust. Customers who learn about an incident from your post-incident email are more understanding than customers who discover the incident themselves and have to contact support for information. The email is written by the communication lead, reviewed by the on-call engineer for technical accuracy, and sent within 24 hours of resolution.

We also maintain an internal incident history dashboard that shows all incidents from the past 12 months, their severity, duration, root cause category, and whether the action items from the postmortem have been completed. This dashboard is visible to the entire team and serves as both a historical record and a motivational tool: seeing the trend line of incidents declining from 3.2 per month to 1.4 per month over 18 months reinforces that the process is working and motivates continued investment in reliability.

What We Still Get Wrong

Our process is not perfect. Three honest admissions:

First, our runbooks go stale. We update them after incidents, but the runbook for a service that has not had an incident in six months drifts out of date as the service evolves. We are experimenting with quarterly runbook review sessions, but attendance is inconsistent. This is a cultural problem, not a tooling problem, and it is the hardest kind to fix.

Second, our on-call rotation of four engineers means each person is on call for one week every month. That is aggressive. During busy periods, the on-call week overlaps with feature work, and engineers report lower quality on both their incident response and their feature work. We are considering expanding the rotation to six engineers to reduce the burden to one week every six weeks, but this requires cross-training two more engineers on all of our services.

Third, we do not practice incident response. We should run game days (simulated incidents) quarterly. We do not, because there is always a more urgent feature to ship. This means that when a novel incident occurs (one that does not match any runbook), our response is slower and more chaotic than it should be. We know this is a gap. We have not closed it yet. If your team is in the same position, at least be honest about it, because acknowledging the gap is the first step toward closing it.
