Infrastructure as Code with Terraform: Lessons from Production

We migrated Harbor Software’s infrastructure to Terraform three years ago. In that time, we have managed over 400 resources across three AWS accounts, handled two major Terraform version upgrades, and recovered from exactly one state file corruption incident that cost us a full day of engineering time. Along the way we accumulated a set of hard-won lessons that would have saved us weeks of pain if someone had written them down for us.

Article Overview

  1. State Management Is the Whole Game
  2. Module Design: Opinionated Beats Flexible
  3. CI/CD Integration: The Plan/Apply Split
  4. Handling Secrets in Terraform
  5. Version Pinning and Upgrades
  6. What We Would Do Differently
  7. Terraform Import and State Surgery

HARBOR SOFTWARE · Engineering Insights

This is not a Terraform tutorial. You can find plenty of those. This is the collection of production lessons that tutorials never cover: state management pitfalls, module design mistakes, CI/CD integration patterns, and the organizational decisions that determine whether IaC improves your velocity or destroys it.

State Management Is the Whole Game

If you get state management wrong, nothing else matters. Terraform’s state file is the single source of truth for what exists in your cloud account. It maps the resources defined in your HCL code to real resources in AWS (or GCP, or Azure). Corrupt it, lose it, or let it drift from reality, and you are in for a very bad day that usually involves manually reconciling resources through the cloud console.

Remote State with Locking

Local state files are a non-starter for any team larger than one person. We use S3 + DynamoDB for remote state with locking, and this configuration has been rock-solid:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "harbor-terraform-state"
    key            = "production/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

The DynamoDB table provides locking, which prevents two people (or two CI jobs) from running terraform apply simultaneously. Without locking, concurrent applies can corrupt your state file. We learned this the hard way in month two when two developers ran apply at the same time and ended up with a state file that referenced resources in contradictory states. One developer was adding a subnet while the other was modifying the VPC’s CIDR block. The resulting state file had the new subnet referencing CIDR ranges that no longer existed. It took four hours of terraform import and terraform state rm to repair.

The S3 bucket itself should be treated as critical infrastructure. Enable versioning (so you can recover from state file corruption by restoring a previous version), enable MFA delete (so nobody can permanently delete a state file version, accidentally or maliciously, without a second factor), and restrict access to a dedicated IAM role that only Terraform uses. We also enable S3 object lock with a 30-day retention period as a last line of defense.
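A minimal sketch of that hardening in HCL, assuming AWS provider 4.x, where versioning, encryption, and object lock are separate resources. The resource labels, the GOVERNANCE retention mode, and the KMS default key are assumptions for illustration, not necessarily the exact production setup. MFA delete is left out because it generally has to be enabled with the bucket owner's root credentials outside Terraform.

```hcl
resource "aws_s3_bucket" "state" {
  bucket              = "harbor-terraform-state"
  object_lock_enabled = true # assumption: enabled when the bucket is first created

  lifecycle {
    prevent_destroy = true # refuse any plan that would delete the bucket
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_object_lock_configuration" "state" {
  bucket = aws_s3_bucket.state.id

  rule {
    default_retention {
      mode = "GOVERNANCE" # assumption; COMPLIANCE is the stricter mode
      days = 30
    }
  }

  depends_on = [aws_s3_bucket_versioning.state]
}

# Lock table for the S3 backend; the backend expects a string key named LockID
resource "aws_dynamodb_table" "locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Because a backend cannot store the state of the bucket it lives in before that bucket exists, a configuration like this is typically applied once from a small bootstrap directory.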

State File Per Environment Per Component

One of the biggest mistakes teams make is putting everything in a single state file. We started with one state file for all of production. It contained our VPC, RDS instances, ECS services, S3 buckets, IAM roles, CloudFront distributions, Route53 records, and more. Every terraform plan took 4 minutes because it had to refresh 200+ resources by calling AWS APIs for each one. Every apply was terrifying because a typo in a security group rule could theoretically cascade into a VPC change through dependency chains.

We now split state into component-based directories:

terraform/
  production/
    network/          # VPC, subnets, NAT gateways, route tables
    database/         # RDS, ElastiCache, parameter groups
    compute/          # ECS clusters, task definitions, services
    storage/          # S3 buckets, lifecycle policies
    dns/              # Route53 zones and records
    monitoring/       # CloudWatch alarms, dashboards, SNS topics
    iam/              # IAM roles, policies, instance profiles
  staging/
    network/
    database/
    ...

Each directory has its own state file, its own backend configuration, and its own terraform apply lifecycle. Plans are fast (under 10 seconds). The blast radius of any single apply is limited to one component. If you make a mistake in the compute configuration, you cannot accidentally destroy your network or database because they are in separate state files with separate apply operations.

Cross-component references use terraform_remote_state data sources or SSM parameters. We prefer SSM parameters for loose coupling:

# network/outputs.tf - publish to SSM
resource "aws_ssm_parameter" "private_subnet_ids" {
  name  = "/infrastructure/production/network/private_subnet_ids"
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id)
}

# compute/main.tf - consume from SSM
data "aws_ssm_parameter" "private_subnet_ids" {
  name = "/infrastructure/production/network/private_subnet_ids"
}

resource "aws_ecs_service" "app" {
  # name, cluster, task_definition, and other arguments omitted for brevity

  network_configuration {
    subnets = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
  }
}

SSM parameters are preferable to terraform_remote_state because they do not create a hard dependency between state files. With remote state data sources, if the network state file is locked or corrupted, the compute component cannot plan. With SSM parameters, the values are read from AWS directly and remain available as long as the parameter exists and the SSM API is reachable, regardless of what is happening to the network state file.
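For comparison, the terraform_remote_state approach would look roughly like the sketch below. It assumes the network component declares an output named private_subnet_ids, which is hypothetical here, since the SSM version publishes a parameter instead.

```hcl
# compute/main.tf - read the network component's outputs directly from its state
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "harbor-terraform-state"
    key    = "production/network/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Assumes network/ declares:
  #   output "private_subnet_ids" { value = aws_subnet.private[*].id }
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

Note that whoever runs the compute plan needs read access to the whole network state object, not just the one output, which is part of the coupling described above.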

Module Design: Opinionated Beats Flexible

Terraform modules are the primary abstraction mechanism, and most teams design them wrong. The temptation is to make modules extremely flexible with dozens of input variables that can customize every aspect of the created resources. This produces modules that are harder to use than raw resources, because you need to understand both the module’s abstraction and the underlying resource to configure them correctly.

The 80/20 Rule for Module Inputs

A well-designed module should handle 80% of use cases with zero or minimal configuration. The remaining 20% should be achievable with a small number of clearly named variables. Here is the difference:

# Bad: too many knobs, basically re-exposing all resource arguments
module "ecs_service" {
  source = "./modules/ecs-service"

  name                          = "api"
  cluster_id                    = aws_ecs_cluster.main.id
  task_definition_arn           = aws_ecs_task_definition.api.arn
  desired_count                 = 3
  deployment_minimum_percent    = 50
  deployment_maximum_percent    = 200
  health_check_grace_period     = 60
  enable_circuit_breaker        = true
  circuit_breaker_rollback      = true
  enable_execute_command        = false
  propagate_tags                = "SERVICE"
  scheduling_strategy           = "REPLICA"
  force_new_deployment          = true
  wait_for_steady_state         = true
  enable_service_connect        = false
  # ... 20 more variables
}

# Good: sensible defaults, minimal required inputs
module "ecs_service" {
  source = "./modules/ecs-service"

  name       = "api"
  cluster_id = aws_ecs_cluster.main.id
  image      = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v1.2.3"
  port       = 8080
  cpu        = 512
  memory     = 1024
  replicas   = 3

  environment = {
    DATABASE_URL = var.database_url
    REDIS_URL    = var.redis_url
  }
}

The good module encapsulates all the ECS boilerplate (task definition, service, target group, security group, IAM roles, CloudWatch log group, autoscaling configuration) behind a simple interface. The deployment configuration, circuit breaker, and other operational settings use sensible defaults that can be overridden if needed. Internally, the module creates 8-12 resources, but the consumer only needs to provide eight inputs.
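Inside the module, this comes down to which variables carry defaults. The sketch below is illustrative only: the variable names mirror the examples above, but the default values and the validation rule are assumptions, not the module's real interface.

```hcl
# modules/ecs-service/variables.tf (illustrative)

# Required: no default, so the caller must decide
variable "name" {
  type        = string
  description = "Service name, used to derive resource names"
}

variable "image" {
  type        = string
  description = "Container image URI including tag"
}

# Common knobs: the default covers the typical case
variable "replicas" {
  type        = number
  default     = 2
  description = "Desired task count"

  validation {
    condition     = var.replicas >= 1
    error_message = "The replicas value must be at least 1."
  }
}

variable "environment" {
  type        = map(string)
  default     = {}
  description = "Environment variables passed to the container"
}

# Operational settings: overridable, but rarely touched
variable "deployment_minimum_percent" {
  type    = number
  default = 50
}

variable "enable_circuit_breaker" {
  type    = bool
  default = true
}
```

A caller who never mentions the operational variables still gets the safe behavior, and one who needs something unusual can override a single value without forking the module.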

Composition Over Inheritance

Terraform does not support module inheritance, and that is actually a good thing. Inheritance creates tight coupling and makes it hard to understand what a module does without tracing through a chain of parent modules. Instead, compose larger modules from smaller ones:

# modules/web-app/main.tf - composes three smaller modules
module "ecs_service" {
  source   = "../ecs-service"
  name     = var.name
  image    = var.image
  port     = var.port
  replicas = var.replicas
}

module "alb_target" {
  source            = "../alb-target"
  name              = var.name
  vpc_id            = var.vpc_id
  listener_arn      = var.listener_arn
  port              = var.port
  health_check_path = var.health_check_path
}

module "cloudwatch_alarms" {
  source       = "../cloudwatch-alarms"
  service_name = var.name
  cluster_name = var.cluster_name
  sns_topic    = var.alarm_sns_topic
}

The web-app module composes three smaller modules. Each smaller module is independently testable, reusable in different contexts, and has a clear single responsibility. You can use ecs-service for a background worker that does not need an ALB, or cloudwatch-alarms for any ECS service regardless of how it was created. This composability is worth more than any amount of configurability.

CI/CD Integration: The Plan/Apply Split

Running Terraform in CI/CD is where most teams either get it right or create a security and reliability nightmare. The critical principle is: plan on every PR, apply only on merge to main. This ensures that every infrastructure change is reviewed before it takes effect, and that the main branch always represents the current state of infrastructure.

# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        component: [network, database, compute, storage, dns]
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.9
      - name: Terraform Init
        run: terraform init
        working-directory: terraform/production/${{ matrix.component }}
      - name: Terraform Plan
        # pipefail keeps a failed plan from being masked by tee's exit code
        run: |
          set -o pipefail
          terraform plan -no-color -out=tfplan 2>&1 | tee plan-output.txt
        working-directory: terraform/production/${{ matrix.component }}
      - name: Comment Plan on PR
        uses: actions/github-script@v6
        with:
          script: |
            const plan = require('fs').readFileSync(
              'terraform/production/${{ matrix.component }}/plan-output.txt', 'utf8'
            );
            const truncated = plan.length > 60000
              ? plan.substring(0, 60000) + '\n... (truncated)'
              : plan;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan: ${{ matrix.component }}\n\`\`\`\n${truncated}\n\`\`\``
            });

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    strategy:
      matrix:
        component: [network, database, compute, storage, dns]
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.9
      - name: Terraform Apply
        run: terraform init && terraform apply -auto-approve
        working-directory: terraform/production/${{ matrix.component }}

The plan output is posted as a PR comment so reviewers can see exactly what infrastructure changes will be made. This is invaluable. Reviewing HCL code alone does not tell you whether a change will create a new resource, modify an existing one, or destroy and recreate one (which can cause downtime). The plan output shows the exact delta. The apply job uses GitHub Environments with required reviewers for production, adding a manual approval gate even after the code is merged.

Detecting Drift

Infrastructure drift happens when someone makes a change through the AWS console, CLI, or another tool without updating Terraform. This is inevitable in any organization. Someone will click a button in the console during an incident. A support engineer will modify a security group rule to debug a connectivity issue. An automated process will update a resource that Terraform also manages.

We run a scheduled drift detection job daily:

# .github/workflows/drift-detection.yml
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        component: [network, database, compute, storage, dns, monitoring, iam]
    steps:
      - uses: actions/checkout@v3
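      - uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: 1.3.9
          # Disable the wrapper script so the shell sees Terraform's real exit code
          terraform_wrapper: false
      - name: Terraform Plan (Drift Check)
        id: plan
        run: |
          terraform init
          # Without pipefail the step would report tee's exit code (always 0)
          set -o pipefail
          terraform plan -detailed-exitcode -no-color 2>&1 | tee drift-output.txt
        working-directory: terraform/production/${{ matrix.component }}
        continue-on-error: true
      - name: Notify on Drift
        # Fires on exit code 2 (drift) and also on exit code 1 (plan error)
        if: steps.plan.outcome == 'failure'
        env:
          # Assumes a repository secret named SLACK_WEBHOOK
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text": "Infrastructure drift detected in production/${{ matrix.component }}. Check the daily drift report."}'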

The -detailed-exitcode flag makes Terraform return exit code 0 (no changes), 1 (error), or 2 (changes detected, meaning drift). This has caught unauthorized changes within 24 hours on three separate occasions, including a security group modification during an incident that was never reverted and would have left a port open indefinitely.
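When an automated process legitimately owns an attribute, a daily check will flag it every morning. One common way to keep the signal clean is a lifecycle ignore_changes rule. The sketch below assumes an ECS service whose desired_count is adjusted by autoscaling; the resource references reuse names from the earlier examples and are otherwise hypothetical.

```hcl
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3 # initial value only; autoscaling adjusts it afterwards

  lifecycle {
    # Do not report or revert changes the autoscaler makes to desired_count
    ignore_changes = [desired_count]
  }
}
```

Use this sparingly: every ignored attribute is one that drift detection can no longer see.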

Handling Secrets in Terraform

Never put secrets in Terraform state or variable files. This is a common mistake because Terraform’s state file stores the values of all resources, including sensitive ones, in plaintext JSON. Even with encrypted S3 state storage, anyone with read access to the state file can see secret values.

We use AWS Secrets Manager for application secrets and reference them in Terraform without exposing the values:

# Create the secret container in Terraform
resource "aws_secretsmanager_secret" "database_url" {
  name                    = "production/database-url"
  description             = "Production database connection string"
  recovery_window_in_days = 7
}

# aws_secretsmanager_secret.redis_url (referenced below) is declared the same way

# The actual secret value is set manually via AWS CLI or console
# Terraform only manages the secret's existence, policies, and rotation

# Reference in ECS task definition - value is resolved at runtime by ECS
resource "aws_ecs_task_definition" "app" {
  family = "app"
  # execution_role_arn omitted; that role must be allowed to read these secrets

  container_definitions = jsonencode([{
    name  = "app"
    image = var.image
    secrets = [
      {
        name      = "DATABASE_URL"
        valueFrom = aws_secretsmanager_secret.database_url.arn
      },
      {
        name      = "REDIS_URL"
        valueFrom = aws_secretsmanager_secret.redis_url.arn
      }
    ]
  }])
}

This pattern keeps secret values out of the state file entirely. Terraform manages the secret resource (its existence, IAM policies, rotation configuration, tags), but the actual secret value is injected at runtime by ECS, which retrieves it from Secrets Manager using the task execution role (not the task role the application code runs under). The secret value never appears in terraform plan output, state files, or CI logs.
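For the runtime injection to work, the ECS task execution role needs permission to read those specific secrets. A minimal sketch, assuming an execution role declared elsewhere as aws_iam_role.ecs_execution (a hypothetical name) and the two secret resources from the example above:

```hcl
# Allow the ECS task execution role to read only the secrets this service uses
data "aws_iam_policy_document" "read_app_secrets" {
  statement {
    effect  = "Allow"
    actions = ["secretsmanager:GetSecretValue"]
    resources = [
      aws_secretsmanager_secret.database_url.arn,
      aws_secretsmanager_secret.redis_url.arn,
    ]
  }
}

resource "aws_iam_role_policy" "execution_read_secrets" {
  name   = "read-app-secrets"
  role   = aws_iam_role.ecs_execution.id # hypothetical execution role
  policy = data.aws_iam_policy_document.read_app_secrets.json
}
```

The policy references only ARNs, so nothing sensitive lands in state. If the secrets are encrypted with a customer-managed KMS key, the role will likely also need kms:Decrypt on that key.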

Version Pinning and Upgrades

Pin everything. Provider versions, module versions, Terraform itself. Unpinned versions will break your builds on a random Tuesday morning when a new provider version introduces a breaking change or deprecates a resource attribute you depend on.

terraform {
  required_version = "~> 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.60.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.4.0"
    }
  }
}

We use the pessimistic constraint operator (~>) with a full three-part version, which allows patch version updates but blocks minor and major version changes. This means ~> 4.60.0 allows 4.60.1, 4.60.2, etc., but not 4.61.0 or 5.0.0. (Written with two parts, ~> 4.60 would also allow 4.61 and every later 4.x release.) Upgrades are deliberate: we create a branch, update the version constraint, run terraform init -upgrade, review the plan carefully (paying special attention to any resource replacements), test in staging, and merge.
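Module versions can be pinned in a similar spirit. The sketch below shows the two usual forms; the registry address, Git URL, and version numbers are hypothetical and only illustrate the syntax.

```hcl
# Registry module: the version argument accepts the same constraint syntax
module "ecs_service" {
  source  = "app.terraform.io/example-org/ecs-service/aws" # hypothetical private registry address
  version = "~> 1.4.0"

  # ... module inputs omitted
}

# Git-hosted module: the version argument is not supported here,
# so pin by appending ?ref= with a tag or commit
module "alb_target" {
  source = "git::https://example.com/harbor/terraform-modules.git//alb-target?ref=v1.4.0"

  # ... module inputs omitted
}
```

Provider selections are additionally recorded in the .terraform.lock.hcl dependency lock file, so committing that file keeps every machine and CI job on the same provider builds until someone runs terraform init -upgrade.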

For major Terraform version upgrades (0.12 to 0.13, 0.14 to 1.0, etc.), we always test in staging first and keep a rollback plan. Before 1.0, each 0.x release could carry breaking changes, so we treated every one of them as a major upgrade. The 0.12 to 0.13 upgrade took us a full day because of the new provider source syntax and the required state migration. The 1.0 to 1.3 upgrade was smooth because HashiCorp committed to backward compatibility within the 1.x line. The lesson: minor version upgrades are routine, major version upgrades are projects.

What We Would Do Differently

If we were starting over today, three things would change:

  1. Terragrunt from day one. We spent weeks building our own wrapper scripts for DRY backend configuration and environment management. Terragrunt solves all of this out of the box. The generate blocks for backend configuration, the include blocks for shared variables, and the dependency blocks for cross-component ordering would have saved us hundreds of lines of duplicated configuration and several custom shell scripts.
  2. Smaller modules earlier. Our first modules were monolithic. The ecs-service module created 14 resources including the ALB target group, security group, IAM role, CloudWatch log group, and autoscaling policies. When we needed an ECS service without an ALB (for a background worker), we had to refactor the entire module and update every consumer. Start small, compose up.
  3. Policy as code from the start. We added Checkov and OPA policies after the fact, which meant retrofitting existing resources to pass policy checks. Starting with policies on day one means every resource is compliant from birth, and you never accumulate a backlog of non-compliant infrastructure.
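To make the first point concrete, here is a rough sketch of what the Terragrunt layout could look like, based on its generate, include, and dependency blocks. The file locations and the private_subnet_ids output are assumptions for illustration; the bucket and table names are the ones from the backend example earlier.

```hcl
# terraform/terragrunt.hcl (root, hypothetical location)
# Generates backend.tf in every component, with a key derived from its path
generate "backend" {
  path      = "backend.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
terraform {
  backend "s3" {
    bucket         = "harbor-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
EOF
}

# terraform/production/compute/terragrunt.hcl (child, hypothetical location)
# Pulls in the root configuration and declares an ordering dependency on network
include "root" {
  path = find_in_parent_folders()
}

dependency "network" {
  config_path = "../network"
}

inputs = {
  # Assumes the network component exposes an output with this name
  private_subnet_ids = dependency.network.outputs.private_subnet_ids
}
```

The two files are shown in one block for brevity; in practice each lives at the path named in its comment.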

Terraform Import and State Surgery

No matter how disciplined your team is, you will eventually need to bring existing resources under Terraform management or fix state inconsistencies. The terraform import command maps an existing cloud resource to a Terraform resource definition, adding it to the state file without creating or modifying the actual resource.

# Import an existing S3 bucket into Terraform management
terraform import aws_s3_bucket.legacy_uploads harbor-legacy-uploads

# Import an existing RDS instance
terraform import aws_db_instance.legacy_database harbor-production-db

# After import, run terraform plan to see what Terraform thinks
# needs to change. Often the plan shows "changes" because your
# HCL does not perfectly match every attribute of the existing resource.
# Adjust your HCL until the plan shows no changes.

State surgery via terraform state commands is the last resort for fixing state corruption or reorganizing resources between state files. The most common operations we use:

  • terraform state list: See all resources in the current state. Essential for understanding what Terraform manages before making changes.
  • terraform state show aws_ecs_service.api: See the full attributes of a specific resource in state. Useful for debugging drift between state and reality.
  • terraform state mv aws_s3_bucket.old_name aws_s3_bucket.new_name: Rename a resource in state without destroying and recreating it. We use this when refactoring module structures.
  • terraform state rm aws_ecs_service.removed: Remove a resource from state without destroying the actual cloud resource. Useful when you want Terraform to stop managing a resource that was moved to another state file or is being managed manually.

Before any state surgery, always create a backup of the state file. With S3 versioning enabled on your state bucket, you can restore the previous version if something goes wrong. We also keep a local backup:

# Always backup state before surgery
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json

# Perform the state operation
terraform state mv module.old_name module.new_name

# Verify: plan should show no changes if the mv was correct
terraform plan

If the plan after state surgery shows unexpected changes, restore the backup immediately (for example with terraform state push, which may need -force because the backup carries an older serial number) and investigate before proceeding. State surgery mistakes can result in Terraform attempting to destroy and recreate resources, which for databases and other stateful resources means data loss.
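For simple renames within one state file, Terraform 1.1 and later also offer a declarative alternative to terraform state mv: a moved block, which is reviewed in the PR and carried out through the normal plan/apply flow. A minimal sketch reusing the addresses from the examples above:

```hcl
# Rename a resource without destroying it; the plan reports this as a move
moved {
  from = aws_s3_bucket.old_name
  to   = aws_s3_bucket.new_name
}

# Module calls can be moved the same way
moved {
  from = module.old_name
  to   = module.new_name
}
```

Moved blocks only work inside a single state file. Moving a resource between state files still requires the state commands (or a state rm followed by an import on the other side).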

Conclusion

Infrastructure as Code with Terraform is not about writing HCL files. It is about building a reliable, repeatable process for managing cloud infrastructure that scales with your team and your product. The technology is the easy part. The hard parts are state management strategy, module design discipline, CI/CD integration, secret handling, and version management.

Get those right and Terraform becomes the backbone of your infrastructure practice, giving you reproducibility, auditability, and the confidence to make changes without fear. Get them wrong and you end up with a state file that nobody trusts, modules that nobody wants to touch, and a CI pipeline that everyone is afraid to run. Start with remote state and locking. Split your state by component. Design opinionated modules. Pin your versions. Run plans on PRs and applies on merge. Detect drift daily. And never, ever put secrets in your state file.
