Docker Compose for AI Development: A Practical Guide

Running an AI/ML development environment involves juggling more services than a typical web application. You need your API server, a vector database, a message queue for async inference jobs, a model registry, a monitoring stack, and probably a PostgreSQL instance for application data. On a developer’s laptop, starting all of these services manually—in the right order, with the right configuration, connected to each other correctly—is an error-prone ritual that wastes 15–30 minutes every morning. Docker Compose is the tool that keeps this manageable. This post covers the specific patterns and pitfalls we have encountered running ML workloads in Docker Compose at Harbor Software, including GPU passthrough, large model volumes, health checks for slow-starting services, memory management, and multi-profile configurations for different team roles.

HARBOR SOFTWARE · Engineering Insights

The Base Configuration

Here is the docker-compose.yml we use for local development, stripped to its essentials:

version: '3.8'

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile.dev
    ports:
      - "8000:8000"
    volumes:
      - ./src:/app/src
      - model-cache:/models
    environment:
      - DATABASE_URL=postgresql://harbor:harbor@postgres:5432/harbor
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379/0
      - MODEL_CACHE_DIR=/models
    depends_on:
      postgres:
        condition: service_healthy
      qdrant:
        condition: service_healthy
      redis:
        condition: service_started
    deploy:
      resources:
        limits:
          memory: 4G

  worker:
    build:
      context: .
      dockerfile: Dockerfile.dev
    command: celery -A app.worker worker --loglevel=info --concurrency=2
    volumes:
      - ./src:/app/src
      - model-cache:/models
    environment:
      - DATABASE_URL=postgresql://harbor:harbor@postgres:5432/harbor
      - REDIS_URL=redis://redis:6379/0
      - MODEL_CACHE_DIR=/models
    depends_on:
      - api
    deploy:
      resources:
        limits:
          memory: 8G

  postgres:
    image: postgres:14-alpine
    environment:
      - POSTGRES_USER=harbor
      - POSTGRES_PASSWORD=harbor
      - POSTGRES_DB=harbor
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5433:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U harbor"]
      interval: 5s
      timeout: 5s
      retries: 5

  qdrant:
    image: qdrant/qdrant:v1.0.1
    ports:
      - "6333:6333"
    volumes:
      - qdrant-data:/qdrant/storage
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 30s

  redis:
    image: redis:7-alpine
    ports:
      - "6380:6379"

volumes:
  pgdata:
  qdrant-data:
  model-cache:

Several things to note here that are specific to AI workloads and differ from a standard web application Compose setup. Let me walk through each one.

Memory Management Is Non-Negotiable

ML models are memory hogs. A small sentence-transformers model (all-MiniLM-L6-v2) consumes about 250 MB of RAM when loaded. A mid-sized model like all-mpnet-base-v2 takes about 420 MB, and a BERT-large model takes about 1.3 GB. If you are loading multiple models for different tasks (text classification, embedding generation, summarization) or running batch inference that accumulates results in memory, your worker can easily consume 6–8 GB.

Without memory limits, a runaway inference job (processing a larger batch than expected, or hitting a memory leak in a preprocessing step) will consume all available memory on the host machine. The Linux kernel's OOM killer will then terminate whichever process it scores as the best candidate, which is often PostgreSQL (because of its large shared buffer pool), not the offending inference worker. PostgreSQL being killed mid-transaction forces crash recovery and can leave your development data in an inconsistent state. You will spend an hour figuring out why your database is inconsistent, and the root cause will be an ML batch job you ran three containers away.

The deploy.resources.limits.memory setting in the Compose file prevents this cascade. When a container hits its memory limit, Docker kills that specific container instead of letting it steal memory from neighbors. Set it explicitly for every service that touches ML models. Our convention:

  • API server: 4 GB (handles embedding generation for search queries, loads 1–2 models)
  • Worker: 8 GB (handles batch inference, model fine-tuning jobs, loads 2–4 models simultaneously)
  • PostgreSQL: 1 GB (default is fine for development, production needs more)
  • Qdrant: 2 GB (depends on index size; 2 GB handles ~1 million 384-dimensional vectors comfortably)
  • Redis: 512 MB (mostly used for Celery task queue, minimal data storage)

Total: 15.5 GB for the full stack. If your development machine has 16 GB of RAM, this is too tight: the host OS and other applications need at least 2–3 GB, so the limits oversubscribe the machine if every service peaks at once. Consider using quantized models (INT8 quantization uses 4x less memory than FP32 weights, with minimal quality loss) or offloading the vector database to a shared development server that the team accesses remotely.
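
The budget arithmetic above is easy to check in a few lines of Python. This is a minimal sketch: the limits are the ones from the list above, while the 3 GB host reserve and the function name are illustrative assumptions, not part of our tooling.

```python
# Per-service memory limits in GB, taken from the convention listed above.
LIMITS_GB = {
    "api": 4.0,
    "worker": 8.0,
    "postgres": 1.0,
    "qdrant": 2.0,
    "redis": 0.5,
}


def headroom_gb(host_ram_gb: float, host_reserve_gb: float = 3.0) -> float:
    """Return the RAM left after all container limits and the host reserve.

    A negative result means the limits oversubscribe the machine.
    host_reserve_gb is an assumed allowance for the OS and other applications.
    """
    return host_ram_gb - host_reserve_gb - sum(LIMITS_GB.values())


for ram in (16, 32):
    print(f"{ram} GB host: headroom {headroom_gb(ram):+.1f} GB")
```

A negative headroom does not mean the stack will fail to start, only that the limits cannot all be reached at once without pressuring the host.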

Model Caching With Named Volumes

The model-cache named volume is shared between the api and worker services. This is critical for two reasons:

  1. Download once, use everywhere. ML models are large (250 MB–2 GB each). Without a shared cache, each container downloads its own copy on first run. With four models, that is up to 8 GB of redundant downloads consuming bandwidth and adding 5–15 minutes to your first startup.
  2. Survive container rebuilds. Named volumes persist across docker compose down and docker compose up --build. When you rebuild your application containers (which happens every time you change requirements.txt), the model cache survives. Without persistence, every dependency change triggers a multi-gigabyte model re-download.

# In your application code, set the cache directory for all ML libraries.
# Do this before importing transformers, sentence_transformers, or torch,
# because some libraries read these variables when first imported.
import os

cache_dir = os.environ.get('MODEL_CACHE_DIR', '/models')
os.environ['TRANSFORMERS_CACHE'] = cache_dir
os.environ['SENTENCE_TRANSFORMERS_HOME'] = cache_dir
os.environ['TORCH_HOME'] = os.path.join(cache_dir, 'torch')

# Models downloaded once are reused across container restarts and rebuilds,
# which avoids the 5-15 minute re-download described above.

One trap: if you mount a host directory as the model cache (e.g., ./models:/models), file permission issues between your host user (typically UID 1000) and the container user (often root or a custom user) will cause cryptic download failures. The error messages from Hugging Face Hub are unhelpful: “OSError: Permission denied” without specifying which file or directory. Named volumes avoid this entirely because Docker manages the filesystem and permissions internally.

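Another trap: if multiple containers try to download the same model simultaneously (a race condition on first startup), you can get a corrupted model cache. We prevent this by having the API server's startup script download all required models before the worker starts. The shared volume ensures the worker sees the models the API server downloaded. Note that depends_on only enforces this ordering when the worker waits on the API's health check (condition: service_healthy); the short-form depends_on: - api shown in the base configuration waits only for the API container to start, not for its downloads to finish.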
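
Startup ordering is one guard against the download race; a complementary one is to make the download itself safe under concurrency. The sketch below is an illustration, not our startup script: `ensure_model` and the `fetch` callable are hypothetical names, `name` is assumed to be a filesystem-safe directory name, and `fcntl.flock` assumes Linux containers sharing a local named volume.

```python
import fcntl
import os
import tempfile
from pathlib import Path
from typing import Callable


def ensure_model(cache_dir: str, name: str, fetch: Callable[[Path], None]) -> Path:
    """Download a model into cache_dir at most once, even with concurrent callers.

    fetch(target_dir) is a hypothetical callable that writes the model files
    into target_dir; swap in your library's real download call.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    final = cache / name
    lock_path = cache / f"{name}.lock"

    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if final.exists():
                return final  # Another process already completed the download.
            # Download into a temporary directory on the same volume, then
            # rename, so a crash mid-download never leaves a half-written
            # model at the final path.
            tmp = Path(tempfile.mkdtemp(prefix=f".{name}-", dir=cache))
            fetch(tmp)
            os.replace(tmp, final)
            return final
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With this in place, whichever container reaches a model first downloads it, and the others block on the lock and then reuse the finished copy.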

Health Checks for Slow-Starting Services

Vector databases and ML model servers take significantly longer to start than a typical web service. PostgreSQL is ready in 2–3 seconds. Redis is ready instantly. But Qdrant needs 10–30 seconds to load its index into memory, depending on index size. A model server might need 45–90 seconds to load weights from disk into GPU memory.

Without health checks, depends_on only waits for the container to start (the process begins), not for the service to be ready (the service accepts connections). Your API server will crash on startup because it tries to connect to Qdrant before Qdrant has loaded its index. The crash is often non-obvious: the connection attempt times out after 5 seconds, the retry logic makes 3 attempts, and then the API server gives up and exits with a generic “ConnectionError” that does not mention Qdrant by name.
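
Health checks fix the ordering, but it also helps if the application's own retry logic names the dependency it gave up on. Here is a minimal sketch of that idea; `wait_for`, `DependencyNotReady`, and the `probe` callable are hypothetical names introduced for illustration, not part of any client library.

```python
import time
from typing import Callable


class DependencyNotReady(RuntimeError):
    """Raised when a named dependency never becomes reachable."""


def wait_for(
    name: str,
    probe: Callable[[], None],
    attempts: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe() until it succeeds; return the attempt number that worked.

    probe is a zero-argument callable that raises on failure, for example a
    function that opens a connection to the service and closes it again.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            probe()
            return attempt
        except Exception as exc:  # Any failure counts as "not ready yet".
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)
    # Include the service name so the log says which dependency was missing.
    raise DependencyNotReady(
        f"{name} not ready after {attempts} attempts: {last_error!r}"
    ) from last_error
```

Called as `wait_for("qdrant", probe)`, a failed startup now reports "qdrant not ready after 3 attempts" instead of a bare connection error.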

The start_period parameter is essential for ML services. It tells Docker to ignore health check failures during the initial startup window:

  qdrant:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s    # Check every 10 seconds
      timeout: 5s      # Each check has 5 seconds to respond
      retries: 10      # Fail after 10 consecutive failures
      start_period: 30s  # Don't count failures during first 30 seconds

Without start_period, every failed check during startup counts toward retries. The settings above tolerate roughly 100 seconds of failures even without it, but with tighter settings (for example interval: 5s and retries: 5, as used for PostgreSQL) a service that needs 30 seconds to load its index would be marked unhealthy during a perfectly normal startup. When that happens, a service that depends on it with condition: service_healthy is not started, and current Compose versions report a failed dependency.
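
To see how interval, retries, and start_period interact, it helps to estimate how long a container that never becomes ready stays in the "starting" state. The function below is a simplified model under stated assumptions: checks fire at multiples of the interval, every check fails, check duration is ignored, and failures are only counted once the start period has elapsed. The function name is illustrative.

```python
def seconds_until_unhealthy(interval_s: int, retries: int, start_period_s: int = 0) -> int:
    """Estimate when a never-ready container is marked unhealthy.

    Simplified model: checks run at interval_s, 2 * interval_s, ... seconds
    after container start, and a failed check only counts toward retries
    once start_period_s has elapsed.
    """
    if interval_s <= 0 or retries <= 0:
        raise ValueError("interval_s and retries must be positive")
    counted_failures = 0
    elapsed = 0
    while counted_failures < retries:
        elapsed += interval_s
        if elapsed >= start_period_s:
            counted_failures += 1
    return elapsed


# Qdrant settings from the configuration above.
print(seconds_until_unhealthy(interval_s=10, retries=10, start_period_s=30))
# The same settings without a start period.
print(seconds_until_unhealthy(interval_s=10, retries=10))
# PostgreSQL settings from the base configuration.
print(seconds_until_unhealthy(interval_s=5, retries=5))
```

The third case is the one to watch: a 25-second budget is comfortable for PostgreSQL but would be too short for a service that takes 30 seconds to load.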

We also add health checks to our own API service so that downstream services (and the developer’s browser) know when it is truly ready:

  api:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 45s  # Models take time to load on first startup
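
For the health check to mean "ready" rather than "process is running", the endpoint should fail until the models are loaded. The following is a standard-library sketch of that behavior, not our actual API code: `MODELS_READY`, `health_response`, and `HealthHandler` are illustrative names, and in a real service the same logic would live in your web framework's route handler.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler

# Set by the model-loading code once all models are in memory.
MODELS_READY = threading.Event()


def health_response(ready: bool) -> tuple[int, bytes]:
    """Return (status code, JSON body) for the /health endpoint."""
    if ready:
        return 200, json.dumps({"status": "ok"}).encode()
    # curl -f treats any status >= 400 as a failed check.
    return 503, json.dumps({"status": "loading"}).encode()


class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler that serves only GET /health."""

    def do_GET(self) -> None:
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(MODELS_READY.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it standalone, serve the handler with `http.server.HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()` and call `MODELS_READY.set()` from the thread that finishes loading models.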

GPU Passthrough for Local Inference

If your development machine has an NVIDIA GPU, you can pass it through to Docker containers for dramatically faster inference. A batch of 1,000 embeddings that takes 45 seconds on CPU completes in 3 seconds on a mid-range GPU (RTX 3060). This acceleration is not just nice to have—it changes the development workflow from “run inference, go get coffee, check results” to “run inference, see results immediately.”

  worker:
    build:
      context: .
      dockerfile: Dockerfile.gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          memory: 8G

This requires the NVIDIA Container Toolkit installed on the host machine. On Ubuntu, installation is straightforward once NVIDIA's package repository has been added (apt install nvidia-container-toolkit, followed by nvidia-ctk runtime configure --runtime=docker and a Docker restart); other distributions require additional setup. Important caveats that the documentation does not emphasize enough:

  • Linux hosts only for full GPU support. On macOS and Windows (including WSL2), GPU passthrough is either absent or requires complex additional configuration that is fragile across Docker Desktop updates. For macOS with Apple Silicon, use the MPS backend in PyTorch directly on the host instead of trying to pass GPU through to Docker. We maintain a docker-compose.gpu.yml override file that only our Linux developers use.
  • Your Dockerfile must use an NVIDIA CUDA base image (nvidia/cuda:11.8.0-runtime-ubuntu22.04), which adds 3–4 GB to your image size. This is why we maintain separate Dockerfiles: Dockerfile.dev (CPU, 800 MB image) and Dockerfile.gpu (GPU, 4.5 GB image). Developers without GPUs should not pay the download cost of the CUDA runtime.
  • GPU memory is separate from the memory limit. The 8 GB RAM limit and GPU VRAM are independent pools. A model loaded into GPU memory does not count against the container’s RAM limit, and vice versa. Monitor GPU memory usage separately using nvidia-smi inside the container.
  • GPU sharing between containers requires configuration. By default, multiple containers cannot share a single GPU efficiently. If your API and worker both need GPU access, use NVIDIA MPS (Multi-Process Service) or accept that only one container uses the GPU at a time.
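
Because the same application code runs in the CPU image, the GPU image, and directly on an Apple Silicon host, device selection should degrade gracefully. This is a minimal sketch assuming PyTorch is the inference framework; `pick_device` is an illustrative helper name.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate.
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU passed through to the container.
    # The MPS backend only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon, running directly on the host.
    return "cpu"


print(pick_device())
```

The returned string can be passed wherever your library accepts a device name, so the same code path works for developers with and without a GPU.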

The Development Dockerfile

AI development Dockerfiles differ from production Dockerfiles in one important way: you need system-level dependencies for scientific computing libraries that are not required at runtime.

# Dockerfile.dev
FROM python:3.10-slim

# System dependencies for numpy, scipy, and tokenizers (Rust-based)
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source code is bind-mounted, not copied
# COPY ./src /app/src  # Not needed for dev

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]

The --reload flag combined with the bind-mounted ./src:/app/src volume gives you hot-reloading: change a Python file on your host machine and the server restarts automatically inside the container. This brings Docker-based development close to the ergonomics of running directly on the host. The reload is fast (under 2 seconds for a typical change) because no image rebuild is involved: only the application code changes, while the Python interpreter and all dependencies are already baked into the image layers.

The build-essential package is needed because several Python ML libraries (numpy, scipy, tokenizers) have C/Rust extensions that require a compiler whenever pip cannot find a prebuilt wheel for your platform and has to build from source. In a production multi-stage Dockerfile, you would compile these in a build stage and copy only the compiled binaries to the runtime stage. For development, the simpler single-stage approach is fine because image size is less critical.

Multi-Profile Configuration

Not every developer needs every service running. A frontend engineer working on the dashboard does not need the GPU worker or the vector database. A data scientist experimenting with models does not need the full monitoring stack. Docker Compose profiles let you define subsets of the stack that can be started independently:

services:
  api:
    profiles: ["core", "full"]
    # ...

  worker:
    profiles: ["core", "full"]
    # ...

  postgres:
    profiles: ["core", "full"]
    # ...

  qdrant:
    profiles: ["core", "full"]
    # ...

  redis:
    profiles: ["core", "full"]
    # ...

  prometheus:
    profiles: ["monitoring", "full"]
    image: prom/prometheus:v2.37.0
    # ...

  grafana:
    profiles: ["monitoring", "full"]
    image: grafana/grafana:9.3.0
    # ...

  jupyter:
    profiles: ["notebook", "full"]
    image: jupyter/scipy-notebook:latest  # pin a specific tag in practice (see Pitfall 5)
    volumes:
      - ./notebooks:/home/jovyan/work
      - model-cache:/models
    ports:
      - "8888:8888"
    # ...

# Start only core services (API + worker + databases)
docker compose --profile core up

# Start core + monitoring
docker compose --profile core --profile monitoring up

# Start everything
docker compose --profile full up

# Start core services plus the notebook for data science work
docker compose --profile core --profile notebook up

This reduces startup time and memory usage for developers who only need a subset of the stack. Our “core” profile uses 15.5 GB of RAM at peak; the “full” profile with monitoring and notebooks uses 19 GB. The “core” profile starts in 15 seconds; “full” takes 45 seconds because Grafana and Prometheus need to initialize their storage.
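
The selection rule behind profiles is simple enough to model: a service with no profiles always starts, and a service with profiles starts only when at least one of them is active. The sketch below applies that rule to the profile assignments shown above; the dictionary and function names are illustrative.

```python
# Profile assignments copied from the Compose snippet above.
SERVICE_PROFILES = {
    "api": {"core", "full"},
    "worker": {"core", "full"},
    "postgres": {"core", "full"},
    "qdrant": {"core", "full"},
    "redis": {"core", "full"},
    "prometheus": {"monitoring", "full"},
    "grafana": {"monitoring", "full"},
    "jupyter": {"notebook", "full"},
}


def services_started(active_profiles: set[str]) -> set[str]:
    """Return the services Compose would start for the given active profiles."""
    return {
        name
        for name, profiles in SERVICE_PROFILES.items()
        # No profiles means "always on"; otherwise one profile must be active.
        if not profiles or profiles & active_profiles
    }


print(sorted(services_started({"core", "monitoring"})))
print(sorted(services_started(set())))
```

One consequence worth knowing: because every service here has a profile, a plain `docker compose up` with no `--profile` flag starts nothing.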

Common Pitfalls and How to Avoid Them

Pitfall 1: Using docker compose up --build with a large model cache in the build context. If your Dockerfile COPYs the model directory into the image, every rebuild re-copies gigabytes of model files, even if they have not changed. Use bind mounts or named volumes for model caches, never COPY. Also ensure your .dockerignore excludes the model directory if it exists locally.

Pitfall 2: Running pip install on every build. Layer your Dockerfile so that requirements.txt is copied and installed before your source code. Docker caches layers sequentially—if requirements.txt has not changed, the pip install layer is reused from cache. If you copy source code first, any code change invalidates the pip layer and triggers a full reinstall (2–5 minutes for a typical ML project).

Pitfall 3: Port conflicts. We map container ports to non-default host ports (PostgreSQL on 5433 instead of 5432, Redis on 6380 instead of 6379). This avoids conflicts with services running directly on the host machine, which is common when developers have PostgreSQL installed locally for other projects. The environment variables inside containers still use the standard ports (5432, 6379) because inter-container communication uses the Docker network, not host ports.
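
The flip side of this mapping is that scripts run on the host (migrations, ad hoc queries) need different URLs than the containers use. A small sketch of the translation, assuming the port mappings from the base configuration; `HOST_PORTS` and `host_url` are illustrative names.

```python
from urllib.parse import urlsplit, urlunsplit

# Published host ports from the Compose file: service name -> host port.
HOST_PORTS = {"postgres": 5433, "redis": 6380, "qdrant": 6333}


def host_url(container_url: str) -> str:
    """Rewrite a container-network URL so it works from the host machine."""
    parts = urlsplit(container_url)
    service = parts.hostname
    if service not in HOST_PORTS:
        return container_url  # Not one of our services; leave it untouched.
    # Rebuild the credentials portion, if the URL had one.
    auth = ""
    if parts.username:
        auth = parts.username
        if parts.password:
            auth += f":{parts.password}"
        auth += "@"
    netloc = f"{auth}localhost:{HOST_PORTS[service]}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(host_url("postgresql://harbor:harbor@postgres:5432/harbor"))
print(host_url("redis://redis:6379/0"))
```

Inside the Compose network the original URLs stay as they are; only tools running outside the containers need the rewritten form.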

Pitfall 4: Forgetting .dockerignore. Without a .dockerignore file, Docker’s build context includes your entire project directory—including the .git directory (often 100–500 MB), node_modules, __pycache__, local model caches, training data, and checkpoint files. We have seen build contexts exceed 10 GB because a developer had a training dataset in the project root. This slows every build dramatically because Docker copies the entire context to the daemon before starting.

# .dockerignore
.git
__pycache__
*.pyc
.env
.env.*
models/
data/
checkpoints/
.mypy_cache
.pytest_cache
node_modules
*.egg-info
dist/
build/
.coverage
htmlcov/
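
Before and after adding a .dockerignore, it is worth measuring how much the build context actually shrinks. The sketch below gives a rough estimate only: it matches patterns against individual file and directory names with `fnmatch`, which approximates but does not reproduce Docker's .dockerignore matching rules. The function names are illustrative.

```python
import fnmatch
import os


def _ignored(name: str, patterns: list[str]) -> bool:
    """True if a file or directory name matches any ignore pattern."""
    return any(fnmatch.fnmatch(name, pattern.rstrip("/")) for pattern in patterns)


def context_size_bytes(root: str, patterns: list[str]) -> int:
    """Approximate the build-context size under root, skipping ignored names."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if not _ignored(d, patterns)]
        for filename in filenames:
            if not _ignored(filename, patterns):
                # lstat avoids following symlinks out of the project tree.
                total += os.lstat(os.path.join(dirpath, filename)).st_size
    return total
```

Calling `context_size_bytes(".", [])` and then again with the patterns from your .dockerignore shows roughly how many bytes each build no longer has to transfer.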

Pitfall 5: Not pinning image versions. Using postgres:latest instead of postgres:14-alpine means your development environment changes unpredictably when Docker pulls a new version. We pin every image to a specific version tag (at least the major version, as with postgres:14-alpine, and a full version where upgrades are riskier, as with qdrant/qdrant:v1.0.1) so that all developers run the same versions, and we update versions deliberately in PRs that include any necessary migration steps.

Debugging Tips

When things go wrong in a Docker Compose ML stack, these commands save significant time:

# View real-time logs for a specific service
docker compose logs -f worker

# Check memory usage per container
docker stats --no-stream

# Exec into a running container for debugging
docker compose exec worker bash

# Check if a service is healthy
docker compose ps

# View the full configuration including all variable substitutions
docker compose config

# Remove all volumes (nuclear option - re-downloads models)
docker compose down -v

The docker stats command is particularly valuable for ML workloads. It shows real-time memory usage per container, which lets you catch memory leaks and size your memory limits correctly. If your worker consistently uses 5.8 GB of its 8 GB limit, you know the limit is appropriate. If it is using 7.9 GB and occasionally getting OOM-killed, you need to either increase the limit or reduce the number of concurrent models loaded.
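
That check can be scripted. The sketch below parses the output of `docker stats --no-stream --format "{{.Name}},{{.MemUsage}}"` and flags containers close to their limit. The 90% threshold, the function names, and the container names in the sample are assumptions for illustration.

```python
import re

# Unit suffixes that can appear in the MemUsage column.
_UNIT_BYTES = {
    "B": 1,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
    "kB": 1000, "MB": 1000**2, "GB": 1000**3,
}


def to_bytes(text: str) -> float:
    """Convert a size such as '5.8GiB' to bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([A-Za-z]+)\s*", text)
    if match is None or match.group(2) not in _UNIT_BYTES:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _UNIT_BYTES[match.group(2)]


def near_limit(stats_output: str, threshold: float = 0.9) -> list[str]:
    """Return names of containers using at least threshold of their memory limit."""
    flagged = []
    for line in stats_output.strip().splitlines():
        name, usage = line.split(",", 1)          # "name,used / limit"
        used_text, limit_text = usage.split("/")  # "used " and " limit"
        if to_bytes(used_text) / to_bytes(limit_text) >= threshold:
            flagged.append(name)
    return flagged


sample = """harbor-worker-1,7.4GiB / 8GiB
harbor-api-1,1.2GiB / 4GiB
harbor-redis-1,12.5MiB / 512MiB"""
print(near_limit(sample))
```

In practice you would feed it the real command output, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.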

Conclusion

Docker Compose is indispensable for AI development environments. The key differences from standard web application setups are: explicit memory limits (ML models will consume everything you give them), named volumes for large model caches (download once, share everywhere), generous health check timeouts for slow-starting services (vector databases and model servers are not instant), profile-based configuration for different team roles, and careful .dockerignore files to prevent multi-gigabyte build contexts. Get these right, and your team spends its time on ML engineering instead of fighting local environment issues. Get them wrong, and you will spend every Monday morning debugging why the inference worker crashed over the weekend and took PostgreSQL with it.
