Secrets Management in Production: Beyond Environment Variables
Every production application has secrets: database passwords, API keys, TLS certificates, signing keys, encryption keys. The most common approach to managing these secrets is environment variables, and for many teams, the journey ends there: .env files in development, environment variables in CI, and deployment platform secrets in production. This works until it does not. After three separate incidents involving leaked or mismanaged secrets across Harbor Software’s client projects, we overhauled our secrets management approach entirely. This post covers why environment variables are insufficient and what to use instead.
Why Environment Variables Are Not Enough
Environment variables solve the first-order problem of keeping secrets out of source code. But they create second-order problems that become critical as your system grows:
No access control. Every process running in an environment has access to all environment variables in that environment. Your payment processing service can read the email service’s API key. A debug endpoint that dumps the process environment exposes every secret. A library that logs environment variables (for “debugging purposes”) exfiltrates every credential in the process. We audited our Node.js dependencies and found two packages that logged process.env to a third-party error tracking service. Neither was malicious; they were trying to capture diagnostic context. But the effect was the same: our database credentials were transmitted to an external service. There is no principle of least privilege with flat environment variables.
No audit trail. When a secret is accessed via an environment variable, there is no log entry. You cannot answer the question “which process read the database password at 3:47 AM?” because the operating system does not track reads of environment variables. If a secret is compromised, you have no forensic data about when and how the compromise occurred. You cannot determine whether the compromise happened yesterday or six months ago, which means you cannot scope the potential damage.
No rotation support. Rotating a secret stored in an environment variable requires restarting every process that uses that secret. In a microservice environment with 15 services, that means a coordinated restart across all services. If one service does not pick up the new value (because it cached the old one at startup), it breaks. If the rotation is not atomic across all services, there is a window where some services use the old secret and others use the new one, which can cause intermittent authentication failures that are extremely difficult to debug because they depend on which service handles a given request.
No encryption at rest. Environment variables are stored in plaintext in process memory, in container definitions, in CI configuration, and in deployment manifests. Anyone with access to the deployment platform, the container runtime, or the CI system can read every secret. A compromised CI pipeline (which is a common attack vector via supply chain attacks on CI plugins or leaked CI tokens) exposes all secrets used in that pipeline. A developer with access to Kubernetes can run kubectl get secret -o json and see every secret in base64 (not encrypted, just encoded).
Secret sprawl. As the number of services and environments grows, the number of secrets grows multiplicatively. 15 services across 3 environments (dev, staging, prod) with 5 secrets each is 225 secret values to manage. Environment variables provide no organizational structure for this complexity. There is no way to see all secrets for a given service, all services that use a given secret, or all secrets that have not been rotated in 90 days.
The Secrets Management Stack
A proper secrets management system addresses all five of the above problems. The tools we use:
HashiCorp Vault as the central secrets store. Vault provides encrypted storage, fine-grained access policies, an audit log for every read and write, dynamic secret generation, and automatic rotation for supported backends (databases, cloud provider credentials, PKI certificates).
External Secrets Operator (ESO) for Kubernetes. ESO syncs secrets from Vault into Kubernetes Secrets, so applications read secrets through the standard Kubernetes Secrets API without knowing about Vault. This means applications do not need Vault client libraries or Vault-specific configuration. The ESO controller runs as a single pod in the cluster and handles authentication, secret fetching, and synchronization for all services.
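As a concrete illustration, an ExternalSecret manifest for the order service’s database credentials looks roughly like the sketch below. The store name vault-backend, the namespace, and the refresh interval are illustrative, not taken from our cluster.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: order-service-database
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend          # hypothetical ClusterSecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: order-service-database # the Kubernetes Secret that ESO creates and keeps in sync
  data:
    - secretKey: password
      remoteRef:
        key: production/order-service/database
        property: password
```

ESO re-fetches the secret from Vault on the refresh interval and updates the target Kubernetes Secret, so the application only ever talks to the standard Secrets API.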
For teams not on Kubernetes, the equivalent approach is to use Vault Agent, a sidecar process that authenticates to Vault, fetches secrets, and renders them as files on disk that the application reads. The application reads secrets from files rather than environment variables, which provides the same decoupling.
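A minimal Vault Agent configuration for this pattern looks roughly like the following. The auth method, file paths, and template names are illustrative; a real deployment would pick the auth method that matches its platform.

```hcl
# vault-agent.hcl -- illustrative sketch, not our production config
pid_file = "/var/run/vault-agent.pid"

vault {
  address = "https://vault.internal.harborsoftware.com"
}

auto_auth {
  method "approle" {
    config = {
      role_id_file_path   = "/etc/vault/role-id"
      secret_id_file_path = "/etc/vault/secret-id"
    }
  }
  sink "file" {
    config = {
      path = "/var/run/vault-token"
    }
  }
}

# Render secrets to a file that the application reads at startup
template {
  source      = "/etc/vault/templates/database.json.ctmpl"
  destination = "/secrets/database.json"
}
```

The agent authenticates, keeps its token renewed, and re-renders the template whenever the underlying secret changes, so the application never needs Vault-specific code.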
Vault Architecture
# vault-config.hcl
storage "raft" {
  path    = "/vault/data"
  node_id = "vault-1"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/vault/tls/server.crt"
  tls_key_file  = "/vault/tls/server.key"
}

ui                = true
max_lease_ttl     = "768h"
default_lease_ttl = "168h"
We organize secrets by service and environment using a hierarchical path structure:
# Secret paths in Vault
secret/data/production/order-service/database
secret/data/production/order-service/stripe
secret/data/production/user-service/database
secret/data/production/user-service/sendgrid
secret/data/staging/order-service/database
# ...
Access policies restrict each service to only the secrets it needs. This is the principle of least privilege applied to secrets: the order service can read its own database credentials but cannot read the user service’s credentials, even though both run in the same Kubernetes cluster.
# order-service-production.hcl
path "secret/data/production/order-service/*" {
  capabilities = ["read"]
}

# Explicitly deny access to other services' secrets. Vault policies are
# deny-by-default, but explicit deny rules take precedence even if a
# broader policy is later attached to the same token.
path "secret/data/production/user-service/*" {
  capabilities = ["deny"]
}

path "secret/data/staging/*" {
  capabilities = ["deny"]
}
Dynamic Secrets: The Game Changer
The most powerful feature of Vault is dynamic secrets: short-lived credentials generated on demand and automatically revoked after expiration. Instead of a single, long-lived database password shared by all instances of a service, each instance gets its own unique credentials that expire after 1 hour.
# Configure Vault's database secrets engine
vault write database/config/production-db \
    plugin_name=postgresql-database-plugin \
    allowed_roles="order-service-readonly,order-service-readwrite" \
    connection_url="postgresql://{{username}}:{{password}}@db.internal:5432/production" \
    username="vault_admin" \
    password="$VAULT_DB_ADMIN_PASSWORD"

# Define a role that creates read-only credentials
vault write database/roles/order-service-readonly \
    db_name=production-db \
    creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
    default_ttl="1h" \
    max_ttl="24h"
# Application requests credentials (via Vault API or ESO)
vault read database/creds/order-service-readonly
# Returns:
# username: v-order-serv-readonly-abc123
# password: A1b2C3d4E5f6G7h8
# lease_id: database/creds/order-service-readonly/abc123
# lease_duration: 3600
Dynamic secrets eliminate three problems simultaneously. First, there is no long-lived credential to steal. If an attacker compromises a running instance, they get credentials that expire in an hour, limiting the window of exploitation. Second, if credentials are compromised, Vault’s audit log shows exactly which lease was accessed and from which IP address, which narrows the forensic investigation dramatically. Third, credential rotation is automatic and transparent; the application simply requests new credentials when the old ones expire. There is no coordinated restart, no window of inconsistency, no risk of one service using stale credentials.
The operational benefit is significant. Before dynamic secrets, we rotated database passwords quarterly (which is too infrequent but realistic for a small team doing it manually). With dynamic secrets, credentials are effectively rotated every hour with zero operational effort.
Application-Level Integration
Applications should never directly call the Vault API in request-handling code. Secret fetching should happen at startup and be refreshed in the background. Here is our standard pattern for a Node.js service:
import Vault from 'node-vault';

class SecretManager {
  private secrets: Map<string, string> = new Map();
  private leases: Map<string, {
    leaseId: string;
    expiry: number;   // absolute ms timestamp at which the lease expires
    renewAt: number;  // absolute ms timestamp at which 75% of the TTL has elapsed
  }> = new Map();
  private vault: any; // node-vault ships no type definitions

  constructor() {
    this.vault = Vault({
      endpoint: process.env.VAULT_ADDR,
      token: process.env.VAULT_TOKEN,
    });
  }

  async initialize(secretPaths: string[]): Promise<void> {
    for (const path of secretPaths) {
      await this.fetchSecret(path);
    }
    this.startRenewalLoop();
  }

  private async fetchSecret(path: string): Promise<void> {
    const result = await this.vault.read(path);
    // KV v2 nests values under data.data; dynamic secrets put them directly in data
    const data = result.data?.data ?? result.data;
    for (const [key, value] of Object.entries(data)) {
      this.secrets.set(`${path}/${key}`, value as string);
    }
    if (result.lease_id) {
      this.trackLease(path, result.lease_id, result.lease_duration);
    }
  }

  private trackLease(path: string, leaseId: string, durationSec: number): void {
    const now = Date.now();
    this.leases.set(path, {
      leaseId,
      expiry: now + durationSec * 1000,
      renewAt: now + durationSec * 750, // renew once 75% of the TTL has elapsed
    });
  }

  get(path: string, key: string): string {
    const value = this.secrets.get(`${path}/${key}`);
    if (!value) throw new Error(`Secret not found: ${path}/${key}`);
    return value;
  }

  private startRenewalLoop(): void {
    setInterval(async () => {
      for (const [path, lease] of this.leases.entries()) {
        if (Date.now() < lease.renewAt) continue;
        try {
          const renewed = await this.vault.write('sys/leases/renew', {
            lease_id: lease.leaseId,
          });
          this.trackLease(path, lease.leaseId, renewed.lease_duration);
        } catch {
          // Renewal failed (lease expired or Vault unavailable): fetch fresh credentials
          await this.fetchSecret(path);
        }
      }
    }, 60_000);
  }
}

// Usage at application startup
const secrets = new SecretManager();
await secrets.initialize([
  'secret/data/production/order-service/database',
  'secret/data/production/order-service/stripe',
]);

const dbPassword = secrets.get(
  'secret/data/production/order-service/database',
  'password'
);
The renewal loop runs every 60 seconds and renews leases when 75% of the TTL has elapsed. If renewal fails (Vault is temporarily unavailable), it fetches fresh credentials. This approach handles Vault outages gracefully: the application continues operating with its current credentials until they expire, and reconnects to Vault when it becomes available.
Handling Secret Rotation Gracefully
Even with dynamic secrets, some credentials require explicit rotation: third-party API keys, webhook signing secrets, encryption keys, and TLS certificates for services that do not support ACME-based auto-renewal. The challenge is rotating these secrets without downtime.
We use a dual-key rotation strategy for API keys and signing secrets. When rotation is triggered (either manually or by an automated schedule), the system generates a new key and stores it alongside the old key in Vault. Both keys are valid simultaneously during a grace period (typically 24 hours for API keys, 72 hours for encryption keys). All services are updated to use the new key for outgoing requests while still accepting the old key for incoming requests. After the grace period, the old key is revoked.
# Vault stores both current and previous versions
vault kv put secret/production/webhooks/signing \
    current_key="new_key_abc123" \
    previous_key="old_key_xyz789" \
    rotation_started="2025-02-07T10:00:00Z" \
    grace_period_hours=24
// Application reads both keys via the SecretManager shown earlier
function verifyWebhookSignature(payload: Buffer, signature: string): boolean {
  const path = 'secret/data/production/webhooks/signing';
  // Try the current key first, fall back to the previous key
  if (verifyHMAC(payload, signature, secrets.get(path, 'current_key'))) {
    return true;
  }
  let previousKey: string;
  try {
    previousKey = secrets.get(path, 'previous_key');
  } catch {
    return false; // no previous key stored; no rotation in progress
  }
  if (verifyHMAC(payload, signature, previousKey)) {
    metrics.increment('webhook.signature.used_previous_key');
    return true;
  }
  return false;
}
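The rotation trigger itself reduces to a small pure step: demote the current key to previous and mint a fresh current key. This is a sketch under the dual-key scheme described above; the record shape mirrors the Vault entry, the function name is ours, and the caller would persist the result back to Vault (e.g. with vault kv put).

```typescript
import { randomBytes } from 'crypto';

// Mirrors the fields stored at secret/production/webhooks/signing
interface SigningKeys {
  current_key: string;
  previous_key: string | null;
  rotation_started: string;   // ISO 8601 timestamp
  grace_period_hours: number;
}

// Pure rotation step: the old current key becomes the previous key
// (still accepted during the grace period), and a new random key
// becomes current. Persisting the result to Vault is the caller's job.
function rotateSigningKeys(existing: SigningKeys, now: Date = new Date()): SigningKeys {
  return {
    current_key: randomBytes(32).toString('hex'),
    previous_key: existing.current_key,
    rotation_started: now.toISOString(),
    grace_period_hours: 24,
  };
}
```

After the grace period elapses, a follow-up job clears previous_key, which is what actually revokes the old key.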
For encryption keys, rotation is more complex because existing data was encrypted with the old key. We handle this with envelope encryption: data is encrypted with a data encryption key (DEK), and the DEK is encrypted with a key encryption key (KEK) stored in Vault. Rotating the KEK means re-encrypting all DEKs with the new KEK, which is fast (DEKs are small) and does not require re-encrypting the actual data. When a DEK is decrypted, we check which KEK version was used and lazily re-encrypt with the current KEK if needed.
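The lazy re-wrap pattern can be sketched in a few lines with Node's crypto module. The in-memory keks map here is a stand-in for the Vault-held KEKs (in production the wrap/unwrap calls would go through Vault), and all names are illustrative.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

// Stand-in for Vault-held KEKs: version -> 32-byte key
const keks = new Map<number, Buffer>([[1, randomBytes(32)], [2, randomBytes(32)]]);
let currentKekVersion = 1;

interface Envelope {
  kekVersion: number;  // which KEK wraps the DEK
  wrappedDek: Buffer;  // iv || tag || ciphertext of the DEK
  sealedData: Buffer;  // iv || tag || ciphertext of the payload
}

// AES-256-GCM seal: prepend the IV and auth tag to the ciphertext
function aeadSeal(key: Buffer, plaintext: Buffer): Buffer {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ct = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ct]);
}

function aeadOpen(key: Buffer, sealed: Buffer): Buffer {
  const iv = sealed.subarray(0, 12);
  const tag = sealed.subarray(12, 28);
  const ct = sealed.subarray(28);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ct), decipher.final()]);
}

function encrypt(plaintext: Buffer): Envelope {
  const dek = randomBytes(32); // fresh DEK per record
  return {
    kekVersion: currentKekVersion,
    wrappedDek: aeadSeal(keks.get(currentKekVersion)!, dek),
    sealedData: aeadSeal(dek, plaintext),
  };
}

// On decrypt, if the record's KEK is stale, re-wrap only the small DEK
// with the current KEK; the payload ciphertext is never touched.
function decrypt(env: Envelope): Buffer {
  const dek = aeadOpen(keks.get(env.kekVersion)!, env.wrappedDek);
  if (env.kekVersion !== currentKekVersion) {
    env.wrappedDek = aeadSeal(keks.get(currentKekVersion)!, dek);
    env.kekVersion = currentKekVersion;
  }
  return aeadOpen(dek, env.sealedData);
}
```

The design choice to re-wrap lazily (on read) rather than eagerly (in a batch job) spreads the rotation cost over normal traffic; records that are never read can still be migrated by a low-priority sweep.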
TLS certificate rotation is handled by cert-manager in Kubernetes, which monitors certificate expiration and requests new certificates from Let’s Encrypt (or your internal CA) before the current certificate expires. The key detail is that cert-manager updates the Kubernetes Secret containing the certificate, and the ingress controller (we use Nginx Ingress) watches for Secret changes and reloads the certificate without dropping connections. This gives us fully automated TLS certificate rotation with zero downtime and zero manual intervention.
We track rotation compliance with a Vault audit report that runs weekly. The report identifies any static secret that has not been rotated in 90 days and creates a Jira ticket for the owning team. Our current compliance rate is 94%: 6% of static secrets are past their 90-day rotation window at any given time, usually because they require coordination with a third-party vendor who is slow to process key changes. The 90-day threshold is a balance between security (shorter is better) and operational overhead (rotating keys takes time, especially when external parties are involved).
CI/CD Pipeline Secrets
CI/CD pipelines are a frequent source of secret leaks because they combine high privilege (access to production deployment credentials) with high exposure (build logs, artifact storage, third-party integrations). Our CI/CD secrets policy:
- Never store secrets in CI environment variables for production deployments. Use OIDC federation instead. GitHub Actions supports OIDC tokens that can authenticate to Vault, AWS, GCP, and Azure without storing any static credentials in the CI system. The OIDC token is scoped to the specific repository, branch, and workflow, so a compromised workflow in one repository cannot access secrets for another.
- Mask secrets in logs. GitHub Actions automatically masks secrets stored in its secrets store, but it cannot mask secrets fetched at runtime from Vault. We add explicit masking with ::add-mask:: for any secret value that might appear in logs.
- Limit secret scope to the minimum required step. Do not set secrets as environment variables for the entire job. Scope them to the specific step that needs them. A secret available to every step in a job is a secret available to every action in that job, including third-party actions that might log or exfiltrate it.
# .github/workflows/deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write  # Required for OIDC
      contents: read
    steps:
      - uses: actions/checkout@v4
      - name: Authenticate to Vault
        uses: hashicorp/vault-action@v3
        with:
          url: https://vault.internal.harborsoftware.com
          method: jwt
          role: deploy-production
          jwtGithubAudience: https://vault.internal.harborsoftware.com
          secrets: |
            secret/data/production/deploy/kubernetes kubeconfig | KUBECONFIG_CONTENT ;
      - name: Deploy
        run: |
          echo "$KUBECONFIG_CONTENT" > /tmp/kubeconfig
          chmod 600 /tmp/kubeconfig
          kubectl --kubeconfig=/tmp/kubeconfig apply -f k8s/
          rm -f /tmp/kubeconfig
Secrets in Development
The weakest link in secrets management is usually the developer’s local machine. Developers need access to secrets to run applications locally, but local machines are the least controlled environment in the stack. Our approach:
Use a separate Vault namespace for development with credentials that are either fake (for services that can be stubbed locally) or scoped to a sandboxed development environment (for services that require real credentials). Development secrets are never production secrets. The development database is a separate instance with synthetic data. The development Stripe account is a test account. There is no path from a developer’s laptop to production credentials.
Use direnv with .envrc files that fetch secrets from Vault instead of checking .env files into the repository. The .envrc file is gitignored, and it uses the developer’s personal Vault token to fetch secrets dynamically. This means secrets are never stored on disk in plaintext (they exist only in environment variables for the duration of the shell session), and revoking a developer’s Vault access immediately revokes their access to all secrets.
# .envrc (gitignored)
export VAULT_ADDR="https://vault.internal.harborsoftware.com"
export DATABASE_URL=$(vault kv get -field=url secret/dev/order-service/database)
export STRIPE_KEY=$(vault kv get -field=api_key secret/dev/order-service/stripe)
export REDIS_URL=$(vault kv get -field=url secret/dev/order-service/redis)
When a developer leaves the team, we revoke their Vault token, which immediately revokes access to all secrets across all environments. With environment variables stored in .env files, you would need to rotate every secret the developer had access to because you cannot revoke a static credential without changing it.
One operational consideration that is easy to overlook: backup your Vault data independently of your application backups. If Vault’s storage backend (Raft, Consul, or whatever you use) becomes corrupted and you cannot recover it, every service that depends on Vault will be unable to start because it cannot fetch its credentials. We run automated Vault snapshots every 6 hours using vault operator raft snapshot save, stored in a separate storage system (S3 with versioning enabled) that does not depend on Vault for access credentials. The snapshot encryption key is stored in a physical safe deposit box and in a separate cloud KMS, ensuring we can recover Vault even if Vault itself is completely destroyed. This may sound paranoid, but losing your secrets management system means losing access to every system it protects, which is effectively a total outage of all services.
We test our Vault recovery procedure quarterly by restoring a snapshot to a test environment and verifying that all secrets are accessible. The restore process takes approximately 15 minutes, which is our current RTO (recovery time objective) for Vault. If your team depends on a central secrets store, you should know your Vault RTO and test it regularly. The first time you need to recover Vault should not be during a production incident.
Secrets management is infrastructure work that does not produce visible features. It is easy to defer, easy to do halfway, and easy to regret when an incident occurs. If you take one thing from this post, let it be this: environment variables are a secrets transport mechanism, not a secrets management system. The difference matters, and it matters most when something goes wrong.