Late-night deploy took down the fintech: why ASG min=1 and always-on CI/CD are incompatible

Published on April 22, 2026

The client reported: 'the app went down and came back on its own'

The fintech woke up to a notification at 00:13 BRT: the app had been down for a few minutes and recovered on its own. No alarm configured for 5xx errors, no unhealthy instance alert — the discovery came from the user themselves.

In the next morning's investigation, the root cause was clear in under 30 minutes by cross-referencing three sources: CloudWatch ALB metrics, Auto Scaling Group events, and the GitHub Actions run history.

The exact incident timeline

All times below are in BRT (UTC-3):

20:34 (Mar 10) — Previous deploy ran without issues. At that time the ASG still operated with 2 instances.

21:00 — Scheduled Action asg-scale-nighttime reduces min capacity from 2 to 1.

21:01 — One instance is terminated. ASG goes from 2 to 1 instance (i-abc1234f).

00:02 (Mar 11) — Dev pushes to main branch: 'add producer 50% commission'. GitHub Actions starts the pipeline.

00:11:01 — Pipeline creates a new Launch Template version with the updated AMI.

00:11:09 — Instance Refresh starts. The only running instance is removed from the target group and replaced.

00:11 — 00:13 — ALB has no healthy targets. 92 HTTP 5xx errors returned, 49 of them in the first minute and 42 in the second.

00:13:16 — New instance completes warmup and becomes healthy. Service restored.

01:06 (Mar 11) — Second push: 'adjusting invoice emission'. New deploy — but this time the instance had already been replaced, so the Instance Refresh caused no new downtime.

Two deploys in the same night. The first one took the service down. The second passed without impact — by luck, not by design.

Why MinHealthyPercentage did not protect

The Instance Refresh was configured with MinHealthyPercentage: 100, which in theory keeps 100% of the instances healthy during the replacement. With only 1 instance, that setting cannot be honored.

As configured here, replacing the only instance in the ASG means taking it out of service before its replacement is ready. There is no way to keep 100% of 1 instance healthy during the swap: capacity is either one or zero, and while the new instance warms up the ALB has no targets.
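The arithmetic behind this can be sketched in a few lines of shell. This is a simplified model, not AWS's exact algorithm: it assumes the refresh tries to keep ceil(desired × MinHealthyPercentage / 100) instances in service, and that it still replaces at least one instance at a time so the refresh can make progress. The function name serving_during_swap is made up for illustration.

```shell
#!/bin/sh
# Simplified model (an assumption, not AWS's exact algorithm):
# - the refresh tries to keep ceil(desired * pct / 100) instances in service
# - it still takes at least one instance out at a time, or it could never progress
# Prints how many instances keep serving traffic while one batch is replaced.
serving_during_swap() {
  desired=$1
  pct=$2
  # Integer ceiling of desired * pct / 100
  min_healthy=$(( (desired * pct + 99) / 100 ))
  batch=$(( desired - min_healthy ))
  # Force progress: replace at least one instance per batch
  if [ "$batch" -lt 1 ]; then
    batch=1
  fi
  echo $(( desired - batch ))
}

serving_during_swap 1 100   # nighttime: 1 instance
serving_during_swap 2 100   # daytime: 2 instances
```

Under these assumptions the first call prints 0 and the second prints 1: with a single instance nothing is left to serve traffic during the swap, while with two instances one keeps serving.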

# Instance Refresh configuration at the time of the incident
MinHealthyPercentage: 100   # protective only with 2+ instances
InstanceWarmup: 120         # 2-minute warmup (the actual downtime was ~2 minutes)
AutoRollback: false         # no automatic rollback

The 120-second InstanceWarmup is consistent with the incident: the downtime lasted 2 minutes and 7 seconds (127 seconds), from 00:11:09 to 00:13:16 BRT, which is the warmup time of the new instance plus a few seconds.
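The gap between the two timestamps can be checked with a little shell arithmetic. The helper to_seconds is made up for this sketch and only handles times within the same day.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
# awk reads the fields as decimal numbers, so leading zeros are harmless.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

start=$(to_seconds 00:11:09)   # Instance Refresh starts
end=$(to_seconds 00:13:16)     # new instance becomes healthy
downtime=$(( end - start ))
warmup=120                     # InstanceWarmup from the configuration

echo "downtime: ${downtime}s, warmup: ${warmup}s, extra: $(( downtime - warmup ))s"
```

This prints a downtime of 127 seconds against a 120-second warmup, leaving 7 seconds not covered by the warmup itself.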

The 'nighttime cost optimization' trap

The Scheduled Action reducing the ASG to min=1 at 21:00 is a legitimate optimization. An idle c6g.2xlarge instance costs money that nobody needs to spend when traffic is low. The monthly savings from this configuration can reach $150-200.

The problem is not the optimization itself. The problem is that the CI/CD pipeline has no awareness of the current state of the ASG. It does not know whether 1 or 4 instances are running. It fires the Instance Refresh regardless — at any time, under any condition.

Reduced cost + always-on CI/CD = a recurring downtime window every night between 21:00 and 07:00 BRT. Any deploy in that window that replaces the single running instance takes the service down for the length of the warmup.

Evidence in the logs

To reproduce the diagnosis and confirm the root cause in any ASG with this pattern:

# 1. Check ASG events during the incident window
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name <asg-name> \
  --max-items 20

# 2. Check recent Instance Refreshes
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name <asg-name> \
  --max-records 5

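# 3. Check ALB 5xx errors per minute
# The window is 00:00-01:00 BRT on Mar 11, expressed in UTC.
# The LoadBalancer dimension takes the final part of the ALB ARN
# (app/<lb-name>/<lb-id>), not the full ARN.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/<lb-name>/<lb-id> \
  --start-time 2026-03-11T03:00:00Z \
  --end-time 2026-03-11T04:00:00Z \
  --period 60 --statistics Sum

# 4. Check HealthyHostCount in the same period
# The TargetGroup dimension likewise takes the final part of the
# target group ARN (targetgroup/<tg-name>/<tg-id>).
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HealthyHostCount \
  --dimensions Name=TargetGroup,Value=targetgroup/<tg-name>/<tg-id> \
             Name=LoadBalancer,Value=app/<lb-name>/<lb-id> \
  --start-time 2026-03-11T03:00:00Z \
  --end-time 2026-03-11T04:00:00Z \
  --period 60 --statistics Minimum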

The HealthyHostCount data will show the drop from 1 to 0 hosts in the minute of the Instance Refresh — confirming the ALB had no targets. UnhealthyHostCount stays at 0 throughout because the instance was removed from the target group (not marked unhealthy), which makes standard 'unhealthy hosts' monitoring useless for detecting this type of incident.
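To pick out the zero-host minutes without reading the raw JSON, the HealthyHostCount command above can be piped through a small filter. The sketch below assumes the command is run with --query 'Datapoints[].[Timestamp,Minimum]' --output text appended, which yields one tab-separated "timestamp, minimum" line per datapoint. The function name zero_host_minutes and the sample datapoints are made up for illustration.

```shell
#!/bin/sh
# Reads "timestamp<TAB>minimum" lines on stdin and prints, in time order,
# the timestamps whose minimum healthy host count was 0.
# CloudWatch does not guarantee datapoint order, hence the sort.
zero_host_minutes() {
  awk 'NF >= 2 && $2 + 0 == 0 { print $1 }' | sort
}

# Offline demo with made-up datapoints shaped like the CLI text output
printf '2026-03-11T03:10:00+00:00\t1.0\n2026-03-11T03:12:00+00:00\t0.0\n2026-03-11T03:11:00+00:00\t0.0\n' \
  | zero_host_minutes
```

With the sample input, only the 03:11 and 03:12 UTC timestamps are printed, in that order. An empty result for a real incident window would suggest the ALB never ran out of targets.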

Two remediation options

Option 1 (zero cost): block deploys during nighttime hours

The simplest approach: add a condition to the GitHub Actions workflow that aborts the pipeline if the current time is within the risk window.

# .github/workflows/ci-cd-main.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Check deploy window
        run: |
          HOUR=$(TZ=America/Sao_Paulo date +%H)
          # Block between 21:00 and 07:00 BRT
          if [ "$HOUR" -ge 21 ] || [ "$HOUR" -lt 7 ]; then
            echo "Deploy blocked: 21:00-07:00 BRT is the reduced-capacity window"
            echo "ASG operates with min=1 during this period. Re-run the workflow during business hours."
            exit 1
          fi

This option has zero cost and forces the team to defer nighttime deploys. The trade-off is that real emergencies may need a manual bypass — which is acceptable if there is a defined process for that.
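One way to make the emergency bypass explicit is to pull the window check into a function that also takes an override flag. This is only a sketch: the function name deploy_allowed and the ALLOW_NIGHT_DEPLOY variable are hypothetical, not part of the original workflow, and how the override gets set (for example, a manually triggered run) is left to the team's process.

```shell
#!/bin/sh
# Returns 0 (allowed) or 1 (blocked) for a given hour of day (0-23).
# The second argument is an explicit override, "true" or "false".
deploy_allowed() {
  hour=$1
  override=${2:-false}
  if [ "$hour" -ge 21 ] || [ "$hour" -lt 7 ]; then
    # Inside the reduced-capacity window: only an explicit override passes
    if [ "$override" = "true" ]; then
      return 0
    fi
    return 1
  fi
  return 0
}

# ${HOUR#0} strips a leading zero ("07" becomes "7")
HOUR=$(TZ=America/Sao_Paulo date +%H)
if deploy_allowed "${HOUR#0}" "${ALLOW_NIGHT_DEPLOY:-false}"; then
  echo "deploy allowed"
else
  echo "deploy blocked: ASG runs a single instance between 21:00 and 07:00 BRT"
fi
```

In the workflow step, the else branch would end with exit 1 so that the pipeline stops. Keeping the decision in a function that takes the hour as an argument also makes the rule easy to test without waiting for nighttime.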

Option 2 (full protection): check DesiredCapacity before Instance Refresh

A more surgical approach: the pipeline checks the ASG DesiredCapacity before starting the Instance Refresh. If it is 1, the pipeline aborts or waits for a scale-up before proceeding.

# Pipeline step — verify ASG capacity before deploy
- name: Check ASG capacity
  run: |
    DESIRED=$(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names <asg-name> \
      --query 'AutoScalingGroups[0].DesiredCapacity' \
      --output text)

    if [ "$DESIRED" -lt 2 ]; then
      echo "ERROR: ASG has only $DESIRED instance(s). Instance Refresh would cause downtime."
      echo "Wait for the daytime scale-up or run the deploy manually with 2+ instances."
      exit 1
    fi

    echo "ASG has $DESIRED instances — deploy safe to proceed."

This option is more robust because it works regardless of the time of day — it protects against any condition where the ASG has insufficient capacity, not just during the nighttime window.
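The "waits for a scale-up" variant can be sketched as a small polling loop. The function wait_for_capacity and the ASG_NAME variable are hypothetical names for this example. The capacity command is passed in as an argument so the loop can be exercised with a stub instead of a real AWS call.

```shell
#!/bin/sh
# Polls a capacity command until it reports at least MIN instances.
# Arguments: MIN ATTEMPTS SLEEP_SECONDS COMMAND...
# Returns 0 once capacity is sufficient, 1 if it never is.
wait_for_capacity() {
  min=$1; attempts=$2; pause=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    current=$("$@")
    # Treat empty or non-numeric output (e.g. "None") as zero capacity
    case $current in ''|*[!0-9]*) current=0 ;; esac
    if [ "$current" -ge "$min" ]; then
      echo "capacity $current >= $min, safe to proceed"
      return 0
    fi
    echo "attempt $i/$attempts: capacity $current < $min"
    i=$(( i + 1 ))
    if [ "$i" -le "$attempts" ]; then
      sleep "$pause"
    fi
  done
  return 1
}

# In the pipeline, the command would wrap the AWS call from the step above:
#   asg_capacity() {
#     aws autoscaling describe-auto-scaling-groups \
#       --auto-scaling-group-names "$ASG_NAME" \
#       --query 'AutoScalingGroups[0].DesiredCapacity' --output text
#   }
#   wait_for_capacity 2 10 30 asg_capacity || exit 1

# Demo with a stub in place of the AWS call (always reports 2 instances)
fake_capacity() { echo 2; }
wait_for_capacity 2 3 0 fake_capacity
```

Polling only makes sense when a scale-up is expected soon, since CI jobs have execution time limits. For a deploy attempted hours before the daytime scale-up, aborting is likely the better choice.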

The lesson

The incident was not anyone's fault. It was a collision of two reasonable decisions (reducing instances at night to save money, and running continuous CI/CD on main) that created a silent conflict. In this setup the ASG never notified the pipeline, and the pipeline never queried the ASG. Without an explicit mechanism connecting the two, downtime was only a matter of time.

MinHealthyPercentage: 100 protects when you have 2 instances. With 1, it is security theater. The only real protection is preventing the deploy from happening when there is no redundancy.

The missing CloudWatch alarm was a lesson too: this incident was discovered by the user, not by an alert. With a basic alarm on HTTPCode_ELB_5XX_Count > 10, the team would have been notified within a few minutes of the first errors.
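A basic alarm of this kind could be created with aws cloudwatch put-metric-alarm, as in the sketch below. The alarm name, the SNS topic used for notification, and the APPLY switch are assumptions made for this example. By default the function only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Prints (or, with APPLY=1, runs) a put-metric-alarm call for ALB 5xx errors.
# Arguments:
#   $1  LoadBalancer dimension value, i.e. app/<lb-name>/<lb-id>
#   $2  ARN of the SNS topic to notify
create_5xx_alarm() {
  lb_dimension=$1
  sns_topic=$2
  # Fires when more than 10 ELB-generated 5xx responses occur in one minute.
  # The metric is not reported when there are no errors, so missing data
  # is treated as not breaching.
  set -- aws cloudwatch put-metric-alarm \
    --alarm-name alb-5xx-errors \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_ELB_5XX_Count \
    --dimensions "Name=LoadBalancer,Value=$lb_dimension" \
    --statistic Sum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 10 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$sns_topic"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Dry run with placeholder values
create_5xx_alarm 'app/<lb-name>/<lb-id>' '<sns-topic-arn>'
```

Since this incident left the ALB with zero targets rather than unhealthy ones, a second alarm on HealthyHostCount dropping below 1 would probably be a useful complement.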

Related articles

CURLOPT_TIMEOUT = 0: the infinite timeout freezing the payment gateway for 60 seconds

504 with no high CPU, no queue, no RDS: when the infrastructure is green but the payment gateway stopped responding

PHP 7.4 → 8.4 in production on ASG with zero downtime: the AMI bake process that works (and the 2 mistakes we learned from)

Resque queue with 1,480 stuck jobs: diagnosing job accumulation when the worker is alive and Redis is healthy