Buzeli Soluções Digitais

PHP 7.4 → 8.4 in production on ASG with zero downtime: the AMI bake process that works (and the 2 mistakes we learned from)

Published on April 30, 2026

The context: PHP in containers, worker baked into the AMI

The stack ran PHP-FPM in Docker containers on ARM EC2 instances in an Auto Scaling group (ASG). There were three pools: pool0 (port 9000), pool1 (9001), and pool2 (9002). The queue worker also ran as a systemd service inside one of the instances. Everything was baked into the Launch Template AMI.

The task was an upgrade from PHP 7.4.34 to 8.4.12, a jump across a major version with known breaking changes. The development team had already tested the application in a staging environment; our part was executing the upgrade in production without downtime.

An important constraint learned in a previous session: a manual Instance Refresh may only be run during daytime hours. At night the ASG minimum is 1 instance, so any refresh replaces the only active instance and causes about 2 minutes of downtime. During the day, with a minimum of 2 instances and MinHealthyPercentage=100, the ASG replaces one instance at a time and availability is maintained. We documented this in another post about the nighttime deploy incident.
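
That constraint can be turned into a pre-flight guard. The sketch below is illustrative only: `refresh_keeps_capacity` is a hypothetical helper, not part of our tooling, and it assumes the behaviour described above, where the ASG takes one instance out of service at a time.

```python
def refresh_keeps_capacity(in_service: int, replaced_at_once: int = 1) -> bool:
    """Return True if at least one instance keeps serving during a refresh.

    Assumption (from the behaviour described in the post): the ASG takes
    `replaced_at_once` instances out of service before their replacements
    are healthy.
    """
    if in_service < 0 or replaced_at_once < 1:
        raise ValueError("in_service must be >= 0 and replaced_at_once >= 1")
    return in_service - replaced_at_once >= 1


# Night: a single instance, so the refresh would take the only one down.
print(refresh_keeps_capacity(1))  # False
# Day: two instances, one keeps serving while the other is replaced.
print(refresh_keeps_capacity(2))  # True
```

In practice the instance count would come from the ASG itself, and the refresh would be skipped when the guard returns False.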

The AMI bake process: the 8 steps

The standard process for any infrastructure change in this environment is always the same — never modify an instance that's in the ASG receiving ALB traffic:

1. Get AMI and configs from the default Launch Template (LT)
2. Launch an isolated temporary instance from that AMI
   (outside the ASG, outside the ALB target group)
3. Apply changes to the temporary instance and validate
4. Create new AMI from the temporary instance (--no-reboot)
5. Terminate the temporary instance
6. Create new Launch Template version with the new AMI
   (source = current default version, only ImageId changes)
7. Set new version as default in the LT
8. Start Instance Refresh on the ASG
   (MinHealthyPercentage=100, InstanceWarmup=120)

The temporary instance never joins the ALB. All changes are tested on it before becoming an AMI. If something goes wrong, production is unaffected — the temporary instance is simply terminated.

# Launch temporary instance from the default LT AMI
TEMP_ID=$(aws ec2 run-instances --profile app-profile \
  --image-id <ami-lt-default> \
  --instance-type c6g.2xlarge \
  --key-name app-key \
  --security-group-ids <sg-id> \
  --subnet-id <private-subnet> \
  --iam-instance-profile Arn=<iam-profile-arn> \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=app-temp-bake}]' \
  --query 'Instances[0].InstanceId' --output text)

# Wait for both status checks (2/2) to pass; `wait instance-running`
# returns as soon as the state is "running", before the checks finish
aws ec2 wait instance-status-ok --profile app-profile --instance-ids "$TEMP_ID"

# Instance Refresh (execute only during daytime hours)
aws autoscaling start-instance-refresh --profile app-profile \
  --auto-scaling-group-name app-asg \
  --preferences '{"InstanceWarmup":120,"MinHealthyPercentage":100}' \
  --query 'InstanceRefreshId' --output text
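
The commands above cover steps 2 and 8. Steps 4 to 7 could be sketched as follows. This is a minimal illustration, not the exact script we ran: the helper names, the AMI name, and the placeholder IDs are hypothetical. The helpers only build the AWS CLI argument lists, so they can be reviewed or printed before anything is executed.

```python
import json
import shlex

PROFILE = "app-profile"  # same profile name as in the commands above


def aws(*args: str) -> list[str]:
    """Build an argv list for the AWS CLI with the shared profile."""
    return ["aws", *args, "--profile", PROFILE]


def create_image(instance_id: str, name: str) -> list[str]:
    # Step 4: create the AMI from the temporary instance without rebooting it.
    return aws("ec2", "create-image", "--instance-id", instance_id,
               "--name", name, "--no-reboot",
               "--query", "ImageId", "--output", "text")


def wait_image(ami_id: str) -> list[str]:
    # Wait for the AMI to become available before terminating its source
    # instance or launching anything from it.
    return aws("ec2", "wait", "image-available", "--image-ids", ami_id)


def terminate(instance_id: str) -> list[str]:
    # Step 5: the temporary instance is no longer needed.
    return aws("ec2", "terminate-instances", "--instance-ids", instance_id)


def new_lt_version(lt_id: str, source_version: int, ami_id: str) -> list[str]:
    # Step 6: inherit everything from source_version, override only ImageId.
    return aws("ec2", "create-launch-template-version",
               "--launch-template-id", lt_id,
               "--source-version", str(source_version),
               "--launch-template-data", json.dumps({"ImageId": ami_id}),
               "--query", "LaunchTemplateVersion.VersionNumber",
               "--output", "text")


def set_default(lt_id: str, version: int) -> list[str]:
    # Step 7: make the new version the Launch Template default.
    return aws("ec2", "modify-launch-template",
               "--launch-template-id", lt_id,
               "--default-version", str(version))


if __name__ == "__main__":
    # Placeholder IDs, for illustration only.
    for cmd in (create_image("i-0123456789abcdef0", "app-php84-bake"),
                wait_image("ami-0123456789abcdef0"),
                terminate("i-0123456789abcdef0"),
                new_lt_version("lt-0123456789abcdef0", 606, "ami-0123456789abcdef0"),
                set_default("lt-0123456789abcdef0", 607)):
        print(shlex.join(cmd))
```

Passing only `ImageId` in `--launch-template-data` together with `--source-version` is what keeps every other setting identical to the current default.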

Bake 1: the main upgrade

The operator applied the PHP upgrade manually on the temporary instance and validated the containers:

$ docker exec php-fpm-pool0 php --version
PHP 8.4.12 (cli) (built: Aug 28 2025 18:47:43) (NTS)
Zend Engine v4.4.12
  with Zend OPcache v8.4.12

$ docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' | grep php
php-fpm-pool2   php-fpm-custom:8.4.12   Up 3 minutes
php-fpm-pool0   php-fpm-custom:8.4.12   Up 3 minutes
php-fpm-pool1   php-fpm-custom:8.4.12   Up 3 minutes

We created the AMI, terminated the temporary instance, created the new LT version, and started the Instance Refresh at 19:14 -03:00. It completed at 19:22, exactly 8 minutes later. Both production instances were confirmed running PHP 8.4.12.

But when we checked CloudWatch after the refresh, we found that the PHP-FPM error logs weren't arriving. The pools were running, but the log group was silent.

Bake 1 mistake: PHP-FPM logs without write permission

Investigation showed the problem: the '/var/log/php-fpm/' directory belonged to 'apache:root' with permission 'drwxrwx---'. The PHP-FPM process ran as user 'www-data', which was in the 'others' category — no write access to the directory.

The pools had 'php_admin_value[error_log]' configured for '/var/log/php-fpm/pool{0,1,2}.log', but the files couldn't be created because the directory wasn't accessible. PHP errors were silently discarded.

# Problematic state (after Bake 1)
$ ls -la /var/log/ | grep php-fpm
drwxrwx---  2 apache root 4096 Mar 28 10:00 php-fpm/
# www-data has no access (is in "others" = ---)

# Fix required for Bake 2
sudo chgrp www-data /var/log/php-fpm/
sudo chmod g+rwx /var/log/php-fpm/
sudo touch /var/log/php-fpm/pool0.log \
           /var/log/php-fpm/pool1.log \
           /var/log/php-fpm/pool2.log
sudo chown www-data:www-data /var/log/php-fpm/pool*.log

This mistake won't show up in staging if the staging environment has different permissions on the log directory. It is the kind of problem that only appears in production, and only after a bake.
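
To make the permission logic explicit, here is a simplified model of the check the kernel applies when a process tries to create a file in a directory. `can_create_files` is a hypothetical helper written for this post. It compares user and group names for readability, whereas the kernel compares numeric IDs, and it ignores root, ACLs, and capabilities.

```python
import stat


def can_create_files(mode: int, owner: str, group: str,
                     user: str, user_groups: set[str]) -> bool:
    """Approximate the POSIX check for creating a file in a directory.

    Creating a file needs both write (w) and search (x) on the directory.
    Only one permission class applies: owner, else group, else others.
    """
    if user == owner:
        need = stat.S_IWUSR | stat.S_IXUSR
    elif group in user_groups:
        need = stat.S_IWGRP | stat.S_IXGRP
    else:
        need = stat.S_IWOTH | stat.S_IXOTH
    return (mode & need) == need


# State after Bake 1: drwxrwx--- apache:root, process runs as www-data.
print(can_create_files(0o770, "apache", "root", "www-data", {"www-data"}))      # False
# After `chgrp www-data`: drwxrwx--- apache:www-data.
print(can_create_files(0o770, "apache", "www-data", "www-data", {"www-data"}))  # True
```

The second call shows why changing the group alone is enough here: the group bits were already rwx, they just applied to the wrong group.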

Bake 2: the permission and CloudWatch fix

The second temporary instance was launched from the Bake 1 AMI, which already had PHP 8.4.12. The changes were applied in sequence:

1. Fix permissions for /var/log/php-fpm/ (chgrp www-data + chmod g+rwx + create pool*.log files with www-data owner)

2. Update CloudWatch agent configuration to collect the three pool log files

3. Validate actual file writes as the PHP-FPM user before creating the AMI (docker exec -u www-data php-fpm-pool0 sh -c 'echo x >> /var/log/php-fpm/pool0.log')
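
The write validation in step 3 can be extended to all three pool files at once. The following is a sketch under assumptions: `append_probe` is a hypothetical helper, and it only proves anything if it runs as the same user as PHP-FPM (www-data here). Note that opening a file in append mode creates it when it does not exist.

```python
import os
import tempfile


def append_probe(paths: list[str]) -> dict[str, str]:
    """Try to open each path in append mode; return {path: error} for failures."""
    failures = {}
    for path in paths:
        try:
            with open(path, "a"):
                pass  # opening for append is enough; nothing is written
        except OSError as exc:
            failures[path] = exc.strerror or str(exc)
    return failures


if __name__ == "__main__":
    # Demo against a throwaway directory rather than /var/log/php-fpm/.
    with tempfile.TemporaryDirectory() as tmp:
        ok = os.path.join(tmp, "pool0.log")
        missing_dir = os.path.join(tmp, "no-such-dir", "pool1.log")
        print(append_probe([ok, missing_dir]))
```

An empty result means every file could be opened for writing; anything else should block the AMI creation.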

The CloudWatch agent configuration was updated to include the three pools. A critical detail: in this setup, never use the CloudWatch agent 'fetch-config' command to update the configuration. Instead of updating the existing file, it wrote a new file with a duplicated name prefix, and the agent stopped silently. The approach that worked was to write the JSON directly to the existing file and restart via systemctl.

# WRONG: fetch-config destroys the existing configuration
# sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
#   -a fetch-config -m ec2 -s -c file:/path/config.json  # DO NOT USE

# CORRECT: write directly to the file + restart the service
sudo python3 -c "
import json
config = {
    'logs': {
        'logs_collected': {
            'files': {
                'collect_list': [
                    {
                        'file_path': '/var/log/php-fpm/pool0.log',
                        'log_group_name': 'php-fpm-errors',
                        'log_stream_name': '{instance_id}-pool0'
                    },
                    {
                        'file_path': '/var/log/php-fpm/pool1.log',
                        'log_group_name': 'php-fpm-errors',
                        'log_stream_name': '{instance_id}-pool1'
                    },
                    {
                        'file_path': '/var/log/php-fpm/pool2.log',
                        'log_group_name': 'php-fpm-errors',
                        'log_stream_name': '{instance_id}-pool2'
                    }
                ]
            }
        }
    }
}
path = '/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.d/file_amazon-cloudwatch-agent.json'
with open(path, 'w') as f:
    json.dump(config, f, indent=2)
print('OK')
"
sudo systemctl restart amazon-cloudwatch-agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status
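
The status check on the last line can act as a gate before creating the AMI. The sketch below assumes the status action prints a JSON object with a `status` field whose value is `running` when the agent is healthy; confirm that against your agent version. `agent_running` is a hypothetical helper and the sample payloads are illustrative, not captured from this environment.

```python
import json


def agent_running(status_output: str) -> bool:
    """Return True only if the status JSON reports a running agent."""
    try:
        data = json.loads(status_output)
    except json.JSONDecodeError:
        return False  # empty or non-JSON output is treated as "not running"
    return isinstance(data, dict) and data.get("status") == "running"


# Sample payloads (assumed shape, for illustration only).
print(agent_running('{"status": "running", "configstatus": "configured"}'))  # True
print(agent_running('{"status": "stopped"}'))                                # False
print(agent_running(""))                                                     # False
```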

The Bake 2 Instance Refresh started at 19:34 and completed at 19:41, another 7 minutes. PHP-FPM logs began arriving in CloudWatch under the streams '{instance_id}-pool0', '{instance_id}-pool1', and '{instance_id}-pool2'.

Final state: the Launch Template history

At the end of the session, the Launch Template history documented the path taken:

| Version | Contents                                     | Status          |
|---------|----------------------------------------------|-----------------|
| v606    | PHP 7.4.34 (previous base)                   | Discarded       |
| v607    | PHP 8.4.12 (Bake 1, missing log permissions) | Discarded       |
| v608    | PHP 8.4.12 + permissions + CW logs (Bake 2)  | CURRENT DEFAULT |

After the refresh, production ran two ARM instances on PHP 8.4.12 with error logs arriving in CloudWatch. We kept monitoring actively in the following hours to catch any PHP 8 compatibility errors the application might generate in production.

The 2 mistakes and what they teach

Mistake 1: /var/log/php-fpm/ permissions

The log directory had been created with 'apache:root' ownership at some earlier point. The PHP-FPM process (www-data) was never able to write to it. The error was silent: no failure message, just an absence of logs.

Lesson: when configuring any log directory for a process that runs as a different user than the directory owner, explicitly validate that the process can write before baking. The test is simple: run 'docker exec -u www-data php-fpm-pool0 sh -c "echo x >> /var/log/php-fpm/pool0.log"' and verify there's no permission error.

Mistake 2: fetch-config destroys the CloudWatch agent configuration

The 'amazon-cloudwatch-agent-ctl -a fetch-config' command doesn't update the configuration — it creates a new file with a duplicate name ('file_file_name.json'), can leave the original file with 0 bytes, and the agent stops silently. The status shows as 'stopped' with no clear error message.

Lesson: to update the CloudWatch agent configuration, write directly to the existing JSON file and restart the service via systemctl. Validate with '-a status' before creating the AMI.
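
The symptoms described above (a duplicated 'file_' prefix and a 0-byte file) are easy to scan for before baking. This is a minimal sketch, assuming the files in the agent's configuration directory are JSON; `config_problems` is a hypothetical helper, and the demo runs against a temporary directory rather than the real path.

```python
import json
import os
import tempfile


def config_problems(config_dir: str) -> list[str]:
    """List suspicious files in a CloudWatch agent config directory."""
    problems = []
    for name in sorted(os.listdir(config_dir)):
        path = os.path.join(config_dir, name)
        if not os.path.isfile(path):
            continue
        if name.startswith("file_file_"):
            problems.append(f"{name}: duplicated 'file_' prefix")
        if os.path.getsize(path) == 0:
            problems.append(f"{name}: empty file")
            continue  # nothing to parse
        try:
            with open(path, encoding="utf-8") as f:
                json.load(f)
        except ValueError:  # covers invalid JSON and undecodable bytes
            problems.append(f"{name}: not valid JSON")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "file_empty.json"), "w"):
            pass  # simulate the 0-byte file
        with open(os.path.join(d, "file_file_dup.json"), "w") as f:
            f.write("{}")  # simulate the duplicated-prefix file
        for problem in config_problems(d):
            print(problem)
```

An empty list, together with a running agent, is a reasonable precondition for creating the AMI.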

Why 2 bakes is normal, not a sign of a problem

Each bake cycle reveals a layer the previous one didn't anticipate. The first bake covers the main change. The second covers what validating the first revealed. In major version upgrades, two cycles are the norm, not the exception. The process exists precisely to absorb these discoveries before they affect production.

The total cost: two Instance Refreshes of 7 to 8 minutes each and two temporary instances that existed for under 20 minutes. No downtime. No lost transactions.