Site returning 500 with healthy WAF: when the nginx backend disappears and the reverse proxy goes blind

Published on May 2, 2026

The symptom: 500 with apparently healthy infrastructure

Around 08:09 UTC the server restarted. From then until 13:35 UTC, almost five and a half hours, every visitor to a high-traffic content portal received HTTP 500. No alert fired for the WAF or the nginx container: both were running.

The server architecture has two distinct nginx layers running on the same host:

# Request flow
CDN (Akamai) → WAF container (network_mode: host, port 443)
             → proxy_pass https://172.28.5.61:443
             → nginx backend (bridge network, IP 172.28.5.61)
             → fastcgi → PHP-FPM

The WAF runs in network_mode host — it uses the VM's network interface directly, sees the real IPs of Akamai edge nodes, and proxies to the nginx backend which lives on an internal bridge network (172.28.5.61). PHP-FPM also runs in network_mode host to share the Unix socket with the nginx backend.

When the outer layer is healthy and the inner layer fails silently, traditional monitoring won't detect it. The WAF healthcheck returns 200. The inner layer that disappeared is invisible to anyone monitoring from outside.

The diagnosis: vhost with 0 bytes

The first check was the WAF error.log. The message was direct:

connect() failed (111: Connection refused) while connecting to upstream https://172.28.5.61:443/

The WAF was trying to proxy to the nginx backend at 172.28.5.61:443 and receiving connection refused. The nginx container was running — the process existed. But nginx was not listening on port 443 for that specific domain.

The cause became clear when checking the vhost file:

# Check vhost file size on the host
ls -la /etc/nginx/sites-enabled/blog.cliente-exemplo.com.br

# Output:
-rw-r--r-- 1 root root 0 Apr 13 08:09 blog.cliente-exemplo.com.br

Zero bytes. The vhost configuration file was completely empty. When nginx reloads with an empty vhost configuration file, it simply does not create the corresponding server block — no listen 443, no proxy_pass, nothing. nginx continues working normally for all other configured domains. But for this specific domain, port 443 does not exist.

The WAF, upon receiving the Akamai request for that domain, attempted the configured proxy_pass and found the port closed. The WAF itself still accepted Akamai's connection, so the refusal happened one hop further in: the WAF answered with an error, which Akamai surfaced as ERR_READ_ERROR in visitors' browsers.

Why the vhost became empty

The server had restarted at 08:09 UTC. Backups of both vhost files existed on the host under /etc/nginx, created during a previous maintenance session:

# Backups available on the host
/etc/nginx/blog.cliente-exemplo.com.br.bak.20260319       # nginx vhost
/etc/nginx/blog.cliente-exemplo.com.br.waf-bak.20260319   # WAF vhost

The most likely hypothesis is that a maintenance operation during or before the restart overwrote the vhost file with empty content — either an accidental truncation or a failed deployment mid-write. The corresponding WAF file (waf-enabled/blog.cliente-exemplo.com.br) was intact — which explains why the WAF itself was healthy and accepting connections on port 443, but failing when trying to pass requests to the backend.
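If the cause was indeed a write that failed midway, one common mitigation is to stage the new file next to the live one and rename it into place, so the live vhost is either the old content or the complete new content. The sketch below is a minimal illustration under that assumption; write_vhost_atomic and the temporary file name are hypothetical and not part of the tooling described here.

```shell
#!/bin/sh
# Sketch: install a vhost so that a failed write never leaves a 0-byte file.
# write_vhost_atomic is a hypothetical helper name (assumption).
write_vhost_atomic() {
    src="$1"    # freshly rendered config
    dest="$2"   # live vhost path, e.g. /etc/nginx/sites-enabled/<domain>

    # Refuse to install an empty or missing source file.
    if [ ! -s "$src" ]; then
        echo "refusing to install empty or missing file: $src" >&2
        return 1
    fi

    # Stage the copy in the destination directory so the final mv is a
    # rename on the same filesystem. The name starts with a dot; check
    # that your nginx include pattern does not pick up dotfiles.
    tmp="$(mktemp "$(dirname "$dest")/.vhost.XXXXXX")" || return 1

    if cp "$src" "$tmp" && [ -s "$tmp" ]; then
        chmod 644 "$tmp"       # mktemp creates the file with mode 0600
        mv -f "$tmp" "$dest"   # replaces the live file in one step
    else
        rm -f "$tmp"
        return 1
    fi
}
```

A deploy step could call it as `write_vhost_atomic rendered.conf /etc/nginx/sites-enabled/blog.cliente-exemplo.com.br` before testing and reloading nginx.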

A healthy WAF masks a broken backend. From the CDN's perspective, the origin is responding — just with 502 or connection refused instead of 200. For external monitoring, the origin is the WAF, not the nginx backend.

The fix: restore from backup

The fix was straightforward — restore the vhost from backup and reload the nginx backend:

# Restore nginx vhost from backup
cp /etc/nginx/blog.cliente-exemplo.com.br.bak.20260319 /etc/nginx/sites-enabled/blog.cliente-exemplo.com.br

# Verify file is no longer empty
wc -c /etc/nginx/sites-enabled/blog.cliente-exemplo.com.br
# Expected output: something like "4321 /etc/nginx/sites-enabled/blog.cliente-exemplo.com.br"

# Test nginx config before reload
docker exec nginx nginx -t

# Reload nginx (no downtime)
docker exec nginx nginx -s reload

At 13:35 UTC, after the reload, the site was returning 200 again. The WAF started receiving valid responses from the backend, and visitors stopped seeing the error.

The architecture lesson: monitor each layer from the previous layer's perspective

This incident exposes a classic blind spot in multi-layer architectures: the healthcheck monitors the outermost layer, but does not validate that inner layers are working correctly.

In the diagram below, each arrow represents a dependency that can fail silently:

# Dependency chain — each layer can fail without the previous one knowing
[External monitoring] → checks if WAF responds on port 443
[WAF]                 → checks if nginx backend responds at 172.28.5.61:443
[nginx backend]       → checks if PHP-FPM processes fastcgi
[PHP-FPM]             → checks if database responds

# What was being monitored:
[External monitoring] → WAF port 443: OK (WAF responds, but returns 500 for the domain)

# What should be monitored:
[WAF]                 → nginx backend: FAIL (connection refused at 172.28.5.61:443)

The critical point is that monitoring needs to validate backend health from the WAF's perspective, not just WAF health from the external perspective. A healthcheck that only confirms the WAF answers on port 443, or that hits a path the WAF serves by itself, can stay green while every request for the real domain returns 500.
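A probe from the WAF's perspective could look roughly like the sketch below. It assumes the backend address and domain used in this article; the function names and the OK/DEGRADED/DOWN labels are illustrative choices, not an existing tool.

```shell
#!/bin/sh
# Defaults taken from the article's example; override via environment.
BACKEND_URL="${BACKEND_URL:-https://172.28.5.61/}"
BACKEND_HOST="${BACKEND_HOST:-blog.cliente-exemplo.com.br}"

# Map the status printed by curl -w '%{http_code}' to a verdict.
# The three labels are hypothetical names for this sketch.
classify_backend_status() {
    case "$1" in
        000)     echo "DOWN" ;;      # curl got no HTTP response (e.g. connection refused)
        2??|3??) echo "OK" ;;
        *)       echo "DEGRADED" ;;  # backend answered, but with an error status
    esac
}

# Probe the backend the way the WAF reaches it: internal IP plus Host header.
probe_backend() {
    code="$(curl -sk --max-time 5 -o /dev/null -w '%{http_code}' \
        -H "Host: $BACKEND_HOST" "$BACKEND_URL")" || true
    classify_backend_status "${code:-000}"
}
```

Run from the host where the WAF lives, `probe_backend` would print DOWN for the failure described here, since curl reports 000 when the connection is refused.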

How to detect this pattern before the incident

Three simple checks that detect the problem before visitors report it:

# 1. Check that all vhosts have non-zero size
find /etc/nginx/sites-enabled/ -maxdepth 1 -type f -empty
# If any file is returned: ALERT — empty vhost

# 2. Check if nginx backend is listening on expected ports
docker exec nginx ss -tlnp | grep :443
# If empty for a domain that should be active: PROBLEM

# 3. Test internal proxy directly (bypasses WAF)
curl -sk -H "Host: blog.cliente-exemplo.com.br" https://172.28.5.61/ -o /dev/null -w "%{http_code}\n"
# If it prints 000 (curl could not connect, e.g. connection refused): nginx backend is not listening

The curl check directly on the nginx backend's internal IP is particularly useful because it simulates exactly what the WAF does when receiving a request — and fails in the same way, making the problem immediately visible.
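The first check can be wrapped in a small function with a meaningful exit status, so a cron job or monitoring agent can alert on it. This is a sketch only; check_empty_vhosts is a hypothetical name and the default path is the one used in this article.

```shell
#!/bin/sh
# Sketch: exit non-zero when any vhost file in the directory is zero bytes.
# check_empty_vhosts is a hypothetical helper name (assumption).
check_empty_vhosts() {
    dir="${1:-/etc/nginx/sites-enabled}"

    # Same find expression as check 1: regular files of size zero.
    empty="$(find "$dir" -maxdepth 1 -type f -empty)"

    if [ -n "$empty" ]; then
        echo "ALERT: empty vhost file(s):"
        echo "$empty"
        return 1
    fi
    echo "OK: no empty vhost files in $dir"
}
```

If sites-enabled holds symlinks rather than regular files, `-type f` will not match them as written; following links (for example with `find -L`) would be needed in that layout.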

What was corrected after the incident

Backups kept outside sites-enabled: .bak files must never live inside sites-enabled or waf-enabled — nginx loads all files from those directories and generates conflicting server_name warnings.

Internal backend monitoring: added a healthcheck that curls 172.28.5.61 directly with the Host header. It detects a missing listen 443 on the backend before visitors start receiving errors from the WAF.

Empty vhost check on deploy: any pipeline that writes files to sites-enabled must verify non-zero size before reloading nginx.
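The deploy-time check could be sketched as a guard around the reload. The size check matters on its own because an empty included file is likely to pass `nginx -t`: an empty file is still syntactically valid. The function name and the two command variables are assumptions for illustration; their defaults are the commands shown earlier in this article.

```shell
#!/bin/sh
# Commands default to the ones used in the article; override as needed.
NGINX_TEST_CMD="${NGINX_TEST_CMD:-docker exec nginx nginx -t}"
NGINX_RELOAD_CMD="${NGINX_RELOAD_CMD:-docker exec nginx nginx -s reload}"

# Sketch: reload only if no vhost is empty and the config test passes.
# guarded_reload is a hypothetical helper name (assumption).
guarded_reload() {
    dir="${1:-/etc/nginx/sites-enabled}"

    # Abort if any vhost file is zero bytes.
    if [ -n "$(find "$dir" -maxdepth 1 -type f -empty)" ]; then
        echo "abort: empty vhost file in $dir" >&2
        return 1
    fi

    # Abort if nginx rejects the configuration.
    # The variables are left unquoted on purpose so they split into words.
    $NGINX_TEST_CMD || { echo "abort: config test failed" >&2; return 1; }

    $NGINX_RELOAD_CMD
}
```

A pipeline would call `guarded_reload` as its final step instead of invoking the reload directly.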

Layered infrastructure increases resilience — but also increases the distance between the failure point and the detection point. Each layer needs visibility into the health of the layer it depends on, not just its own health.

