131 PHP-FPM crashes in 10 minutes: how a crawler locked the server via systemd-coredump (not via PHP)
Published on April 27, 2026
The high CPU alert that wasn't a CPU attack
Grafana fired a high CPU alert on a WordPress publisher. I connected to the server and ran the initial diagnosis.
ps aux --sort=-%cpu | head -12
The output showed 7 `systemd-coredump` processes at the top, collectively consuming 47% CPU. No PHP-FPM processes with high CPU. No abnormal nginx processes. The 'CPU attack' was the crash handler, not PHP.
When you see high CPU on systemd-coredump, the incident has already happened. systemd is compressing the remains of what crashed. The real question is: what generated 131 core dumps in 10 minutes?
The 131 core dumps and 4 GB of disk
Checking the core dump directory confirmed the scale of the problem:
ls -lh /var/lib/systemd/coredump/ | grep 'php-fpm' | wc -l
# 131
du -sh /var/lib/systemd/coredump/
# 4.0G
ls -lh /var/lib/systemd/coredump/ | grep 'php-fpm' | head -5
# core.php-fpm.33.abc1234.1234567890.xz 31M
# core.php-fpm.33.abc1235.1234567891.xz 31M
# core.php-fpm.33.abc1236.1234567892.xz 31M
131 core dump files, each approximately 31 MB after compression by systemd-coredump, totaling 4 GB. All with UID 33, which is www-data, the PHP-FPM worker user. Disk usage had jumped from 39% to 50% in under 10 minutes.
systemd-coredump compresses each core dump in the background using zstd, lz4, or xz, depending on how systemd was built (xz on this server, as the .xz extension shows). With 131 crashes in 10 minutes, up to 7 compression processes were running in parallel, consuming nearly half the server's CPU capacity while working through the dump queue.
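To see at a glance which executable and which UID produced the dumps, the file names can be grouped and counted. The following is a minimal sketch: `summarize_dumps` is a helper name invented for this example, the sample file names are made up, and it assumes the usual `core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>.<ext>` naming.

```shell
# Count core dumps per (executable, UID) from systemd-coredump file names.
# Assumed layout: core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>.<ext>
# Caveat: a comm that itself contains dots (e.g. php-fpm8.2) shifts the fields.
summarize_dumps() {
  awk -F. '$1 == "core" { n[$2 " uid=" $3]++ }
           END { for (k in n) print n[k], k }' | sort -rn
}

# Demo with made-up file names. On a real host, feed it:
#   ls /var/lib/systemd/coredump/ | summarize_dumps
printf '%s\n' \
  core.php-fpm.33.abc1234.1001.1234567890.xz \
  core.php-fpm.33.abc1234.1002.1234567891.xz \
  core.cron.0.abc1234.1003.1234567892.xz | summarize_dumps
# prints:
#   2 php-fpm uid=33
#   1 cron uid=0
```

A single dominant line, such as php-fpm with uid=33, is the signal that one service is crash-looping rather than the whole host being unstable.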
The root cause: crawler from range <crawler-range>/24
Investigation in the nginx error.log revealed the pattern:
grep '<crawler-prefix>' /var/log/nginx/error.log | head -10
# 2026/03/03 14:20:15 [error] upstream timed out (110) GET /post-about-dogs client: <crawler-IP-1>
# 2026/03/03 14:20:17 [error] upstream timed out (110) GET /another-post client: <crawler-IP-1>
# 2026/03/03 14:20:19 [error] upstream timed out (110) GET /tag/animals client: <crawler-IP-1>
# ...95 timeout entries from the same range
# 2026/03/03 14:33:04 [error] ModSecurity: Access denied 403 GET /.env client: <crawler-IP-2>
Starting at 14:20, a crawler from the `<crawler-range>/24` range launched a parallel sweep against WordPress posts, tag pages (/tag/), and category pages (/categoria/). The site used GoCache CDN, but tag and category URLs had a cold cache, so each request reached PHP-FPM uncached.
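When the offending range is not known in advance, the same log can be ranked by client /24. This is a sketch under assumptions: `timeouts_by_range` is a hypothetical helper, the sample entries use documentation addresses (203.0.113.0/24 and 198.51.100.0/24), and it expects IPv4 clients in the full nginx error-log layout, where each entry carries a `client: <ip>,` field.

```shell
# Rank client /24 ranges by number of "upstream timed out" entries.
# Reads an nginx error log on stdin; IPv4 only.
timeouts_by_range() {
  grep 'upstream timed out' |
    sed -n 's|.*client: \([0-9]*\.[0-9]*\.[0-9]*\)\.[0-9]*.*|\1.0/24|p' |
    sort | uniq -c | sort -rn |
    awk '{ print $1, $2 }'
}

# Demo with made-up entries. On a real host:
#   timeouts_by_range < /var/log/nginx/error.log
timeouts_by_range <<'EOF'
2026/03/03 14:20:15 [error] 1#1: *1 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.7, server: example.test, request: "GET /tag/animals HTTP/1.1"
2026/03/03 14:20:17 [error] 1#1: *2 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 203.0.113.9, server: example.test, request: "GET /tag/dogs HTTP/1.1"
2026/03/03 14:21:02 [error] 1#1: *3 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 198.51.100.4, server: example.test, request: "GET /another-post HTTP/1.1"
EOF
# prints:
#   2 203.0.113.0/24
#   1 198.51.100.0/24
```

Grouping by /24 rather than by single IP matters here because the crawler spread its requests across several addresses in the same range.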
Why PHP-FPM crashed
The WordPress publisher ran on an 8-vCPU server with PHP-FPM configured with multiple workers. Under a bombardment of parallel requests to uncached pages, workers began colliding: several workers tried to generate the same heavy tag page at once and ran out of memory before finishing. A core dump is only written when a process is killed by a fatal signal such as SIGSEGV. Reaching PHP's memory_limit on its own just raises a PHP fatal error and does not dump core, so the 131 dumps point to workers crashing outright under memory pressure rather than hitting the limit cleanly.
The /.env access attempt at 14:33 confirmed the crawler's profile: it wasn't a legitimate indexing bot. It was reconnaissance scanning with credential exposure attempts.
The degradation loop
The sequence that led to load 15.18 on an 8-vCPU server:
14:20 - Crawler starts parallel requests to /tag/ and /category/
14:20-14:30 - PHP-FPM workers exhaust memory and crash (131 occurrences)
14:20+ - systemd-coredump generates 131 files of ~31 MB each = 4 GB on disk
14:30 - Load peaks at 15.18 (about 30x the normal ~0.5 for this server)
14:31 - Crawler backs off / workers stop crashing
14:38 - Load drops to 0.82
14:44 - Load normalized: 0.02
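To put the load figures in this timeline in proportion to the hardware, a load average can be divided by the number of CPUs. The helper below is a small sketch; `load_per_cpu` is a name made up for this example.

```shell
# Express a load average relative to the number of CPUs.
# Above 1.00 means more runnable (or I/O-blocked) tasks than CPUs.
load_per_cpu() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load_per_cpu 15.18 8   # peak from the timeline; prints 1.90
load_per_cpu 0.82 8    # 14:38 reading; prints 0.10

# On a live Linux host:
#   load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

At the peak, the 8-vCPU server had close to two tasks competing for every CPU. Eight minutes later it was almost idle.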
The response actions
1. Clear the core dumps (priority: free CPU and disk)
With 7 compression processes still running, the first action was to break the cycle and free resources:
# Remove all php-fpm core dumps
sudo rm -f /var/lib/systemd/coredump/core.php-fpm.*
# Result:
# Disk: 50% → 39%
# systemd-coredump CPU: 47% → 0%
# Load began dropping immediately
2. Block the crawler range
# Block the attacker range via iptables
sudo iptables -I INPUT -s <CRAWLER_CIDR> -j DROP -m comment --comment 'crawler-ban: php-fpm crash 2026-03-03'
# Verify the rule was applied
sudo iptables -L INPUT -n | grep '<crawler-prefix>'
The server normalized completely within 6 minutes after clearing the dumps and banning the range. Load went from 15.18 to 0.02, PHP-FPM was back to 10 healthy workers, and the site was responding in 0.32s.
The correct diagnosis vs the initial diagnosis
The alert said 'high CPU'. The natural initial diagnosis would be: volumetric attack, stuck process, PHP consuming CPU. None of those were right.
# Process state during the incident (reconstructed from logs)
# Top 5 by CPU at peak (14:30):
#
# PID USER %CPU COMMAND
# 12341 systemd 9.2 systemd-coredump ← compressing dump 1
# 12342 systemd 8.8 systemd-coredump ← compressing dump 2
# 12343 systemd 8.7 systemd-coredump ← compressing dump 3
# 12344 systemd 7.9 systemd-coredump ← compressing dump 4
# 12345 systemd 7.4 systemd-coredump ← compressing dump 5
# ...
# No php-fpm processes with high CPU, because they had already crashed
The high CPU was in the crash handler, not in PHP. This is a common misdiagnosis pattern: systemd-coredump processes dumps in the background and appears at the top of ps/top as if it were the attacker, when in reality it's the cleanup service. The attacker has already finished its work.
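Because no single systemd-coredump process stands out, it helps to sum CPU per command name instead of reading the list process by process. This is a minimal sketch; `cpu_by_command` is a hypothetical helper, and the demo feeds it the five values from the table above plus one invented nginx line.

```shell
# Sum %CPU per command name so many small processes of one kind show as one line.
# Input: lines of "<pcpu> <command>", e.g. from: ps -eo pcpu,comm --no-headers
# Note: the kernel truncates comm to 15 characters, so on a live host
# systemd-coredump is listed as "systemd-coredum".
cpu_by_command() {
  awk '{ cpu = $1; $1 = ""; sub(/^ +/, ""); total[$0] += cpu; count[$0]++ }
       END { for (c in total) printf "%.1f %d %s\n", total[c], count[c], c }' |
    sort -rn
}

# Demo: total %CPU, process count, command.
printf '%s\n' \
  '9.2 systemd-coredump' '8.8 systemd-coredump' '8.7 systemd-coredump' \
  '7.9 systemd-coredump' '7.4 systemd-coredump' '1.3 nginx' | cpu_by_command
# prints:
#   42.0 5 systemd-coredump
#   1.3 1 nginx
```

Seen this way, the crash handler's aggregate share is obvious even though each individual process uses under 10% CPU.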
Why cold cache on /tag/ and /category/ was decisive
GoCache CDN was configured in forward mode for these URLs. Under normal conditions, only the few tag and category pages that users visit organically are requested, so the cache for those pages stays warm. A crawler requesting hundreds of unique tag URLs in parallel finds no cache: each URL is new to the CDN.
With cold cache, each request reached PHP-FPM. A WordPress tag page can be heavy — multiple database queries (posts in the tag, sidebar, related content), PHP rendering a full template. Under 50+ parallel requests to different tags, memory pressure on workers is proportional to the number of simultaneous requests.
The combination of aggressive crawler + cold cache + enabled systemd-coredump creates a silent failure that looks like a CPU attack but is actually a crash cascade. The server isn't overloaded by traffic — it's overloaded cleaning up the crash aftermath.
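Since memory pressure scales with the number of workers running at once, a common rule of thumb is to cap the worker count at the memory budget divided by the typical size of one worker. The sketch below shows only the arithmetic. `max_children` is a made-up helper, and the figures are hypothetical, not measurements from the server in this incident.

```shell
# Rule-of-thumb ceiling for PHP-FPM's pm.max_children:
#   (RAM reserved for PHP-FPM) / (typical resident memory of one worker), rounded down.
max_children() {
  awk -v budget_mb="$1" -v worker_mb="$2" 'BEGIN { print int(budget_mb / worker_mb) }'
}

# Hypothetical figures: 4096 MB reserved for PHP-FPM, ~128 MB per worker under load.
max_children 4096 128   # prints 32

# Per-worker resident memory in KB can be sampled on a live host with, for example:
#   ps -C php-fpm -o rss=
# (the process name varies by distribution, e.g. php-fpm8.2)
```

With a ceiling like this in place, excess requests queue or time out at nginx instead of pushing every worker into memory exhaustion at the same moment.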
Missing protections and what to implement
This server had ModSecurity active (the /.env access was blocked with 403), but the following protections were absent or disabled:
Bot mitigation on GoCache: status false. With bot mitigation active, the crawler would have been blocked at the edge before reaching the origin.
Rate limiting on GoCache: status false. Per-IP request throttling at the edge is the first line of defense against aggressive crawlers.
CrowdSec bouncer: Docker container running, but host bouncer inactive. CrowdSec detects the sweep pattern and bans automatically — without an active bouncer, detection doesn't produce blocking.
PHP-FPM pm.max_children: no limit configured to prevent a crash loop from exhausting resources. In addition, setting MaxUse in coredump.conf limits the disk impact of core dumps.
# Limit total core dump size in systemd
# /etc/systemd/coredump.conf
# (comments must be on their own lines; systemd does not support trailing comments)
[Coredump]
Storage=external
Compress=yes
ProcessSizeMax=2G
ExternalSizeMax=2G
# max 1 GB total in /var/lib/systemd/coredump/
MaxUse=1G
# keep at least 1 GB free on the filesystem
KeepFree=1G
# No reload or reboot is needed: systemd-coredump reads this file each time it handles a crash.
With MaxUse=1G, systemd-coredump automatically discards the oldest dumps once the limit is reached, so a 131-crash burst cannot fill the disk. The cap applies to stored dumps. Each new dump is still written and compressed, so this setting limits disk impact rather than compression CPU.
Final state and lesson
Six minutes after identifying the real cause, the server was normalized. The sequence:
rm -f /var/lib/systemd/coredump/core.php-fpm.* — 4 GB freed, CPU normalized
iptables -I INPUT -s <crawler-range>/24 -j DROP — crawler blocked
The server returned to its pre-incident state with no restart required. The healthy PHP-FPM workers that hadn't crashed continued serving traffic normally throughout the entire response process.
When ps/top shows systemd-coredump at the top with high CPU, don't try to kill systemd-coredump. Identify the process that crashed (the dump's UID points to the user), find out what caused the crash, and only then clean up the dumps. Killing the handler without understanding the cause leaves the disk full and the incident undiagnosed.
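Mapping the UID in a dump's file name back to an account name is a one-line lookup. A minimal sketch, with `uid_to_user` as a hypothetical helper name:

```shell
# Resolve a numeric UID (as found in a core dump file name) to a user name.
uid_to_user() {
  getent passwd "$1" | cut -d: -f1
}

uid_to_user 0    # prints root
# On the server in this post, `uid_to_user 33` would print www-data.
```

The user name usually identifies the service directly. Here, www-data pointed straight at the PHP-FPM workers.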
