Resque queue with 1,480 stuck jobs: diagnosing job accumulation when the worker is alive and Redis is healthy
Published on April 29, 2026
The alert that never came
A fintech client reported that the delivery tracking queue was accumulating. The screenshot showed the picture:
| Queue | Jobs | Status |
|------------|-------|--------------|
| postbacks | 0 | Empty |
| payments | 0 | Empty |
| test_queue | 0 | Empty |
| tracking | 1480 | Accumulating |
| default | 18 | Accumulating |
| sales | 7 | Processing |

First check: the worker process was alive on both ASG instances. The systemd service reported 'active (running)'. Redis (ElastiCache) responded normally. No errors in the logs. From the monitors' perspective, everything was green.
A queue accumulating with a live worker and healthy Redis is one of the most treacherous situations in queue systems. No alarm is flashing, no dead process, no stack trace. The only evidence is the job count growing silently.
The diagnosis: simple math revealed the structural problem
The worker was configured to process 6 queues in a single sequential process. The 'tracking' queue fires jobs that make one HTTP request to the postal tracking API per order — a blocking network call, no parallelism.
The math was straightforward:
```
Trigger (EventBridge): every 15 minutes
Jobs queued per cycle: ~1,480
Available workers: 2 (1 per ASG instance, baked into the AMI)
Time per job: ~1.5s (1 HTTP request to external API)

Time to drain 1,480 jobs with 2 workers:
  1,480 jobs / 2 workers (1 job at a time each) = 740 jobs per worker
  740 × 1.5s = 1,110s ≈ 18.5 minutes

Next trigger arrives in: 15 minutes
Deficit per cycle: about +3.5 minutes of accumulation
```
The worker could never finish one cycle before the next one started. The queue wasn't stuck; it was accumulating structurally, with each 15-minute cycle adding to the backlog left by the previous one.
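The arithmetic above can be reproduced with a short Python sketch. The function names are illustrative and not part of the client's codebase; the inputs are the figures quoted in this post, and the model assumes each worker handles one job at a time with no overhead between jobs.

```python
import math


def drain_minutes(jobs: int, workers: int, seconds_per_job: float) -> float:
    """Minutes needed for `workers` sequential workers to drain `jobs` jobs."""
    jobs_per_worker = math.ceil(jobs / workers)
    return jobs_per_worker * seconds_per_job / 60


def deficit_minutes(jobs: int, workers: int, seconds_per_job: float,
                    interval_minutes: float) -> float:
    """Positive result = work still pending when the next trigger fires."""
    return drain_minutes(jobs, workers, seconds_per_job) - interval_minutes


drain = drain_minutes(1480, 2, 1.5)
deficit = deficit_minutes(1480, 2, 1.5, 15)
print(f"drain: {drain:.1f} min, deficit per cycle: {deficit:+.1f} min")
# prints: drain: 18.5 min, deficit per cycle: +3.5 min
```

Any positive deficit means the backlog grows every cycle, regardless of how healthy the worker and Redis look.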
Why the other queues suffered too
The single-threaded worker served all 6 queues in a fixed order: postbacks → payments → test_queue → tracking → default → sales. When a worker reached 'tracking' with 1,480 jobs queued, it stayed there for ~18 minutes making blocking HTTP calls.
During that time, new jobs in 'default' and 'sales' queues were waiting. There was no critical starvation in this case because the other queues had low volume, but the pattern is dangerous: a slow queue with blocking I/O holds up all queues that come after it in the sequence.
This is head-of-line blocking in queue systems: the slowest job at the front of the processing sequence blocks everything behind it, even if the subsequent jobs are fast.
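A simplified model makes the effect concrete. The sketch below assumes a single worker that fully drains earlier queues before touching later ones and that no new jobs arrive in the meantime; the backlog split and the 0.1s cost for fast jobs are illustrative assumptions, not measurements from the incident.

```python
def wait_before_first_job(queue_order, backlog, seconds_per_job):
    """Seconds each queue waits before its first job starts, for one worker
    that drains the queues strictly in `queue_order`."""
    waits = {}
    elapsed = 0.0
    for name in queue_order:
        waits[name] = elapsed
        elapsed += backlog.get(name, 0) * seconds_per_job.get(name, 0.0)
    return waits


order = ["postbacks", "payments", "test_queue", "tracking", "default", "sales"]
backlog = {"tracking": 740, "default": 9, "sales": 4}   # one worker's share (assumed split)
cost = {"tracking": 1.5, "default": 0.1, "sales": 0.1}  # 0.1s for fast jobs is assumed

waits = wait_before_first_job(order, backlog, cost)
for name in ("tracking", "default", "sales"):
    print(f"{name}: first job starts after {waits[name] / 60:.1f} min")
```

In this model the fast jobs in 'default' and 'sales' wait more than 18 minutes even though each of them would take a fraction of a second to run.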
How to identify the pattern: diagnostic commands
To reach this diagnosis, the path was to check worker state, queue depth, and logs in sequence:
```shell
# 1. Workers registered in Redis
redis-cli -h <redis-endpoint> smembers resque:workers
# Output: hostname:PID:queue1,queue2,...,queueN
# A single worker listing all queues = single-thread

# 2. Depth of each queue
redis-cli -h <redis-endpoint> llen resque:queue:tracking
redis-cli -h <redis-endpoint> llen resque:queue:default
redis-cli -h <redis-endpoint> llen resque:queue:sales

# 3. Failed jobs (count, then the first 10 entries; 0 -1 would dump the whole list)
redis-cli -h <redis-endpoint> llen resque:failed
redis-cli -h <redis-endpoint> lrange resque:failed 0 9

# 4. Worker process on the host
ps aux | grep resque | grep -v grep
sudo systemctl status app-worker.service
```
The key point: a worker registered as 'hostname:PID:postbacks,payments,test_queue,tracking,default,sales' is processing all those queues sequentially, in a single thread. If any queue has slow jobs, all the others wait.
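The same check can be scripted. The sketch below parses worker identifiers in the 'hostname:PID:queues' format shown above and flags workers that listen on more than one queue. The helper names and the sample hostnames are hypothetical; in a real run the list would come from the members of the 'resque:workers' set.

```python
def parse_worker_id(worker_id: str):
    """Split 'hostname:PID:queue1,queue2' into (hostname, pid, [queues])."""
    host, pid, queues = worker_id.split(":", 2)
    return host, int(pid), queues.split(",")


def shared_workers(worker_ids):
    """Return the workers that serve more than one queue from a single loop."""
    flagged = []
    for worker_id in worker_ids:
        host, pid, queues = parse_worker_id(worker_id)
        if len(queues) > 1:
            flagged.append((host, pid, queues))
    return flagged


# Hypothetical sample; real values come from: smembers resque:workers
sample = [
    "ip-10-0-1-12:4321:postbacks,payments,test_queue,tracking,default,sales",
    "ip-10-0-1-12:4400:tracking",
]

for host, pid, queues in shared_workers(sample):
    print(f"{host} pid {pid} serves {len(queues)} queues from one loop: {queues}")
```

A flagged worker is not a bug in itself; it becomes one when any of the listed queues carries slow, blocking jobs.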
The observability gap: no queue depth alarm
The biggest operational problem wasn't the accumulation itself — it was not knowing it was accumulating. The environment had CPU, memory and HTTP error alarms, but no queue depth alarm in CloudWatch.
To publish queue metrics to CloudWatch via a periodic collection script:
```python
#!/usr/bin/env python3
"""Publish Resque queue depths to CloudWatch; meant to be run periodically."""
from datetime import datetime, timezone

import boto3
import redis

r = redis.Redis(host='<redis-endpoint>', port=6379)
cw = boto3.client('cloudwatch', region_name='us-east-1')
queues = ['tracking', 'default', 'sales', 'postbacks']

for queue in queues:
    # Resque keeps each queue as a Redis list, so LLEN is the queue depth
    depth = r.llen(f'resque:queue:{queue}')
    cw.put_metric_data(
        Namespace='App/Queues',
        MetricData=[{
            'MetricName': 'QueueDepth',
            'Dimensions': [{'Name': 'QueueName', 'Value': queue}],
            'Value': depth,
            'Unit': 'Count',
            'Timestamp': datetime.now(timezone.utc),
        }]
    )
    print(f'{queue}: {depth} jobs')
```
With metrics in CloudWatch, the alarm is trivial: depth > 500 jobs in the 'tracking' queue for more than 10 minutes → alert. Without that alarm, accumulation is only discovered when someone manually checks the dashboard.
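One way to express that alarm is through the CloudWatch `put_metric_alarm` API. The sketch below only builds the keyword arguments so it can run without AWS credentials. The alarm name, the 5-minute period, the 'Minimum' statistic and the missing-data policy are assumptions chosen for illustration, and they presume the collection script publishes at least once per period.

```python
def queue_depth_alarm(queue: str, threshold: int = 500, minutes: int = 10) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    period = 300  # seconds per evaluation window (assumed)
    return {
        "AlarmName": f"queue-depth-{queue}",  # hypothetical naming scheme
        "Namespace": "App/Queues",
        "MetricName": "QueueDepth",
        "Dimensions": [{"Name": "QueueName", "Value": queue}],
        # Minimum: every sample in the window must exceed the threshold,
        # so a short spike does not trigger the alarm
        "Statistic": "Minimum",
        "Period": period,
        "EvaluationPeriods": max(1, minutes * 60 // period),
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumed policy: a silent collector is treated as a problem too
        "TreatMissingData": "breaching",
    }


params = queue_depth_alarm("tracking")
print(params["AlarmName"], params["EvaluationPeriods"], "x", params["Period"], "s")

# To create the alarm (needs boto3 and credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**params, AlarmActions=["<sns-topic-arn>"])
```

Keeping the parameters in a plain function makes it easy to create one alarm per queue with different thresholds.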
The solutions: from quick fix to correct
Option 1 — Dedicated worker per slow queue (no code changes)
The fastest solution for production: create a separate systemd service for the slow queue, with the QUEUE environment variable set to process only that queue.
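```ini
# /etc/systemd/system/app-worker-tracking.service
[Unit]
Description=App Queue Worker (postal tracking only)
After=docker.service
Requires=docker.service

[Service]
Type=simple
Restart=always
RestartSec=10
Environment="QUEUE=tracking"
Environment="REDIS_HOST=<redis-endpoint>"
Environment="REDIS_PORT=6379"
# Environment= only reaches the docker CLI on the host; pass the values
# into the container explicitly with -e so the PHP worker can see them
ExecStart=/usr/bin/docker exec -e QUEUE=${QUEUE} -e REDIS_HOST=${REDIS_HOST} -e REDIS_PORT=${REDIS_PORT} myapp-php php app/Workers/resque-worker.php

[Install]
WantedBy=multi-user.target
```
With 2 ASG instances and 1 dedicated worker per instance, each worker still handles 740 jobs at ~1.5s each, so a full cycle takes ~18.5 minutes. The dedicated worker removes the blocking for the other queues, but on its own it does not bring the drain time under the 15-minute interval. Raising the ASG minimum from 2 to 3 instances cuts the load to about 493 jobs per worker, or ~12 minutes per cycle, which fits with some headroom.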
Option 2 — Job-level parallelism (code change)
The correct medium-term solution: batch the orders when enqueuing and use curl_multi_exec (PHP) or asyncio/aiohttp (Python) to fire N requests in parallel inside a single job.
With batches of 10 and real parallelism: 148 jobs instead of 1,480, each job making 10 requests in parallel. Total time with 2 workers: under 2 minutes for a full cycle.
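A minimal Python sketch of the batching idea, using asyncio. The `fetch_tracking` coroutine is a stand-in that simulates the external call with a sleep; in a real job it would be replaced by an HTTP request through a client such as aiohttp. The function names and the simulated latency are assumptions for illustration.

```python
import asyncio


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def fetch_tracking(order_id, latency=0.05):
    """Stand-in for one HTTP call to the tracking API (simulated with sleep)."""
    await asyncio.sleep(latency)
    return order_id, "ok"


async def process_batch(order_ids):
    """One job: start every request in the batch concurrently, wait for all."""
    return await asyncio.gather(*(fetch_tracking(o) for o in order_ids))


orders = list(range(1480))
batches = chunk(orders, 10)  # one enqueued job per batch
results = asyncio.run(process_batch(batches[0]))
print(len(batches), "jobs;", len(results), "results from the first batch")
# prints: 148 jobs; 10 results from the first batch
```

Because the requests in a batch overlap, a batch takes roughly as long as its slowest request rather than the sum of all ten. A production version would also need per-request timeouts and error handling so that one failed order does not fail the whole batch.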
The pattern applies beyond Resque
This diagnosis applies to any queue system where a worker processes multiple queues sequentially with blocking I/O:
Sidekiq (Ruby): a process serving multiple queues from one shared thread pool. If slow external API jobs occupy every thread, jobs in the other queues wait.
BullMQ (Node.js): a worker left at the default concurrency of 1, on a queue whose jobs await HTTP calls, processes them one at a time. Same problem, different runtime.
AWS SQS + Lambda: the problem largely disappears by design, since Lambda scales consumers horizontally by default, but SQS queues polled by a single-threaded EC2 consumer reproduce the pattern.
The rule is simple: if your worker makes external network calls and processes multiple queues in the same thread, any queue with a slow API will hold up all the others. Queue depth monitoring is not optional — it's the only signal that arrives before accumulation becomes an incident.
Operational lesson
Three configuration changes that eliminate this class of problem:
1. Dedicated worker per queue with slow external I/O. Never mix blocking external API queues with fast-processing queues in the same worker.
2. Queue depth alarm in CloudWatch. Threshold: any main queue with depth > N for more than 1 trigger cycle → immediate alert.
3. ASG minimum compatible with job volume. If each instance runs 1 worker and a job cycle needs W workers to drain within the trigger interval, the ASG minimum must be at least W.
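The sizing rule in item 3 reduces to a single division. The sketch below uses this incident's figures; the function name and the `headroom` factor are illustrative assumptions, and the model assumes one worker per instance handling one job at a time.

```python
import math


def min_workers(jobs_per_cycle: int, seconds_per_job: float,
                interval_seconds: float, headroom: float = 1.0) -> int:
    """Smallest worker count that drains one cycle within the trigger interval.

    headroom > 1 reserves slack: 1.25 keeps 20% of the interval free.
    """
    total_work = jobs_per_cycle * seconds_per_job * headroom
    return math.ceil(total_work / interval_seconds)


print(min_workers(1480, 1.5, 15 * 60))        # 3
print(min_workers(1480, 1.5, 15 * 60, 1.25))  # 4
```

With 2 workers the cycle cannot fit in 15 minutes; 3 is the floor, and 4 leaves room for slower API responses or a larger batch of orders.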
