SRE

Resque queue with 1,480 stuck jobs: diagnosing job accumulation when the worker is alive and Redis is healthy

Published on April 29, 2026

The alert that never came

A fintech client reported that the delivery tracking queue was accumulating. The screenshot showed the picture:

| Queue      | Jobs  | Status       |
|------------|-------|--------------|
| postbacks  |     0 | Empty        |
| payments   |     0 | Empty        |
| test_queue |     0 | Empty        |
| tracking   |  1480 | Accumulating |
| default    |    18 | Accumulating |
| sales      |     7 | Processing   |

First check: the worker process was alive on both ASG instances. The systemd service reported 'active (running)'. Redis (ElastiCache) responded normally. No errors in the logs. From the monitors' perspective, everything was green.

A queue accumulating behind a live worker and a healthy Redis is one of the most treacherous situations in queue systems. No alarm is flashing, no process is dead, and there is no stack trace. The only evidence is the job count growing silently.

The diagnosis: simple math revealed the structural problem

The worker was configured to process 6 queues in a single sequential process. The 'tracking' queue fires jobs that make one HTTP request to the postal tracking API per order — a blocking network call, no parallelism.

The math was straightforward:

Trigger (EventBridge): every 15 minutes
Jobs queued per cycle: ~1,480
Available workers: 2 (1 per ASG instance, baked into the AMI)
Time per job: ~1.5s (1 HTTP request to external API)

Time to drain 1,480 jobs with 2 workers:
  1,480 jobs / 2 workers = 740 jobs per worker (1 job at a time)
  740 × 1.5s = 1,110s ≈ 18.5 minutes

Next trigger arrives in: 15 minutes
Deficit per cycle: ~3.5 minutes of accumulation

The worker could never finish one cycle before the next one started. The queue wasn't stuck; it was accumulating structurally, falling a little further behind with every 15-minute cycle.
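The same arithmetic can be kept as a small script and rerun whenever volumes change. This is a minimal sketch under simplifying assumptions: jobs split evenly between workers, every job takes the same time, and each trigger enqueues exactly the same number of jobs. The function names are illustrative, not part of any Resque tooling.

```python
def drain_seconds(jobs: int, workers: int, seconds_per_job: float) -> float:
    """Seconds for `workers` sequential workers to drain `jobs` (even split)."""
    jobs_per_worker = -(-jobs // workers)  # ceiling division
    return jobs_per_worker * seconds_per_job


def backlog_after(cycles: int, jobs_per_cycle: int, workers: int,
                  seconds_per_job: float, interval_seconds: float) -> int:
    """Jobs still queued when the next trigger fires, after `cycles` intervals."""
    # Jobs the whole fleet can finish inside one trigger interval.
    capacity = int(workers * interval_seconds / seconds_per_job)
    backlog = 0
    for _ in range(cycles):
        backlog = max(0, backlog + jobs_per_cycle - capacity)
    return backlog


if __name__ == "__main__":
    minutes = drain_seconds(1480, 2, 1.5) / 60
    print(f"drain time: {minutes:.1f} min")  # 18.5 min against a 15 min interval
    print("backlog after 1 hour:", backlog_after(4, 1480, 2, 1.5, 900))
```

With the numbers from this incident, two workers clear 1,200 jobs per interval, so roughly 280 jobs are left over on every cycle under these assumptions.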

Why the other queues suffered too

The single-thread worker processed all 6 queues in sequential order: postbacks → payments → test_queue → tracking → default → sales. When it reached 'tracking' with 1,480 jobs, the worker got stuck there for ~18 minutes making blocking HTTP calls.

During that time, new jobs in 'default' and 'sales' queues were waiting. There was no critical starvation in this case because the other queues had low volume, but the pattern is dangerous: a slow queue with blocking I/O holds up all queues that come after it in the sequence.

This is head-of-line blocking in queue systems: the slowest job at the front of the processing sequence blocks everything behind it, even if the subsequent jobs are fast.
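The effect is easy to reproduce in a toy model. The sketch below assumes the worker always reserves the next job from the first non-empty queue in its configured order, which is what the behavior described above implies. The per-job costs for the fast queues are invented for illustration.

```python
from collections import deque

ORDER = ["postbacks", "payments", "test_queue", "tracking", "default", "sales"]
# Seconds per job; 1.5s for tracking comes from the incident, the rest are assumed.
COST = {name: 0.05 for name in ORDER}
COST["tracking"] = 1.5


def make_queues(tracking_jobs: int, default_jobs: int) -> dict:
    """Build empty queues, then load 'tracking' and 'default' with dummy jobs."""
    queues = {name: deque() for name in ORDER}
    queues["tracking"].extend(range(tracking_jobs))
    queues["default"].extend(range(default_jobs))
    return queues


def reserve(queues: dict, order: list):
    """Return (queue_name, job) from the first non-empty queue in `order`."""
    for name in order:
        if queues[name]:
            return name, queues[name].popleft()
    return None


def wait_until_first(queues: dict, order: list, cost: dict, target: str):
    """Seconds of work a single worker does before starting a `target` job."""
    elapsed = 0.0
    while (picked := reserve(queues, order)) is not None:
        name, _job = picked
        if name == target:
            return elapsed
        elapsed += cost[name]
    return None


if __name__ == "__main__":
    # One worker's share of the tracking backlog: 740 jobs.
    print(wait_until_first(make_queues(740, 1), ORDER, COST, "default"))
```

In this model a single 'default' job waits 1,110 seconds behind one worker's share of 'tracking'. Listing 'default' ahead of 'tracking' removes the wait for that queue, but it does nothing for the tracking throughput itself.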

How to identify the pattern: diagnostic commands

To reach this diagnosis, the path was to check worker state, queue depth, and logs in sequence:

# 1. Workers registered in Redis
redis-cli -h <redis-endpoint> smembers resque:workers
# Output: hostname:PID:queue1,queue2,...,queueN
# A single worker listing all queues = single-thread

# 2. Depth of each queue
redis-cli -h <redis-endpoint> llen resque:queue:tracking
redis-cli -h <redis-endpoint> llen resque:queue:default
redis-cli -h <redis-endpoint> llen resque:queue:sales

# 3. Failed jobs
redis-cli -h <redis-endpoint> llen resque:failed
redis-cli -h <redis-endpoint> lrange resque:failed 0 -1

# 4. Worker process on the host
ps aux | grep resque | grep -v grep
sudo systemctl status app-worker.service

The key point: a worker registered as 'hostname:PID:postbacks,payments,test_queue,tracking,default,sales' is processing all those queues sequentially, in a single thread. If any queue has slow jobs, all the others wait.
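That check can be automated against the output of smembers. The following sketch parses worker IDs in the hostname:PID:queues format shown above and flags any worker that serves a known slow queue together with other queues. The hostname and PID in the example are made up.

```python
def parse_worker_id(worker_id: str) -> dict:
    """Split 'hostname:PID:queue1,queue2' into its three parts."""
    # rsplit keeps any colons inside the hostname part intact.
    host, pid, queue_list = worker_id.rsplit(":", 2)
    return {"host": host, "pid": int(pid), "queues": queue_list.split(",")}


def shared_workers(worker_ids, slow_queue: str) -> list:
    """Workers that serve `slow_queue` alongside at least one other queue."""
    flagged = []
    for worker_id in worker_ids:
        worker = parse_worker_id(worker_id)
        if slow_queue in worker["queues"] and len(worker["queues"]) > 1:
            flagged.append(worker)
    return flagged


if __name__ == "__main__":
    # Hypothetical worker ID; real ones come from `smembers resque:workers`.
    ids = ["ip-10-0-1-12:4312:postbacks,payments,test_queue,tracking,default,sales"]
    for worker in shared_workers(ids, "tracking"):
        print(worker["host"], worker["pid"], "shares tracking with", len(worker["queues"]) - 1, "queues")
```

When reading the set through redis-py, create the client with decode_responses=True so the members arrive as strings rather than bytes.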

The observability gap: no queue depth alarm

The biggest operational problem wasn't the accumulation itself — it was not knowing it was accumulating. The environment had CPU, memory and HTTP error alarms, but no queue depth alarm in CloudWatch.

To publish queue metrics to CloudWatch via a periodic collection script:

#!/usr/bin/env python3
import boto3
import redis
import time

r = redis.Redis(host='<redis-endpoint>', port=6379)
cw = boto3.client('cloudwatch', region_name='us-east-1')

queues = ['tracking', 'default', 'sales', 'postbacks']

for queue in queues:
    depth = r.llen(f'resque:queue:{queue}')
    cw.put_metric_data(
        Namespace='App/Queues',
        MetricData=[{
            'MetricName': 'QueueDepth',
            'Dimensions': [{'Name': 'QueueName', 'Value': queue}],
            'Value': depth,
            'Unit': 'Count',
            'Timestamp': time.time()
        }]
    )
    print(f'{queue}: {depth} jobs')

With metrics in CloudWatch, the alarm is trivial: depth > 500 jobs in the 'tracking' queue for more than 10 minutes → alert. Without it, accumulation is only discovered when someone manually checks the dashboard.
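One possible shape for that alarm, using boto3's put_metric_alarm, is sketched below. It assumes the collection script runs at least once every 5 minutes. The alarm name and the optional SNS topic are placeholders, and the choice of the Minimum statistic (depth stayed above the threshold for the whole period) is a judgment call rather than the only valid one.

```python
def tracking_alarm_params(queue: str = "tracking", threshold: int = 500,
                          sns_topic_arn: str = None) -> dict:
    """Build put_metric_alarm arguments: depth above threshold for 10 minutes."""
    params = {
        "AlarmName": f"queue-depth-{queue}",  # placeholder name
        "Namespace": "App/Queues",            # must match the collector script
        "MetricName": "QueueDepth",
        "Dimensions": [{"Name": "QueueName", "Value": queue}],
        "Statistic": "Minimum",               # even the lowest sample exceeded the threshold
        "Period": 300,                        # seconds per evaluation period
        "EvaluationPeriods": 2,               # 2 x 5 min = 10 minutes
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",      # a silent collector also alerts
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def create_alarm(params: dict, region: str = "us-east-1") -> None:
    """Create or update the alarm; needs AWS credentials and boto3 installed."""
    import boto3
    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


if __name__ == "__main__":
    print(tracking_alarm_params())
```

Treating missing data as breaching is deliberate in this sketch: if the collector dies, the alarm fires instead of going quiet, which closes the same kind of gap that hid this incident.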

The solutions: from quick fix to correct

Option 1 — Dedicated worker per slow queue (no code changes)

The fastest solution for production: create a separate systemd service for the slow queue, with the QUEUE environment variable set to process only that queue.

# /etc/systemd/system/app-worker-tracking.service
[Unit]
Description=App Queue Worker — postal tracking Only
After=docker.service
Requires=docker.service

[Service]
Type=simple
Restart=always
RestartSec=10
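# Environment= only applies to the docker client on the host, so the values
# are forwarded into the container explicitly with -e.
ExecStart=/usr/bin/docker exec -e QUEUE=${QUEUE} -e REDIS_HOST=${REDIS_HOST} -e REDIS_PORT=${REDIS_PORT} myapp-php php app/Workers/resque-worker.php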
Environment="QUEUE=tracking"
Environment="REDIS_HOST=<redis-endpoint>"
Environment="REDIS_PORT=6379"

[Install]
WantedBy=multi-user.target

With 2 ASG instances and 1 dedicated worker per instance, the drain time for 'tracking' does not change: 1,480 / 2 workers × 1.5s ≈ 18.5 minutes, still over the 15-minute interval. What the dedicated worker fixes is the blocking of the other queues. To drain within the cycle, increase the ASG minimum from 2 to 3 instances: 1,480 / 3 workers × 1.5s ≈ 12.3 minutes.

Option 2 — Job-level parallelism (code change)

The correct medium-term solution: batch the orders when enqueuing and use curl_multi_exec (PHP) or asyncio/aiohttp (Python) to fire N requests in parallel inside a single job.

With batches of 10 and real parallelism: 148 jobs instead of 1,480, each job making 10 requests in parallel. Total time with 2 workers: under 2 minutes for a full cycle.
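A minimal sketch of the batching idea in Python with asyncio follows. The HTTP call is replaced by a stand-in coroutine (fake_fetch) so the example runs without network access. In a real job that coroutine would wrap an aiohttp request to the tracking API, whose URL and response format are not covered here.

```python
import asyncio


def batches(items: list, size: int) -> list:
    """Split `items` into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def process_batch(order_ids: list, fetch) -> dict:
    """Run one request per order concurrently inside a single job.

    Exceptions are returned as values, so one failing order does not
    discard the results of the other orders in the batch.
    """
    results = await asyncio.gather(
        *(fetch(order_id) for order_id in order_ids),
        return_exceptions=True,
    )
    return dict(zip(order_ids, results))


async def fake_fetch(order_id: int) -> str:
    """Stand-in for the ~1.5s call to the external tracking API."""
    await asyncio.sleep(0.01)
    return f"status-for-{order_id}"


if __name__ == "__main__":
    jobs = batches(list(range(1480)), 10)
    print(len(jobs), "jobs instead of 1480")
    print(asyncio.run(process_batch(jobs[0], fake_fetch)))
```

Each job now costs about as long as its slowest request rather than the sum of all ten. Before choosing a batch size it is worth checking the external API's rate limits, since ten concurrent requests per worker is a tenfold burst compared with the sequential version.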

The pattern applies beyond Resque

This diagnosis applies to any queue system where a worker processes multiple queues sequentially with blocking I/O:

Sidekiq (Ruby): a process that serves several queues in strict order with low concurrency. A queue full of slow external API jobs can occupy every thread and hold up all the others.

BullMQ (Node.js): worker without configured concurrency (default 1) in a queue with awaited HTTP calls — same problem, different runtime.

AWS SQS + Lambda: this problem doesn't exist by design (Lambda scales horizontally by default), but SQS queues processed by single-thread EC2 replicate the pattern.

The rule is simple: if your worker makes external network calls and processes multiple queues in the same thread, any queue with a slow API will hold up all the others. Queue depth monitoring is not optional — it's the only signal that arrives before accumulation becomes an incident.

Operational lesson

Three configuration changes that eliminate this class of problem:

1. Dedicated worker per queue with slow external I/O. Never mix blocking external API queues with fast-processing queues in the same worker.

2. Queue depth alarm in CloudWatch. Threshold: any main queue with depth > N for more than 1 trigger cycle → immediate alert.

3. ASG minimum compatible with job volume. If each instance has 1 worker and the job cycle requires N workers to drain within the trigger interval, the ASG minimum needs to be N.
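The third point reduces to one formula: the number of workers needed is the total work per cycle divided by the time available, rounded up. Here is a small sketch, assuming one worker per instance, a constant job time, and an optional headroom fraction that is a made-up safety margin rather than a value from the incident.

```python
import math


def min_workers(jobs_per_cycle: int, seconds_per_job: float,
                interval_seconds: float, headroom: float = 0.0) -> int:
    """Smallest worker count that drains one cycle inside the trigger interval.

    `headroom` reserves a fraction of the interval (0.2 = keep 20% spare).
    """
    budget = interval_seconds * (1.0 - headroom)
    return math.ceil(jobs_per_cycle * seconds_per_job / budget)


if __name__ == "__main__":
    print(min_workers(1480, 1.5, 900))                # 3
    print(min_workers(1480, 1.5, 900, headroom=0.2))  # 4
```

For this incident the result is 3 workers with no margin, which matches the move from 2 to 3 instances, and 4 if a fifth of the interval is kept free for slow API responses.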

Related articles

504 with no high CPU, no queue, no RDS: when the infrastructure is green but the payment gateway stopped responding

Late-night deploy took down the fintech: why ASG min=1 and always-on CI/CD are incompatible

CURLOPT_TIMEOUT = 0: the infinite timeout freezing the payment gateway for 60 seconds

RDS CPU at 94%: how missing indexes on the wrong table cost $566/month — and why you must measure before rightsizing