Buzeli
buzeliSoluções Digitais
Costs

RDS CPU at 94%: how missing indexes on the wrong table cost $566/month — and why you must measure before rightsizing

Published on May 1, 2026

The starting point: 94% CPU with 8 vCPUs

The RDS for a fintech client was on db.r6g.2xlarge — 8 vCPUs, 64 GB RAM, $747/month On-Demand. In the week before the intervention, CPU peaked at 94.1% on a Thursday. Other weekdays: 27%, 43%, frequent spikes above 20%.

Memory was at 57.4 GB free out of 64 GB — 90% idle. Connections: average 1.6 to 2.9, maximum 15 under normal circumstances. The database wasn't under connection or memory pressure. Only CPU.

High CPU with idle memory and low connections in a relational database usually points to one cause: queries doing work that indexes should do. The database is scanning entire tables to find rows that an index would find in microseconds.

The audit: the most-queried table with zero indexes

The index script left by the development team in the repository covered 13 tables — session, analytics, cookies, user, plans, affiliation, commission, abandoned cart, invoice. None of them was the 'transactions' table.

The 'transactions' table was the most queried in the entire system. Every operation that mattered passed through it:

Reseller reports: dozens of queries per page with filters on reseller + order_status

Tracking cron (EventBridge every 15 min): full scan with filters on pad + tracking_last_update

Exports: ORDER BY last_event_date DESC without index = filesort on 48k+ rows

Duplicate rejection at checkout: lookup by buyer + status on every new transaction

With 52,678 rows in the table at the time of audit, every query without an index scanned the ~48k active rows. The tracking cron fired every 15 minutes, queuing ~1,480 jobs — each job made a selection query on the 'transactions' table without an index.

The BEFORE benchmark: numbers that justify the work

Before creating any index, critical queries were executed with EXPLAIN and real-time measurement. All queries on the 'transactions' table returned 'type=ALL, key=NULL' in EXPLAIN — full table scan confirmed:

Copy
-- Tracking cron: 34.48ms, 48,858 rows scanned
EXPLAIN SELECT * FROM orders
WHERE tracking_code IS NOT NULL
  AND pad = 1
  AND (tracking_last_description IS NULL
       OR tracking_last_description NOT IN (
           'Final status 1', 'Final status 2'
       ))
  AND (tracking_last_update IS NULL
       OR tracking_last_update
          BETWEEN (NOW() - INTERVAL 1 MONTH) AND (NOW() - INTERVAL 30 MINUTE));
-- type: ALL | key: NULL | rows: 48858 | Extra: Using where

-- Export: 31.17ms + filesort
EXPLAIN SELECT * FROM orders ORDER BY last_event_date DESC LIMIT 100;
-- type: ALL | key: NULL | rows: 48858 | Extra: Using filesort

-- Authentication: 20.76ms
EXPLAIN SELECT * FROM user WHERE login = 'email@example.com';
-- type: ALL | key: NULL | rows: 51481

Total time for the 15 BEFORE benchmark queries: ~624ms. With indexes, the target was below 200ms for the same set.

The indexes: design choices

The 'transactions' table received 7 indexes. The choices were guided by actual queries from the code, not by 'columns with many values' criteria:

Copy
-- CRITICAL: report queries (frequent filters)
CREATE INDEX idx_tx_reseller_status
  ON orders (reseller, order_status) ALGORITHM=INPLACE LOCK=NONE;

CREATE INDEX idx_tx_vendor_status
  ON orders (vendor, order_status) ALGORITHM=INPLACE LOCK=NONE;

-- CRITICAL: tracking cron every 15 minutes
CREATE INDEX idx_tx_tracking
  ON orders (pad, tracking_last_update) ALGORITHM=INPLACE LOCK=NONE;

-- HIGH: duplicate rejection at checkout
CREATE INDEX idx_tx_buyer_status
  ON orders (buyer, order_status) ALGORITHM=INPLACE LOCK=NONE;

-- HIGH: exports with ORDER BY
CREATE INDEX idx_tx_last_event_date
  ON orders (last_event_date) ALGORITHM=INPLACE LOCK=NONE;

-- HIGH: queries by time range
CREATE INDEX idx_tx_created_at
  ON orders (created_at) ALGORITHM=INPLACE LOCK=NONE;

-- HIGH: isolated status filters
CREATE INDEX idx_tx_status
  ON orders (order_status) ALGORITHM=INPLACE LOCK=NONE;

'ALGORITHM=INPLACE LOCK=NONE' is mandatory on active production tables in MariaDB 10.11+. Without it, index creation acquires a metadata lock that blocks reads and writes on the table for the entire operation. With INPLACE, the table remains available.

Why composite indexes instead of simple ones? The query 'reseller IS NOT NULL AND order_status IN (4, 17)' benefits much more from a composite index (reseller, order_status) than from two separate simple indexes. The optimizer uses the composite index to filter directly — with two simple indexes, it picks one and uses the other as a filter, reducing the gain.

The execution: 23 indexes, 3.91 seconds

The 23 indexes were created in production via PHP script executed directly in the container. Total time: 3.91 seconds. Zero errors. Zero downtime.

The AFTER benchmark, executed immediately after creation:

Copy
| Query                              | BEFORE          | AFTER                                    | Gain   |
|-----------------------------------|-----------------|------------------------------------------|--------|
| transactions ORDER BY last_event_date   | ALL 31.17ms     | index/idx_tx_last_event_date 0.71ms  | 43.9x  |
| user login                        | ALL 20.76ms     | ref/idx_usuario_login 0.75ms             | 27.7x  |
| transactions created_at 30d             | ALL 30.91ms     | range/idx_tx_created_at 1.82ms       | 17x    |
| analytics created_at 7d           | ALL 120.25ms    | range/idx_analytics_created_at 11.38ms   | 10.6x  |
| session created_at 30d            | ALL 88.83ms     | range/idx_sessao_created_at 9.06ms       | 9.8x   |
| transactions order_status         | ALL 32.23ms     | range/idx_tx_status 5.49ms | 5.9x   |
| cookies created_at 30d            | ALL 130.33ms    | range/idx_cookies_created_at 31.34ms     | 4.2x   |

Overall reduction: 75% in total time across the 15 benchmark queries. 12 of 13 queries that were doing full table scans eliminated the ALL.

The CloudWatch impact: CPU peak 94.1% → 26%

The week before indexes (March 3-7, db.r6g.2xlarge):

Copy
| Day      | Weekday | Avg   | Max peak |
|---------|---------|-------|----------|
| Mar 3   | Mon     | 2.52% |   10.6%  |
| Mar 4   | Tue     | 2.57% |   43.6%  |
| Mar 5   | Wed     | 2.50% |   27.3%  |
| Mar 6   | Thu     | 2.54% |   94.1%  |
| Mar 7   | Fri     | 1.74% |   19.0%  |
| Average |         | 2.37% |   38.9%  |

The week after indexes (March 10-14, same db.r6g.2xlarge instance):

Copy
| Day      | Weekday | Avg   | Max peak |
|---------|---------|-------|----------|
| Mar 10  | Mon     | 2.30% |   14.0%  |
| Mar 11  | Tue     | 2.94% |   26.0%  |
| Mar 12  | Wed     | 2.42% |   11.7%  |
| Mar 13  | Thu     | 2.17% |    8.3%  |
| Mar 14  | Fri     | 1.51% |    8.1%  |
| Average |         | 2.27% |   13.6%  |

Absolute peak: from 94.1% to 26.0%. Average of peaks: from 38.9% to 13.6% — 65% reduction. The only peak above 20% in the second week (26% on Tuesday) was caused by an operational deploy event, not database load.

The correct sequence: indexes first, rightsize later

With indexes validated over a full business week, rightsizing was possible with real data. The logic is simple: if you rightsize before indexes, you're sizing the instance for an abnormal load — with full scans that shouldn't exist.

Buying an RI before rightsizing is the most expensive mistake: you lock in 1 or 3 years at a size that will necessarily be reduced after indexes. A Reserved Instance on a db.r6g.2xlarge paid upfront and then discarded for an xlarge represents lost capital.

The correct sequence:

1. Indexes: eliminate full scans, measure real CPU impact

2. Wait for a full business week: collect CPU metrics under real production load

3. Rightsize: size for post-index load, not pre-index load

4. Wait for another full business week: validate that the smaller instance handles the load

5. Reserved Instance: only after the correct size is confirmed by real data

The rightsizing: db.r6g.2xlarge → xlarge

With data from the March 10-14 week confirming a maximum peak of 26% (and that peak being an operational event, not real load), the projection for db.r6g.xlarge (4 vCPUs) was straightforward: percentages double with half the vCPUs.

Rightsizing was scheduled with '--no-apply-immediately' (applied at the next maintenance window, 01:00 BRT Saturday). The RDS was unavailable for ~10-15 minutes during the window.

In the first full business week with db.r6g.xlarge (March 17-21):

Copy
| Day      | Weekday | Avg   | Max peak |
|---------|---------|-------|----------|
| Mar 17  | Mon     | 5.00% |   31.6%  |
| Mar 18  | Tue     | 4.52% |   39.4%  |
| Mar 19  | Wed     | 4.75% |   40.9%  |
| Mar 20  | Thu     | 4.18% |   20.1%  |
| Mar 21  | Fri     | 2.76% |   14.4%  |
| Average |         | 4.24% |   29.3%  |

Maximum peak of 40.9% on Wednesday — and that peak occurred in a single minute at 3pm local time (social media traffic burst). The hourly average was 8.2%. Memory: 30 GB free out of 32 GB — comfortable. Rightsizing was validated.

Further reduction to db.r6g.large (2 vCPUs) was discarded: projected maximum peak ~82% with no safety margin for unpredictable spikes. The additional savings of $213/month didn't justify the risk.

The Reserved Instance: $193/month additional savings

With the correct size validated over a full week, the Reserved Instance was purchased: db.r6g.xlarge, 1 year, No Upfront, Single-AZ, MariaDB, us-east-1.

Copy
| RI Option          | Upfront | Hourly  | Monthly equiv | Savings vs OD  |
|-------------------|---------|---------|---------------|----------------|
| On-Demand         | $0      | $0.519  | $374/month    | —              |
| 1yr No Upfront    | $0      | $0.248  | $181/month    | $193/month (52%)|
| 1yr Partial       | $1,035  | $0.118  | $172/month    | $202/month (54%)|
| 1yr All Upfront   | $2,028  | $0      | $169/month    | $205/month (55%)|

Choice: 1yr No Upfront ($0 upfront, $0.248/h). Reason: the difference between No Upfront and All Upfront is $12/month — not worth locking up $2,028 in capital. With a 3yr RI, the difference vs 1yr No Upfront is $65/month in additional savings, but requires 36 months of certainty.

The final tally: $566/month, $6,792/year

The complete database optimization cycle (excluding EC2 RIs, covered in another post):

Copy
| Step                                 | Savings/month | Running total |
|-------------------------------------|--------------|---------------|
| Baseline: db.r6g.2xlarge On-Demand  | —            | $747/month    |
| After indexes + rightsizing (xlarge)| $373/month   | $374/month    |
| After 1yr No Upfront RI             | $193/month   | $181/month    |
| Total saved                         | $566/month   | $6,792/year   |
The 'transactions' table had zero indexes. That's the detail that matters most in this case. It wasn't a question of advanced database tuning — it was the absence of basics on the table that mattered most. The RDS was paying $747/month to do work that indexes would have done for free.

How to replicate this cycle

The five steps that turn 'RDS with high CPU' into 'optimized database with correct cost':

1. Audit actual queries: collect the 10-20 most frequent queries from the code (not estimates), run with EXPLAIN, identify full scans on the highest-volume tables.

2. Indexes with ALGORITHM=INPLACE LOCK=NONE: on active production tables, always. Without this, index creation can lock the table for minutes.

3. Before and after benchmark: document access type (ALL vs ref vs range), rows scanned, and execution time for each critical query. Without a benchmark, you don't know the real impact.

4. Rightsize after a business week: never size based on pre-index CPU metrics. Wait for real post-optimization data.

5. Reserved Instance after validated rightsize: only reserve when the correct size is confirmed. RI on the wrong size is locked capital that will be discarded.