RabbitMQ Message TTL Gotcha

RabbitMQ is a standalone message broker whose many out-of-the-box features make it a good fit for distributing tasks to multiple workers.

I started using RabbitMQ as a backing queue for work that involved gathering and processing data from other websites and APIs.

Sometimes these tasks fail intermittently, for example when a worker can't reach a resource because of a transient network issue. In that case, I wanted to requeue the job at a later time, multiple times if necessary, with exponential backoff determining each delay.

Luckily, RabbitMQ lets you control the time-to-live (TTL) of messages. This can be set at the level of a queue or of individual messages.

So I thought I'd just keep track of failure counts in each message, calculate and set message TTL in my producer code, and chuck all the messages into a single queue dedicated to things needing to be retried. This queue would also have its dead letter exchange set, so that expiring messages would be routed accordingly.
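Here is a rough sketch of what that first plan amounts to, not my exact producer code. The base delay, the cap, the "x-failures" header, and the "jobs" exchange name are all made up for illustration. The two RabbitMQ details it leans on are that per-message TTL goes in the message's "expiration" property as a string of milliseconds, and that the dead letter exchange is set with the "x-dead-letter-exchange" queue argument.

```python
# Sketch of the single-retry-queue plan: the producer computes a TTL per message.
# All names and numbers below are illustrative assumptions.

BASE_DELAY_MS = 1_000          # hypothetical delay before the first retry
MAX_DELAY_MS = 60 * 60 * 1000  # hypothetical cap of one hour


def retry_ttl_ms(failures: int) -> int:
    """Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY_MS."""
    if failures < 1:
        raise ValueError("failures must be at least 1")
    return min(BASE_DELAY_MS * 2 ** (failures - 1), MAX_DELAY_MS)


def retry_properties(failures: int) -> dict:
    """Message properties to publish a failed job with.

    RabbitMQ expects per-message TTL in the `expiration` property,
    as a string holding the number of milliseconds.
    """
    return {
        "expiration": str(retry_ttl_ms(failures)),
        "headers": {"x-failures": failures},  # hypothetical header name
    }


# Arguments for the one shared retry queue. When a message expires, RabbitMQ
# republishes it to this exchange ("jobs" is an assumed name).
RETRY_QUEUE_ARGUMENTS = {"x-dead-letter-exchange": "jobs"}

if __name__ == "__main__":
    for n in range(1, 5):
        print(n, retry_properties(n))
```

With a client library such as pika, the dict from retry_properties would be turned into the message properties passed to the publish call, and RETRY_QUEUE_ARGUMENTS would be passed as the arguments when declaring the retry queue.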

There was an issue with this approach though, one that I caught while running a test batch of jobs with random failures.

The messages were not flowing out of the retry queue and into the exchange as I had expected. There were starts and stops. Searching around, I found the answer right on RabbitMQ's TTL documentation page, under "Caveats":

When setting per-message TTL expired messages can queue up behind non-expired ones until the latter are consumed or expired.
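To see why that produces starts and stops, here is a toy model of the behavior, not anything RabbitMQ itself runs. It assumes every message is published at the same instant, that nothing consumes from the queue, and that a message can only be dead-lettered once it reaches the head of the queue. The function name is mine.

```python
# Toy model of head-of-queue expiry with per-message TTLs.
# Assumptions: all messages are published at t=0 and there are no consumers.

def dead_letter_times(ttls_ms: list[int]) -> list[int]:
    """Return the time (ms) at which each message actually leaves the queue.

    A message leaves when its own TTL has elapsed AND every message
    ahead of it has already left.
    """
    times = []
    head_free_at = 0  # earliest time the head of the queue is clear
    for ttl in ttls_ms:
        head_free_at = max(head_free_at, ttl)
        times.append(head_free_at)
    return times


if __name__ == "__main__":
    # A job on its 4th failure (8s) sits in front of two fresh failures (1s, 2s).
    print(dead_letter_times([8000, 1000, 2000]))
```

In this model the 1-second and 2-second messages are stuck until the 8-second message in front of them expires, and then all three are released together.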

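The solution is just a bit more complicated. Rather than having one queue holding messages of varying TTL, you create one queue for each possible TTL and route your retrying jobs accordingly. Every message in a given queue then has the same TTL, so messages expire in the order they arrived and an expired message never gets stuck behind one that still has time left.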
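A minimal sketch of that layout, assuming a fixed number of retry levels. The queue naming scheme, the base delay, the number of levels, and the "jobs" exchange are placeholders I picked for the example. The TTL here is set on each queue with the "x-message-ttl" argument (an integer number of milliseconds) instead of on each message.

```python
# Sketch: one retry queue per backoff level, each with a queue-level TTL.
# Names and numbers are illustrative assumptions.

BASE_DELAY_MS = 1_000          # hypothetical delay for the first retry
MAX_RETRIES = 5                # hypothetical number of backoff levels
DEAD_LETTER_EXCHANGE = "jobs"  # assumed exchange that feeds the workers


def level_ttl_ms(level: int) -> int:
    """TTL for a backoff level: 1s, 2s, 4s, 8s, 16s."""
    return BASE_DELAY_MS * 2 ** (level - 1)


def retry_queue_name(failures: int) -> str:
    """Pick the retry queue for a job that has failed `failures` times.

    Failure counts beyond MAX_RETRIES reuse the longest delay.
    """
    level = min(max(failures, 1), MAX_RETRIES)
    return f"retry.{level_ttl_ms(level)}ms"


def retry_queue_declarations() -> dict:
    """Map each retry queue name to the arguments to declare it with."""
    return {
        retry_queue_name(level): {
            "x-message-ttl": level_ttl_ms(level),
            "x-dead-letter-exchange": DEAD_LETTER_EXCHANGE,
        }
        for level in range(1, MAX_RETRIES + 1)
    }


if __name__ == "__main__":
    for name, arguments in retry_queue_declarations().items():
        # With pika this would be roughly:
        #   channel.queue_declare(queue=name, durable=True, arguments=arguments)
        print(name, arguments)
```

The producer no longer sets a TTL at all. It only decides which retry queue a failed job goes to, based on the failure count.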

Running the test again, I found that this solution requeued failed jobs as expected, without the starts and stops of the previous attempt.