Just putting up this quick post to link to a job queue simulator to help prove out some stuff I've been working on.
The main points are:
- multiple tenants
- one shared queue
- one active worker per tenant
Without any additions, the problem you'll run into is large single-tenant chunks of jobs that block useful work from occurring.
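Concretely, the baseline looks roughly like this. This is just an in-memory stand-in with hypothetical names, not the simulator's actual code:

```python
import collections
import threading

# Minimal model of the setup above: one shared FIFO, a lock per tenant,
# and N identical workers all running worker().
queue = collections.deque()        # shared queue of (tenant_id, job) tuples
tenant_locks = {}                  # tenant_id -> threading.Lock
locks_guard = threading.Lock()     # protects tenant_locks itself

def get_tenant_lock(tenant_id):
    with locks_guard:
        return tenant_locks.setdefault(tenant_id, threading.Lock())

def process(tenant_id, job):
    ...  # the actual work

def worker():
    while True:
        try:
            tenant_id, job = queue.popleft()   # take the next job in FIFO order
        except IndexError:
            return                             # queue drained
        # Baseline behaviour: block until this tenant is free, even though the
        # rest of the queue may be full of other tenants' jobs we could run.
        with get_tenant_lock(tenant_id):
            process(tenant_id, job)
```

With a queue like aaabbbccc, every worker except one ends up waiting on tenant a's lock.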
The first mitigation I put in place was striping. If you know ahead of time what the work will be, you can stripe the jobs so that the queue goes from aaabbbccc to abcabcabc. However, in my case, I had jobs that would themselves enqueue a variable number of child jobs, leading to those chunks again.
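For reference, striping a known batch is just a round-robin interleave by tenant. A quick sketch, assuming each job carries a tenant_id:

```python
from collections import defaultdict
from itertools import zip_longest

def stripe(jobs):
    """Interleave jobs by tenant: aaabbbccc -> abcabcabc."""
    by_tenant = defaultdict(list)
    for job in jobs:
        by_tenant[job.tenant_id].append(job)
    # Take one job from each tenant per round until every tenant's list drains.
    striped = []
    for round_ in zip_longest(*by_tenant.values()):
        striped.extend(job for job in round_ if job is not None)
    return striped
```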
So, the next idea was to re-enqueue a job when another worker has already acquired the lock for its tenant. To keep workers from "churning" and burning CPU, I put a slight cooldown timer on the worker side, and put the re-enqueued jobs in a delay queue on the message-broker side.
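The worker-side logic looks roughly like this. The broker.enqueue_delayed, locks.for_tenant, try_acquire, and process names are stand-ins for whatever your broker and lock manager actually expose, and the timer values are made up:

```python
import time

WORKER_COOLDOWN_SECONDS = 0.05   # hypothetical values to tune in the sim
BROKER_DELAY_SECONDS = 5

def handle(job, broker, locks):
    lock = locks.for_tenant(job.tenant_id)       # assumed lock-manager API
    if not lock.try_acquire():
        # Another worker owns this tenant: hand the job back to the broker's
        # delay queue so it reappears later, and back off briefly so this
        # worker doesn't spin through locked jobs at full speed.
        broker.enqueue_delayed(job, delay=BROKER_DELAY_SECONDS)
        time.sleep(WORKER_COOLDOWN_SECONDS)
        return
    try:
        process(job)
    finally:
        lock.release()
```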
The most recent change was to use a random delay queue. This helps with the scenario where you get 1000 jobs for tenant X and 10 jobs for tenant Y. As things progress, one worker will acquire the first tenant-Y job, but the remaining 9 will likely be thrown to the back of the queue, because the other workers cannot acquire the lock. This creates a scenario where processing time is heavily correlated with the "cycle time" of the queue: those 10 jobs will take roughly 10x the cycle time to all finish, which is much longer than just running them serially.
By enqueuing them to a random TTL queue, we are able to break them up over time.
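The only change from the fixed delay is picking the delay at random, something like this (same assumed broker interface as above, made-up bounds):

```python
import random

MIN_DELAY_SECONDS = 1
MAX_DELAY_SECONDS = 30   # hypothetical window; wider spreads jobs out more

def requeue_with_jitter(job, broker):
    # Rather than a fixed delay, scatter re-enqueued jobs across a window so a
    # small tenant's jobs don't all land at the back of the queue together.
    delay = random.uniform(MIN_DELAY_SECONDS, MAX_DELAY_SECONDS)
    broker.enqueue_delayed(job, delay=delay)   # assumed broker delay-queue API
```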
Some additional ideas include:
- Create another type of consumer on the queue that prefetches many messages at a time, shuffles them in memory, then re-enqueues them in this newly shuffled order. This should need fewer passes and achieve better results than the random delay method (see the first sketch after this list).
- Allow workers to prefetch many messages at once. Once those messages are retrieved, the worker counts the jobs by tenant ID, acquires a lock for the largest unclaimed tenant group, and returns the remaining jobs to the queue. It then holds the lock while working through the kept jobs and releases it when all of them are finished (see the second sketch after this list).
- Not a real queue change, but an enhancement to the sim: add a switch for "auto-scale workers" that could scale based on queue length or queue variety, with a configurable scaling factor and limits.
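A rough sketch of the shuffling consumer, again with an assumed broker.fetch/enqueue interface rather than any real client library:

```python
import random

def shuffle_pass(broker, batch_size=1000):
    # Drain a large batch, shuffle it in memory, and put it straight back, so
    # tenants get mixed in one pass instead of many delay-queue round trips.
    batch = broker.fetch(batch_size)   # assumed bulk-prefetch API
    random.shuffle(batch)
    for job in batch:
        broker.enqueue(job)
```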
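And the prefetch-and-claim-largest-group idea might look something like this (same assumed broker and lock-manager interfaces; process stands in for the job handler):

```python
from collections import defaultdict

def claim_largest_group(broker, locks, batch_size=100):
    batch = broker.fetch(batch_size)                    # assumed bulk-prefetch API
    by_tenant = defaultdict(list)
    for job in batch:
        by_tenant[job.tenant_id].append(job)

    # Try tenants from the largest group down; keep the first one we can lock.
    for tenant_id, jobs in sorted(by_tenant.items(), key=lambda kv: -len(kv[1])):
        lock = locks.for_tenant(tenant_id)
        if not lock.try_acquire():
            continue
        # Return everyone else's jobs, then work through the kept group.
        for other_id, other_jobs in by_tenant.items():
            if other_id != tenant_id:
                for job in other_jobs:
                    broker.enqueue(job)
        try:
            for job in jobs:
                process(job)
        finally:
            lock.release()
        return

    # Couldn't lock any tenant in the batch; give everything back.
    for jobs in by_tenant.values():
        for job in jobs:
            broker.enqueue(job)
```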