
Queued Laravel Notifications get stuck on AWS SQS

I have a worker on AWS that handles queued Laravel notifications. Some of the notifications get sent out, but others get stuck in the queue and I don't know why.

I've looked at the logs in Beanstalk and see three different types of error:

2020/11/03 09:22:34 [emerg] 10932#0: *30 malloc(4096) failed (12: Cannot allocate memory) while reading upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock:", host: "localhost"

I see an Out of Memory issue on Bugsnag too, but without any stacktrace.

Another error is this one:

2020/11/02 14:50:07 [error] 10241#0: *2623 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock", host: "localhost"

And this is the last one:

2020/11/02 15:00:24 [error] 10241#0: *2698 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 127.0.0.1, server: , request: "POST /worker/queue HTTP/1.1", upstream: "fastcgi://unix:/run/php-fpm/www.sock:", host: "localhost"

I don't really understand what I can do to resolve these errors. It's just a basic Laravel / EBS / SQS setup, and the only thing the queue has to do is handle notifications, sometimes a couple dozen at a time. I'm running a t2.micro, and I would assume that's enough to send a few e-mails? I've upped the environment to a t2.large, but to no avail.

I notice that messages end up in the queue, then get the status 'Messages in flight', but then run into all sorts of trouble on the Laravel side. And I don't get any useful errors to work with.

All implementation code seems to be fine, because the first few notifications go out as expected and if I don't queue at all, all notifications get dispatched right away.

The queued notifications eventually generate two different exceptions: MaxAttemptsExceededException and an Out of Memory FatalError, but neither leads me to the actual underlying problem.

Where do I look further to debug?


UPDATE

See my answer for the problem and the solution: the database transaction hadn't been committed yet when the worker tried to send a notification for an object that didn't exist in the database yet.

What is the current memory_limit assigned to PHP? You can determine this by running this command:

php -i | grep memory_limit

You can increase this by running something like:

sed -i -e 's/memory_limit = [current-limit]/memory_limit = [new-limit]/g' [full-path-to-php-ini]

Just replace the [current-limit] with the value displayed in the first command, and [new-limit] with a new reasonable value. This might require trial and error. Replace [full-path-to-php-ini] with the full path to the php.ini that's used by the process that's failing. To find this, run:

php -i | grep php.ini
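For example, if the first command reports 128M, you want to raise it to 512M, and the php.ini in question lives at /etc/php.ini (all three values are just placeholders for whatever your own setup shows), the command becomes:

sed -i -e 's/memory_limit = 128M/memory_limit = 512M/g' /etc/php.ini

Afterwards, restart PHP-FPM so the worker processes pick up the new limit.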

First, make sure that you have increased max_execution_time as well as memory_limit.
Also make sure that you set the --timeout option on the queue worker.
Then make sure you follow the instructions for Amazon SQS, as the Laravel docs say:

The only queue connection which does not contain a retry_after value is Amazon SQS. SQS will retry the job based on the Default Visibility Timeout which is managed within the AWS console.

Job Expirations & Timeouts
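For reference, the sqs connection in a stock config/queue.php looks roughly like this (the URL and region below are placeholders); note that, unlike the other drivers, there is no retry_after key, so retries are governed entirely by the queue's Default Visibility Timeout in the AWS console:

// config/queue.php (excerpt)
'sqs' => [
    'driver' => 'sqs',
    'key' => env('AWS_ACCESS_KEY_ID'),
    'secret' => env('AWS_SECRET_ACCESS_KEY'),
    'prefix' => env('SQS_PREFIX', 'https://sqs.us-east-1.amazonaws.com/your-account-id'),
    'queue' => env('SQS_QUEUE', 'your-queue-name'),
    'region' => env('AWS_DEFAULT_REGION', 'us-east-1'),
    // no 'retry_after' here - redelivery is controlled by the visibility timeout
],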

If you are sure that some of the queued events are correctly received and processed by the Laravel worker, then, as others said, it's most likely a PHP memory issue.

On Beanstalk, here's what I added to my .ebextensions to give PHP more memory (in my case it was for Composer memory issues):

Note that this is with a t3.medium EC2 instance with 4 GB of RAM, dedicated to a Laravel API only.

02-environment.config

commands:
  ...
option_settings:
  ...
  - namespace: aws:elasticbeanstalk:container:php:phpini
    option_name: memory_limit
    value: 4096M
  - namespace: aws:ec2:instances
    option_name: InstanceTypes
    value: t3.medium

So you can try increasing the limit to use more of your instance's available RAM, and deploy again so Beanstalk will rebuild the instance and set up the PHP memory_limit.

Note: the real config of course contains other configuration files and more content that has been truncated here.

As you said, you are just sending an email, so it should be OK. Is it happening when there's a burst of queued emails? Are there, in the end, many events in the SQS dead-letter queue? If so, it may be because of a queued email burst, in which case SQS will "flood" the /worker route to execute your jobs. You could check server usage from the AWS console or with CLI tools like htop, and also check the SQS interface to see whether many failed jobs arrive at the same moments (burst).

Edit: for Elastic Beanstalk I use dusterio/laravel-aws-worker; maybe you do too, since your log mentions the /worker/queue route.

Memory

The default amount of memory allocated to PHP can often be quite small. When using EBS, you want to use config files as much as possible - any time you're having to SSH in and change things on the server, you're going to have more issues when you need to redeploy. I have this added to my EBS config /.ebextensions/01-php-settings.config:

option_settings:
  aws:elasticbeanstalk:container:php:phpini:
    memory_limit: 256M

That's been enough when running a t3.micro to do all my notification and import processing. For simple processing it doesn't usually need much more memory than the default, but it depends a fair bit on your use-case and how you've programmed your notifications.
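If you're not sure how much memory your notifications actually need, one quick way to find out (just a sketch; the ArticlePublished class and the log message are made up for illustration) is to log peak usage from the notification itself and compare it with the configured limit:

<?php

namespace App\Notifications;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Notifications\Messages\MailMessage;
use Illuminate\Notifications\Notification;
use Illuminate\Support\Facades\Log;

class ArticlePublished extends Notification implements ShouldQueue
{
    use Queueable;

    public function via($notifiable)
    {
        return ['mail'];
    }

    public function toMail($notifiable)
    {
        // Record how much memory this notification really uses versus the limit,
        // so you can tell whether memory_limit is actually the problem.
        Log::info('Notification memory usage', [
            'peak_bytes'   => memory_get_peak_usage(true),
            'memory_limit' => ini_get('memory_limit'),
        ]);

        return (new MailMessage)->line('A new article has been published.');
    }
}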

Timeout

As pointed out in this answer already, the SQS queue operates a little differently when it comes to timeouts. This is a small trait that I wrote to help work around this issue:

<?php

namespace App\Jobs\Traits;

trait CanExtendSqsVisibilityTimeout
{
    /** NOTE: this needs to map to setting in AWS console */
    protected $defaultBackoff = 30;

    protected $backoff = 30;

    /**
     * Extend the time that the job is locked for processing
     *
     * SQS messages are managed via the default visibility timeout console setting; noted absence of retry_after config
     * @see https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts
     * AWS recommends to create a "heartbeat" in the consumer process in order to extend processing time:
     * @see https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html#configuring-visibility-timeout
     *
     * @param int $delay  Number of seconds to extend the processing time by
     *
     * @return void
     */
    public function extendBackoff($delay = 60)
    {
        if ($this->job) {
            // VisibilityTimeout has a 12 hour (43200s) maximum and will error above that; no extensions if close to it
            if ($this->backoff + $delay > 42300) {
                return;
            }
            // add the delay
            $this->backoff += $delay;
            $sqs = $this->job->getSqs();
            $sqsJob = $this->job->getSqsJob();
            $sqs->changeMessageVisibility([
                'QueueUrl' => $this->job->getQueue(),
                'ReceiptHandle' => $sqsJob['ReceiptHandle'],
                'VisibilityTimeout' => $this->backoff,
            ]);
        }
    }
}

Then for a queued job that was taking a long time, I changed the code a bit to work out where I could insert a sensible "heartbeat". In my case, I had a loop:

class LongRunningJob implements ShouldQueue
{
    use CanExtendSqsVisibilityTimeout;

    //...

    public function handle()
    {
        // some other processing, no loops involved

        // now the code that loops!
        $last_extend_at = time();
        foreach ($tasks as $task) {
            $task->doingSomething();

            // extend shortly before the current visibility timeout would expire,
            // so the job doesn't time out, but avoid calling SQS on every iteration
            if (time() >= $last_extend_at + $this->defaultBackoff - 10) {
                // "heartbeat" to extend visibility timeout
                $this->extendBackoff();
                $last_extend_at = time();
            }
        }
    }
}

Supervisor

It sounds like you might need to look at how you're running your worker(s) in a bit more detail.

Having Supervisor running to help restart your workers is a must, I think. Otherwise, if the worker(s) stop working, the messages that are queued up will end up getting deleted as they expire. It's a bit fiddly to get working nicely with Laravel + EBS - there isn't really much good documentation around it, which is potentially why not having to manage it is one of the selling points of Vapor!
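As a rough illustration only (not something spelled out in the answers above): if you run the worker yourself with php artisan queue:work rather than relying on the Elastic Beanstalk worker daemon, a minimal Supervisor program along the lines of the Laravel docs looks something like this, where the /var/www/html path, webapp user and worker options are placeholders to adapt to your environment:

[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/artisan queue:work sqs --sleep=3 --tries=3 --timeout=60
autostart=true
autorestart=true
user=webapp
numprocs=2
redirect_stderr=true
stdout_logfile=/var/www/html/storage/logs/worker.log
stopwaitsecs=3600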

We finally found out what the problem was, and it wasn't memory or execution time.

From the beginning I thought it was strange that either the default memory or the default execution time wasn't sufficient to send an e-mail or two.

Our use case is: a new Article is created and users receive a notification.

A few clues that led to the solution:

  • We noticed that we usually have problems with the first notification.
  • If we create 10 articles at the same time, we miss the first notification on every article.
  • We set the HTTP Max Connections in the Worker to 1. When creating 10 articles simultaneously, we noticed that only the first article missed the first notification.
  • We didn't get any useful error messages from the Worker, so we decided to set up our own EC2 instance and run the Laravel queue worker manually with php artisan (see the sketch below).
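For reference, "running the queue manually" boiled down to something like the following on that EC2 instance (the connection name and options are whatever your own setup uses):

php artisan queue:work sqs --tries=3 -v

With the worker running in a foreground terminal, you can watch its output and storage/logs/laravel.log directly instead of relying on the EBS worker daemon.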

What we then saw explained everything: Illuminate\Database\Eloquent\ModelNotFoundException: No query results for model [App\Article]

This is an error that we never got from the EBS Worker / SQS, and it swiftly led us to the solution:

The notification is handled before the article has made it to the database.

We have added a delay to the worker and haven't had a problem since. We recently added a database transaction to the process of creating an article, and the notification is created within that transaction (at the very end). I think that's why we didn't have this problem before. We decided to leave the notification creation inside the transaction and simply handle the notifications with a delay, which means we don't have to do a hotfix to get this solved.
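For anyone who wants to do the same, this is roughly what such a delay looks like (a sketch with a hypothetical ArticlePublished notification class and an arbitrary 60-second delay, not our exact production code); a queued notification that uses the Queueable trait can be delayed at the point where it is sent:

use App\Notifications\ArticlePublished; // hypothetical queued notification (implements ShouldQueue, uses Queueable)

// The delay gives the surrounding database transaction time to commit
// before the queue worker picks up the job and loads the Article model.
$user->notify(
    (new ArticlePublished($article))->delay(now()->addSeconds(60))
);

On newer Laravel versions you can alternatively call afterCommit() on the notification, or enable the after_commit option on the queue connection, so the job is only dispatched once the transaction has actually committed.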

Thanks to everyone who joined in to help!
