puppeteer-cluster: Setting a timeout on individual execution tasks

I'm trying to get individual tasks to throw a timeout during stress testing to see what my calling program will do. However, my cluster keeps tasks fresh indefinitely. It appears to queue all my cluster.execute calls, which are then kept in memory and return their results to listeners that have long since disconnected.

The docs state:

timeout <number> Specify a timeout for all tasks. Defaults to 30000 (30 seconds).

My cluster launch configuration:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 1,
    timeout: 1000 //milliseconds
});

I'm calling the queuing mechanism using:

const pdf = await cluster.execute(html, makePdf);

Where makePdf is an async function that expects an HTML string, fills a page with it, and prints a PDF using puppeteer's default settings.

const makePdf = async ({ page, data: html, worker }) => {
    await page.setContent(html);
    let pdf = await page.pdf({});
    // count is a counter assumed to be defined elsewhere in the program
    console.log('worker ' + worker.id + ' task ' + count);
    return pdf;
};

I sort of expected the queue to start emptying itself until it found a task that didn't exceed its timeout value. I've tried setting the timeout to 1 ms but this doesn't trigger a timeout either. I've tried moving this code to a cluster.task as described in the examples to see if that would trigger the setting, but no such luck. How do I get already queued requests to time out? Does this even work if I'm not scraping websites or connecting to anything?

I'm considering passing a timestamp along with my tasks so that a task can skip doing any work for requests that have already expired on the calling side, but I'd rather use built-in options wherever possible.

EDIT:

Thanks to Thomas's clarification I've decided to build this little optimization to prevent tasks where the listeners are long gone from executing.

Swap the content of data from just the html string to an object that holds both the html and a timestamp:

const timestamp = new Date();
const pdf = await cluster.execute({ html, timestamp }, makePdf);

Ignore any queued task where the listener has timed out:

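const makePdf = async ({ page, data: { html, timestamp }, worker }) => {
    // timeout_ms: how long the caller waits, assumed to be defined elsewhere
    let time_since_call = (new Date() - timestamp);
    if (time_since_call < timeout_ms) {
        await page.setContent(html);
        let pdf = await page.pdf({});
        return pdf;
    }
    // Expired jobs fall through and resolve with undefined
};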

This is a misunderstanding of what timeout does. The timeout option applies to the task itself: once a job has left the queue and started running, it cannot take longer than the specified timeout. The option does not cancel a job that is still waiting in the queue.

Example:

const cluster = await Cluster.launch({
    // ...
    maxConcurrency: 1,
    timeout: 1000 // one second
});
// ...
for (let i = 0; i < 10; i += 1) {
    cluster.queue('...');
}

This code adds 10 jobs and runs them sequentially (as maxConcurrency is 1). There is no difference between queue and execute here (see this question for more information on this topic). So what happens is the following:

  • First job starts running
  • First job is interrupted after one second
  • Second job starts running
  • Second job is interrupted after one second
  • ...
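
The sequence above can be sketched as a small toy model. This is not puppeteer-cluster's internal code; simulateTimeline and its field names are made up for illustration, and it only assumes what is described here: jobs run one at a time, and the timeout clock for a job starts when that job leaves the queue.

```javascript
// Toy model of the behaviour described above, NOT puppeteer-cluster internals.
// All jobs are assumed to be queued at virtual time 0 and run one at a time.
function simulateTimeline(durationsMs, timeoutMs) {
    let clock = 0; // virtual time in milliseconds
    return durationsMs.map((duration, index) => {
        const startedAt = clock;                      // equals the time spent waiting in the queue
        const timedOut = duration > timeoutMs;        // only the run time is compared to the timeout
        const ranFor = Math.min(duration, timeoutMs); // a slow job is interrupted at the timeout
        clock += ranFor;
        return { job: index + 1, startedAt, ranFor, timedOut };
    });
}

// Three jobs that would each need 5 seconds, with a 1 second timeout.
const timeline = simulateTimeline([5000, 5000, 5000], 1000);
console.log(timeline);
```

In this model the third job has already waited 2000 ms when it starts, yet it still runs and is only interrupted after its own second is up. Waiting time never counts against the timeout.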

The use case you are describing is currently not supported by the library (btw, disclaimer: I'm the author), but as you proposed, you could add a timestamp to the object you are queuing and cancel the job right away if it is too far in the past.
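
A minimal sketch of that timestamp idea, written as a reusable wrapper. rejectIfStale, queuedAt and maxAgeMs are hypothetical names, not part of puppeteer-cluster. The sketch assumes the caller stores a millisecond timestamp in the job data, and that an error thrown by the task function makes cluster.execute reject, which is my understanding of the documented behaviour.

```javascript
// Hypothetical helper, not part of puppeteer-cluster: wraps a task function
// so that jobs older than maxAgeMs are rejected before doing any page work.
// `now` is injectable so the check can be exercised without real waiting.
function rejectIfStale(taskFn, maxAgeMs, now = Date.now) {
    return async (args) => {
        const ageMs = now() - args.data.queuedAt; // queuedAt is set by the caller
        if (ageMs > maxAgeMs) {
            throw new Error(`Job expired after waiting ${ageMs} ms in the queue`);
        }
        return taskFn(args); // still fresh: run the real task
    };
}

// Assumed wiring with a task such as makePdf that reads data.html:
//   const task = rejectIfStale(makePdf, 5000);
//   const pdf = await cluster.execute({ html, queuedAt: Date.now() }, task);
```

Throwing instead of returning undefined lets the calling side tell an expired job apart from a successful one. Note that the stale job still occupies a worker briefly, since the check only runs once the job has left the queue.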
