
NodeJS HTTP Request Queue

I've created a scraper using puppeteer & Node.js (Express). The idea is that when the server receives an HTTP request, my app will start scraping the page.

The problem is when my app receives multiple HTTP requests at once: the scraping process starts over and over again for each one. How do I handle only one HTTP request at a time and queue the other requests until the first scraping process finishes?

Currently, I've tried node-request-queue with the code below, but no luck.

var express = require("express");
var app = express();
var reload = require("express-reload");
var bodyParser = require("body-parser");
const router = require("./routes");
const RequestQueue = require("node-request-queue");

app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());

var port = process.env.PORT || 8080;
app.use(express.static("public")); // static assets eg css, images, js

let rq = new RequestQueue(1);

rq.on("resolved", res => {})
  .on("rejected", err => {})
  .on("completed", () => {});

rq.push(app.use("/wa", router));

app.listen(port);
console.log("Magic happens on port " + port);

node-request-queue is created for the request package, which is different from express.

You can accomplish the queue using the simplest promise queue library, p-queue. It has concurrency support and looks much more readable than other libraries. You can easily switch away from promises to a robust queue like bull at a later time.

This is how you can create a queue,

const PQueue = require("p-queue");
const queue = new PQueue({ concurrency: 1 });

This is how you add an async function to the queue; it will return the resolved data if you listen to it,

queue.add(() => scrape(url));
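Under the hood, a concurrency-1 queue just chains each task onto the previous one's promise. A minimal, dependency-free sketch of that idea (illustrative only, not p-queue's actual implementation):

```javascript
// Minimal sketch of a concurrency-1 promise queue (illustrative only,
// not p-queue's real implementation): each task is chained onto the
// tail of the previous one, so tasks run strictly one at a time.
function createSerialQueue() {
  let tail = Promise.resolve();
  return {
    add(task) {
      const result = tail.then(() => task());
      tail = result.catch(() => {}); // keep the chain alive on rejection
      return result;
    },
  };
}

// even though the slowest task is added first, completion order
// follows insertion order because nothing runs concurrently
const queue = createSerialQueue();
const order = [];
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

queue.add(() => delay(30).then(() => order.push('a')));
queue.add(() => delay(10).then(() => order.push('b')));
const allDone = queue.add(() => order.push('c'));
```

With plain `Promise.all` the 10ms task would finish before the 30ms one; the serial chain is what guarantees insertion order.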

So instead of adding the route to the queue, you just remove the other lines around it and keep the router as is.

// here goes one route
app.use('/wa', router);

Inside one of your router files,

const routes = require("express").Router();

const PQueue = require("p-queue");
// create a new queue, and pass how many you want to scrape at once
const queue = new PQueue({ concurrency: 1 });

// our scraper function lives outside route to keep things clean
// the dummy function returns the title of provided url
const scrape = require('../scraper');

async function queueScraper(url) {
  return queue.add(() => scrape(url));
}

routes.post("/", async (req, res) => {
  const result = await queueScraper(req.body.url);
  res.status(200).json(result);
});

module.exports = routes;

Make sure to include the queue inside the route, not the other way around. Create only one queue in your routes file, or wherever you are running the scraper.

Here are the contents of the scraper file. You can use any content you want; this is just a working dummy,

const puppeteer = require('puppeteer');

// a dummy scraper function
// launches a browser and gets title
async function scrape(url){
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const title = await page.title();
  await browser.close();
  return title;
}

module.exports = scrape;

Result using curl:

(screenshot of curl output omitted)

Here is my git repo which has working code with a sample queue.

Warning

If you use any such queue, you will notice you have problems dealing with 100 results at the same time, and requests to your API will keep timing out because there are 99 other URLs waiting in the queue. That is why you have to learn more about real queues and concurrency at a later time.
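One hedged way to mitigate the timeouts until a real queue is in place is to bound how long a queued job may take (p-queue itself also accepts `timeout`/`throwOnTimeout` options). The same idea can be sketched with `Promise.race`; the helper name below is illustrative:

```javascript
// Sketch: reject a job if it does not settle within `ms`, so an HTTP
// request waiting behind a long queue fails fast instead of hanging.
// `withTimeout` is an illustrative helper, not part of p-queue's API.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// usage: a scrape that takes 500ms loses the race against a 100ms limit
const slowScrape = new Promise(resolve => setTimeout(() => resolve('title'), 500));
const guarded = withTimeout(slowScrape, 100).catch(err => err.message);
```

The caller then gets a prompt rejection it can turn into a 503/timeout response instead of an open connection.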

Once you understand how queues work, the other answers about puppeteer-cluster, RabbitMQ, bull queue etc. will help you at that time :).

You can use puppeteer-cluster for that (disclaimer: I'm the author). You can set up a cluster with a pool of only one worker. Therefore, the jobs given to the cluster will be executed one after another.

As you did not say what your puppeteer script should be doing, in this code example I'm extracting the title of a page as an example (given via /wa?url=...) and providing the result in the response.

// assumes express is already set up and this code runs inside an async function
const { Cluster } = require('puppeteer-cluster');

// setup the cluster with only one worker in the pool
const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 1,
});

// define your task (in this example we extract the title of the given page)
await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    return await page.evaluate(() => document.title);
});

// Listen for the request
app.get('/wa', async function (req, res) {
    // cluster.execute will run the job with the workers in the pool. As there is only one worker
    // in the pool, the jobs will be run sequentially
    const result = await cluster.execute(req.query.url);
    res.end(result);
});

This is a minimal example. You might want to catch any errors in your listener. For more information, check out a more complex example (a screenshot server using express) in the repository.
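That error handling can be sketched as a small wrapper, so a rejected `cluster.execute` becomes a 500 response instead of an unhandled rejection (`asyncHandler` and the stubbed req/res below are illustrative, not part of express or puppeteer-cluster):

```javascript
// Sketch: wrap an async express-style handler so any rejection is
// converted into a 500 response. Illustrative only.
function asyncHandler(fn) {
  return (req, res) =>
    fn(req, res).catch(err => {
      res.statusCode = 500;
      res.end(`scrape failed: ${err.message}`);
    });
}

// usage with a stubbed req/res pair standing in for express objects
const handler = asyncHandler(async (req, res) => {
  if (!req.query.url) throw new Error('missing url'); // e.g. cluster.execute rejecting
  res.end('ok');
});

const res = { statusCode: 200, body: '', end(s) { this.body = s; } };
const finished = handler({ query: {} }, res);
```

With express you would register the wrapped handler directly: `app.get('/wa', asyncHandler(...))`.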

