Puppeteer-cluster with cheerio in express router API returns empty response

I'm writing an API with express, puppeteer-cluster and cheerio that returns all anchor elements containing one or more words that can be passed as query parameters. I want to use puppeteer so that I also get elements generated by JavaScript. But for some reason it's not working: the browser shows an empty array as the output.

I've been trying to understand this library for 2 days and have made no progress. Any help is deeply appreciated.

Update: I added async to all my functions and they run now, but the result is still empty :(

Update 2: I started logging every step and found that data.name is being passed to the cheerio function as a Promise. I think that is the problem, but I don't know how to fix it yet.

Update 3: One of the issues was that the page content (the HTML code) was not being passed properly to the cheerio function. In the browser, however, the response is still empty and the console shows an error:

Error handling response: TypeError: Cannot read properties of undefined (reading 'innerText').

So, I think the response is not JSON formatted. Is res.json() not the right way to do it?

My code:

app.js

const PORT = process.env.PORT || 8000;
var path = require("path");
const express = require("express");
const cors = require("cors"); // needed for app.use(cors()) below

// Routes
const indexRouter = require("./routes/index");
const allNews = require("./routes/news");
const clusterRouter = require("./routes/cluster");

const app = express();
app.use(cors());
app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use(express.static(path.join(__dirname, "public")));

app.use("/", indexRouter);
app.use("/news", allNews);
app.use("/cluster", clusterRouter);

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`));

cluster.js

const express = require("express");
const { Cluster } = require("puppeteer-cluster");
const puppeteer = require("puppeteer-extra");
const cheerio = require("cheerio");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

var router = express.Router();
const newspapers = [
  {
    "name": "CNN",
    "address": "https://edition.cnn.com/specials/world/cnn-climate",
    "base": "https://edition.cnn.com"
  },
  {
    "name": "The Guardian",
    "address": "https://www.theguardian.com/environment/climate-crisis",
    "base": "https://www.theguardian.com"
  }]

const app = express();
puppeteer.use(StealthPlugin());

const result = [];

router.get("/", async (req, res) => {
  (async () => {
    // Query String
    const query = checkForQuery(req);
    const wordsToSearch = query ? verifyQuery(query) : "";

    console.log("Running tests.."); // This is printed on console
    
    //Functions
    function checkForQuery(request) {
      if (request.originalUrl.indexOf("?") !== -1) {
        console.log(request.query);
        return request.query;
      } else {
        return false;
      }
    }

    // // Validates query and remove invalid values
    function verifyQuery(queryString) {
      const queryParams = {
        only: queryString.only ? queryString.only : "",
        also: queryString.also ? queryString.also : "",
      };
      // Creates new list containing valid terms for search
      var newList = {
        only: [],
        also: [],
      };

      for (const [key, value] of Object.entries(queryParams)) {
        const tempId = key.toString();
        const tempVal =
          queryParams[tempId].length >= 2
            ? queryParams[tempId].split(",")
            : queryParams[tempId];
        console.log(queryParams[tempId], " and ", tempVal);
        if (tempVal.length > 1) {
          console.log("helloooooo");
          tempVal.forEach((term) => {
            if (topics.indexOf(term) != -1) {
              newList[tempId].push(term);
            }
          });
        } else {
          if (topics.indexOf(queryParams[tempId]) != -1) {
            newList[tempId].push(queryParams[tempId]);
          }
        }
      }
      console.log(newList);
      return newList;
    }

    function storeData(element, base, name) {
      const results = [];
      element.find("style").remove();
      const title = element.text();
      const urlRaw = element.attr("href");
      const url =
        urlRaw.includes("www") || urlRaw.includes("http")
          ? urlRaw
          : base + urlRaw;

      // Check for duplicated url
      if (tempUrls.indexOf(url) === -1) {
        // Check for social media links and skip
        if (!exceptions.some((el) => url.toLowerCase().includes(el))) {
          tempUrls.push(url);

          // Get img if child of anchor tag
          const imageElement = element.find("img");
          if (imageElement.length > 0) {
            // Get the src attribute of the image element

            results.push({
              title: title.replace(/(\r\n|\n|\r)/gm, ""),
              url,
              source: name,
              imgUrl: getImageFromElement(imageElement),
            });
          } else {
            results.push({
              title: title.replace(/(\r\n|\n|\r)/gm, ""),
              url: url,
              source: name,
            });
          }
        }
      }
      return results;
    }

    function getElementsCheerio(html, base, name, searchterms) {
      console.log(html, base, name);
      const $ = cheerio.load(html);
      console.log(searchterms);
      const concatInfo = [];

      if (searchterms) {
        const termsAlso = searchterms.also;
        const termsOnly = searchterms.only;

        termsAlso.forEach((term) => {
          $(`a:has(:contains("climate"):contains(${term}))`).each(function () {
            const tempData = storeData($(this), base, name);
            tempData.map((el) => concatInfo.push(el));
          });
        });

        termsOnly.forEach((term) => {
          // $(`a:has(:contains(${term}))`).each(function () {
          $(`a:contains(${term})`).each(function () {
            const tempData = storeData($(this), base, name);
            tempData.map((el) => concatInfo.push(el));
          });
        });
      } else {
        $('a:contains("climate")').each(function () {
          const tempData = storeData($(this), base, name);
          tempData.map((el) => concatInfo.push(el));
        });
      }
      return concatInfo;
    }
    
    const cluster = await Cluster.launch({
      concurrency: Cluster.CONCURRENCY_CONTEXT,
      maxConcurrency: 2,

      puppeteerOptions: {
        headless: true,
        args: ["--no-sandbox", "--disable-setuid-sandbox"],
        userDataDir: "./tmp",
        defaultViewport: false,
      },
    });

    await cluster.task(async ({ page, data }) => {
      await page.goto(data.address);
      await page.waitForSelector("body");
      
      // console.log here prints that data.name is a Promise :(
      const elements = await getElementsCheerio(
        document.body.innerHTML,
        data.base, 
        data.name,
        wordsToSearch
      );
      result.push(elements);
    });

    newspapers.map((newspaper) => {
      console.log("queue" + newspaper); // This logs correctly: queue[object Object]
      cluster.queue(newspaper);
    });

    await cluster.idle();
    await cluster.close();

    // Display final object 
    res.json(result);
  })();
});

module.exports = router;

I don't get any errors, but on screen I get an empty [ ]. Can anyone see what I am doing wrong here? :(

In general, it's an antipattern to mix Puppeteer with another selection library like Cheerio. In addition to being redundant, the extra HTML parser doesn't work on the live document as Puppeteer does, so you have to snapshot the HTML at a particular moment with Puppeteer to capture it as a string and plug that string into Cheerio, where it's re-parsed into a traversable tree structure.

Introducing this extra step creates an opportunity for bugs and confusion to creep in, and that's what happened here.
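For comparison, here is a rough sketch of selecting anchors with Puppeteer alone, so no HTML snapshot or second parser is involved. The names extractLinks and collectLinks are hypothetical helpers, not part of the original code, and the sketch assumes page is a Puppeteer Page that has already navigated to the target URL:

```javascript
// Runs inside the browser via page.$$eval, so it must not rely on
// variables from the surrounding Node scope.
function extractLinks(anchors, term) {
  const needle = term.toLowerCase();
  return anchors
    .filter((a) => a.textContent.toLowerCase().includes(needle))
    .map((a) => ({
      title: a.textContent.replace(/\s+/g, " ").trim(),
      url: a.href, // in the browser, href is already resolved to an absolute URL
    }));
}

// `page` is assumed to be a Puppeteer Page that has already loaded a document.
async function collectLinks(page, term) {
  return page.$$eval("a", extractLinks, term);
}
```

Because the browser resolves a.href against the page URL, a sketch like this would likely not need the manual base + urlRaw concatenation from the question.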

The code

const elements = await getElementsCheerio(
    document.body.innerHTML,
    data.base, 
    data.name,
    wordsToSearch
);

is problematic. document.body.innerHTML doesn't refer to anything related to Puppeteer: the cluster.task callback runs in Node, where there is no browser document. Instead, use Puppeteer's await page.content() to snapshot the HTML.

As a minor point, there's no need for Cheerio functions to be async, because they never use await. It's a fully synchronous API.
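To illustrate the point about async with plain JavaScript, here is a small sketch. countWords is a made-up stand-in for a synchronous Cheerio helper. Marking it async changes nothing about the work it does, but every caller now receives a Promise, which is easy to pass along by accident if an await is forgotten:

```javascript
// Hypothetical stand-in for a synchronous Cheerio helper.
function countWords(text) {
  return text.trim().split(/\s+/).length;
}

// Same logic marked async: callers now get a Promise instead of a number.
async function countWordsAsync(text) {
  return countWords(text);
}

const direct = countWords("climate crisis news");
const wrapped = countWordsAsync("climate crisis news");

console.log(direct); // 3
console.log(wrapped instanceof Promise); // true
wrapped.then((n) => console.log(n)); // 3
```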

Here's a minimal setup for using Cheerio with Puppeteer, assuming you accept the terms and conditions and are sure that introducing this usually unnecessary layer of indirection is appropriate for your use case:

const cheerio = require("cheerio"); // 1.0.0-rc.12
const puppeteer = require("puppeteer"); // ^19.0.0

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://www.example.com";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const html = await page.content();
  const $ = cheerio.load(html);

  // do cheerio stuff synchronously
  console.log($("h1").text()); // => Example Domain
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

It's basically the same for puppeteer-cluster: just drop the lines starting with const html = await page.content(); into the cluster.task callback that operates on page.
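As a rough sketch of what that might look like, the function below takes an already launched puppeteer-cluster instance, the list of sites, and a synchronous extract(html, site) helper. The names scrapeAll and extract are hypothetical; extract is where cheerio.load(html) and the selector logic would go. Rather than pushing into a shared array, it returns each task's value through cluster.execute, which resolves with whatever the task callback returns:

```javascript
// `cluster` is assumed to be a puppeteer-cluster instance from Cluster.launch().
// `extract(html, site)` is a hypothetical synchronous helper, e.g. one that
// calls cheerio.load(html) and returns an array of matches.
async function scrapeAll(cluster, sites, extract) {
  await cluster.task(async ({ page, data }) => {
    await page.goto(data.address, { waitUntil: "domcontentloaded" });
    const html = await page.content(); // snapshot of the live DOM as a string
    return extract(html, data);
  });

  // cluster.execute resolves with the task's return value for each site.
  const perSite = await Promise.all(
    sites.map((site) => cluster.execute(site))
  );
  return perSite.flat();
}
```

In the route handler, this could be awaited directly before calling res.json with the returned array, followed by cluster.idle() and cluster.close() as in the question.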
