Why does headless need to be false for Puppeteer to work?

I'm creating a web API that scrapes a given URL and sends the result back. I am using Puppeteer to do this. I asked this question: "Puppeteer not behaving like in Developer Console"

and received an answer suggesting it would only work if headless was set to false. I don't want to be constantly opening a browser UI I don't need (I just need the data!), so I'd like to know why headless has to be false, and whether there is a fix that lets me use headless: true.

Here's my code:

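const express = require("express");
const puppet = require("puppeteer");

const PORT = process.env.PORT || 3000; // assumed default; PORT was not defined in the original snippet

express()
  .get("/*", async (req, res) => {
    global.notBaseURL = req.params[0];
    console.log(req.params[0]);
    let browser;
    try {
      browser = await puppet.launch({ headless: false }); // Line of Interest
      const page = await browser.newPage();
      await page.goto(req.params[0], { waitUntil: "networkidle2" }); // this is the url
      const title = await page.$eval("title", (el) => el.innerText);
      res.send({ title: title });
    } catch (err) {
      // Report navigation or launch failures instead of leaving the request hanging
      res.status(500).send({ error: err.message });
    } finally {
      if (browser) await browser.close(); // always release the browser
    }
  })
  .listen(PORT, () => console.log(`Listening on ${PORT}`));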

The reason it might work in UI mode but not headless is that sites that aggressively fight scraping will detect that you are running a headless browser.
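One signal such sites can read is the user agent: headless Chrome's default user agent string has historically contained the token HeadlessChrome, and browsers launched by automation tools typically report navigator.webdriver as true. The following is a minimal sketch for seeing what a page script would see, assuming the puppeteer package is installed. The names looksHeadless and inspectFingerprint are hypothetical helpers, not part of Puppeteer, and real detection scripts check many more signals than these two.

```javascript
// Returns true when a user agent string carries the "HeadlessChrome" token.
function looksHeadless(userAgent) {
  return /HeadlessChrome/i.test(userAgent);
}

// Launches a browser and reports two signals a page script could read.
// Assumes `npm install puppeteer`; required lazily so looksHeadless()
// can be used on its own.
async function inspectFingerprint(headless) {
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch({ headless });
  try {
    const page = await browser.newPage();
    // Runs inside the page, exactly where a detection script would run.
    const signals = await page.evaluate(() => ({
      userAgent: navigator.userAgent,
      webdriver: navigator.webdriver,
    }));
    return { ...signals, looksHeadless: looksHeadless(signals.userAgent) };
  } finally {
    await browser.close();
  }
}
```

Calling inspectFingerprint(true) and inspectFingerprint(false) and comparing the two results shows which of these signals changes between the modes on your machine.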

Some possible workarounds:

Use puppeteer-extra

Found here: https://github.com/berstend/puppeteer-extra Check out their docs for how to use it. It has a couple of plugins that might help in getting past headless-mode detection:

  1. puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
  2. puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless. There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
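A minimal sketch of the stealth plugin in use, assuming puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth are all installed from npm. The helpers toHttpUrl, getStealthPuppeteer and scrapeTitle are hypothetical names introduced for illustration; whether the plugin is enough for a particular site is not something this sketch can promise.

```javascript
// Validates a user-supplied address and restricts it to http(s).
function toHttpUrl(raw) {
  const url = new URL(raw); // throws TypeError on malformed input
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }
  return url.href;
}

let stealthPuppeteer; // cached so the plugin is registered only once

// Assumes: npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
function getStealthPuppeteer() {
  if (!stealthPuppeteer) {
    const puppeteer = require("puppeteer-extra");
    const StealthPlugin = require("puppeteer-extra-plugin-stealth");
    puppeteer.use(StealthPlugin()); // register the evasions before launching
    stealthPuppeteer = puppeteer;
  }
  return stealthPuppeteer;
}

// Launches headless with the stealth plugin and returns the page title.
async function scrapeTitle(rawUrl) {
  const url = toHttpUrl(rawUrl);
  const browser = await getStealthPuppeteer().launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await browser.close();
  }
}
```

In the Express handler from the question, the body of the route could then reduce to something like res.send({ title: await scrapeTitle(req.params[0]) }).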

Run a "real" Chromium instance/UI

It's possible to run a single browser UI in a manner that lets you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0

Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222 (or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance: const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL }); Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions

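The ENDPOINT_URL is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222 option. Note that the address Chrome prints there is a WebSocket URL (a line beginning "DevTools listening on ws://..."); that form is passed as browserWSEndpoint, whereas browserURL expects the HTTP form, such as http://127.0.0.1:9222.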
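A minimal sketch of the connect approach, assuming a browser was already started on the same machine with --remote-debugging-port=9222 and that the puppeteer package is installed. The helpers debuggerUrl and titleViaRunningBrowser are hypothetical names, and the host and port defaults are assumptions to adjust for your environment.

```javascript
// Builds the HTTP address that puppeteer.connect() accepts as browserURL.
function debuggerUrl(port = 9222, host = "127.0.0.1") {
  return `http://${host}:${port}`;
}

// Attaches to the already-running browser, reads a page title, then detaches.
async function titleViaRunningBrowser(url, port = 9222) {
  const puppeteer = require("puppeteer"); // assumes `npm install puppeteer`
  const browser = await puppeteer.connect({ browserURL: debuggerUrl(port) });
  const page = await browser.newPage();
  try {
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await page.close(); // close only the tab this function opened
    await browser.disconnect(); // detach without shutting the browser down
  }
}
```

Using disconnect rather than close keeps the shared browser alive for the next request, which is the point of running a single long-lived instance.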

This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)

There are other strategies I'm sure, but those are the two I'm most familiar with. Good luck!

Todd's answer is thorough, but before resorting to some of its recommendations it's worth trying the following user agent line, pulled from the relevant Puppeteer GitHub issue "Different behavior between { headless: false } and { headless: true }":

await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
await page.goto(yourURL);

Now, the Nordstrom site provided by OP seems to be able to detect robots even with headless: false, at least at the present moment. But other sites are less strict, and I've found the above line to be useful on some of them.

Visit the GH issue thread above for other ideas.
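A hard-coded user agent pins an old Chrome version, which can itself look unusual over time. As a variation on the same idea (not something taken from the GitHub issue), the sketch below derives the string from the browser's own default user agent and drops only the headless marker. It assumes the default string contains the token HeadlessChrome; stripHeadlessMarker and newPageWithPlainUserAgent are hypothetical helper names.

```javascript
// Replaces the "HeadlessChrome" token with "Chrome", leaving the rest intact.
function stripHeadlessMarker(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// Opens a page whose user agent matches the real browser version
// but without the headless marker. `browser` is a Puppeteer Browser.
async function newPageWithPlainUserAgent(browser) {
  const page = await browser.newPage();
  const defaultUa = await browser.userAgent();
  await page.setUserAgent(stripHeadlessMarker(defaultUa));
  return page;
}
```

The page returned by newPageWithPlainUserAgent(browser) can then be used with page.goto(yourURL) exactly as in the snippet above.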
