Why does headless need to be false for Puppeteer to work?
I'm creating a web API that scrapes a given URL and sends the result back. I am using Puppeteer to do this.
I asked this question: Puppeteer not behaving like in Developer Console
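const express = require("express");
const puppet = require("puppeteer");
const PORT = process.env.PORT || 3000; // assumed; PORT is not defined in the original snippet
express()
  .get("/*", async (req, res) => {
    const url = req.params[0]; // this is the url to scrape
    global.notBaseURL = url; // kept from the original snippet
    console.log(url);
    let browser;
    try {
      browser = await puppet.launch({ headless: false }); // Line of Interest
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: "networkidle2" });
      const title = await page.$eval("title", (el) => el.innerText);
      res.send({ title: title });
    } catch (err) {
      // report the failure instead of leaving the request hanging
      res.status(500).send({ error: err.message });
    } finally {
      // always release the browser, even when navigation fails
      if (browser) await browser.close();
    }
  })
  .listen(PORT, () => console.log(`Listening on ${PORT}`));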
The reason it might work in UI mode but not headless is that sites that aggressively fight scraping will detect that you are running in a headless browser.
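To make the detection idea concrete, here is a minimal sketch of the kind of check a site might run. It is an illustration only, not any particular site's logic: the function name `looksAutomated` and the shape of the `signals` object are hypothetical. It leans on two commonly cited signals: headless Chrome's default user agent contains `HeadlessChrome`, and automated browsers typically expose `navigator.webdriver` as `true`.

```javascript
// Hypothetical server-side heuristic; real anti-bot systems combine many more signals.
function looksAutomated(signals) {
  const userAgent = signals.userAgent || "";
  // Headless Chrome's default user agent advertises itself as "HeadlessChrome".
  if (userAgent.includes("HeadlessChrome")) return true;
  // navigator.webdriver is typically true when the browser is driven by automation.
  if (signals.webdriver === true) return true;
  return false;
}

// Example signals resembling a default headless session.
console.log(
  looksAutomated({
    userAgent:
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/85.0.4182.0 Safari/537.36",
    webdriver: true,
  })
); // prints true
```

The second check may be one reason a site can still flag a Puppeteer session that runs with a visible window, since that signal does not depend on headless mode.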
Some possible workarounds:
puppeteer-extra
Found here: https://github.com/berstend/puppeteer-extra Check out their docs for how to use it. It has a couple of plugins that might help in getting past headless-mode detection:
puppeteer-extra-plugin-anonymize-ua -- anonymizes your User Agent. Note that this might help with getting past headless mode detection, but as you'll see if you visit https://amiunique.org/ it is unlikely to be enough to keep you from being identified as a repeat visitor.
puppeteer-extra-plugin-stealth -- this might help win the cat-and-mouse game of not being detected as headless.
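As a rough sketch of how the stealth plugin is usually wired in, assuming `puppeteer`, `puppeteer-extra` and `puppeteer-extra-plugin-stealth` are installed from npm: the helper names `loadStealthPuppeteer` and `fetchTitle` are made up for this example, and the plugin's own docs remain the authority on its options.

```javascript
// Sketch only. The packages are required lazily inside the function so this
// file still loads on a machine where they are not installed.
function loadStealthPuppeteer() {
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");
  // Register the plugin once; it applies its evasions to pages opened afterwards.
  puppeteer.use(StealthPlugin());
  return puppeteer;
}

// Launch a headless browser with the given puppeteer object, load the URL,
// and return the page title. The browser is closed even if navigation fails.
async function fetchTitle(puppeteer, url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.title();
  } finally {
    await browser.close();
  }
}

// Usage (call loadStealthPuppeteer() once at startup):
//   const puppeteer = loadStealthPuppeteer();
//   fetchTitle(puppeteer, "https://example.com").then(console.log);
```

Passing the puppeteer object in as a parameter keeps `fetchTitle` identical whether or not the plugin is in use, which makes it easy to compare the two against a given site.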
There are many tricks that are employed to detect headless mode, and as many tricks to evade them.
It's possible to run a single browser UI in a manner that lets you attach puppeteer to that running instance. Here's an article that explains it: https://medium.com/@jaredpotter1/connecting-puppeteer-to-existing-chrome-window-8a10828149e0
Essentially you're starting Chrome or Chromium (or Edge?) from the command line with --remote-debugging-port=9222 (or any old port?) plus other command line switches depending on what environment you're running it in. Then you use puppeteer to connect to that running instance instead of having it do the default behavior of launching a headless Chromium instance:
const browser = await puppeteer.connect({ browserURL: ENDPOINT_URL });
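A slightly fuller sketch of the connect step, assuming `puppeteer` is installed and a browser is already running with remote debugging enabled. The helper names `connectOptionsFor` and `withRunningBrowser` are hypothetical. `puppeteer.connect` accepts either `browserURL` (an http address such as `http://127.0.0.1:9222`) or `browserWSEndpoint` (a `ws://` address), so the sketch picks the option from the scheme.

```javascript
// Choose the puppeteer.connect() option that matches the endpoint's scheme.
function connectOptionsFor(endpoint) {
  // ws:// and wss:// addresses are WebSocket endpoints; anything else is
  // treated as the browser's http debugging address.
  return /^wss?:\/\//.test(endpoint)
    ? { browserWSEndpoint: endpoint }
    : { browserURL: endpoint };
}

// Attach to the running browser, run `task(browser)`, then detach again.
async function withRunningBrowser(endpoint, task) {
  // Required lazily so the file loads even where puppeteer is not installed.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.connect(connectOptionsFor(endpoint));
  try {
    return await task(browser);
  } finally {
    // disconnect() detaches Puppeteer but leaves the browser itself running.
    await browser.disconnect();
  }
}

// Usage, assuming the browser was started with --remote-debugging-port=9222:
//   withRunningBrowser("http://127.0.0.1:9222", async (browser) => {
//     const page = await browser.newPage();
//     await page.goto("https://example.com");
//     return page.title();
//   }).then(console.log);
```

Using `disconnect()` rather than `close()` matters here: closing would shut down the shared browser window that other requests may still be using.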
Read the puppeteer docs here for more info: https://pptr.dev/#?product=Puppeteer&version=v5.2.1&show=api-puppeteerlaunchoptions
The ENDPOINT_URL is displayed in the terminal when you launch the browser from the command line with the --remote-debugging-port=9222 option.
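For illustration, Chrome normally prints a line of the form `DevTools listening on ws://...` when started with that switch. The sketch below pulls the address out of such a line; the function name `parseDevToolsEndpoint` and the sample identifier are made up for the example. Note that the printed value is a WebSocket address, which `puppeteer.connect` takes as `browserWSEndpoint`, whereas `browserURL` expects the http form such as `http://127.0.0.1:9222`.

```javascript
// Extract the WebSocket debugging address from the browser's terminal output.
// Returns null when the expected line is not present.
function parseDevToolsEndpoint(terminalOutput) {
  const match = /DevTools listening on (wss?:\/\/\S+)/.exec(terminalOutput);
  return match ? match[1] : null;
}

// Sample output line; the trailing identifier is a placeholder, not a real one.
const sample =
  "DevTools listening on ws://127.0.0.1:9222/devtools/browser/00000000-0000-0000-0000-000000000000";
console.log(parseDevToolsEndpoint(sample));
```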
This option is going to require some server/ops mojo, so be prepared to do a lot more Stack Overflow searches. :-)
There are other strategies, I'm sure, but those are the two I'm most familiar with. Good luck!
Todd's answer is thorough, but before resorting to some of the recommendations there, it's worth trying the following user agent line, pulled from the relevant Puppeteer GitHub issue "Different behavior between { headless: false } and { headless: true }":
await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
await page.goto(yourURL);
Now, the Nordstrom site provided by OP seems to be able to detect robots even with headless: false, at least at the present moment. But other sites are less strict, and I've found the above line to be useful on some of them.
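A hard-coded user agent string goes stale as Chrome versions move on. One possible variation, sketched here under the assumption that the only thing to hide is the `HeadlessChrome` token, is to take the browser's own user agent and rewrite that token. The helper names `stripHeadless` and `useRegularUserAgent` are hypothetical; `browser.userAgent()` and `page.setUserAgent()` are the Puppeteer calls involved.

```javascript
// Replace the "HeadlessChrome" product token with plain "Chrome".
function stripHeadless(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// Apply the rewritten user agent to a page before navigating.
async function useRegularUserAgent(browser, page) {
  const defaultUserAgent = await browser.userAgent();
  await page.setUserAgent(stripHeadless(defaultUserAgent));
}

// Usage, inside an async function with an open browser and page:
//   await useRegularUserAgent(browser, page);
//   await page.goto(yourURL);
```

This only changes the user agent string, so sites that look at other signals will not be affected by it.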
Visit the GH issue thread above for other ideas.