
Request (npm package) not returning full HTML

I'm using Request to get the HTML of a web page. When I do this for http://orangina.eu/, it returns only some of the HTML. I noticed that what it returns matches what Chrome shows under "View Page Source", not the full HTML that "Inspect" shows. My guess is that Request gets the HTML before additional HTML is loaded via JavaScript. I reviewed the Request documentation and didn't see anything about this.

Why is this happening, and is there a way to get the full HTML (using Request or any other package)? Thanks.

Thanks, Andy. Andy answered the question in a comment, but I'll add it here so that the question is officially answered, along with some more detail that I learned after following his lead. The npm package Puppeteer solves this problem. It lets you run a headless Chrome browser from within your Node app.

There's one thing I learned while using Puppeteer to get the http://orangina.eu/ HTML that I want to share. You'll notice that the site takes a couple of seconds to load. So if you use this code:

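const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://orangina.eu/');

  // Reads the HTML right after navigation, without any extra wait.
  console.log(await page.content());
  await page.screenshot({path: 'screenshot.png'});

  await browser.close();
})();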

you will get the same thing that I was getting with Request: a small portion of the eventual HTML. This is because both grab the HTML before the page's scripts have finished adding content. Fortunately, Puppeteer has the option to wait before getting the content. I looked to see if Request has this and did not find anything. Here's the code that gets all of the HTML; notice the 5 second wait:

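const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://orangina.eu/');

  // Wait 5 seconds so the page's scripts can finish adding content.
  // The original answer used page.waitFor(5000), which newer Puppeteer
  // releases no longer provide; a plain timer does the same job.
  await new Promise((resolve) => setTimeout(resolve, 5000));

  console.log(await page.content());
  await page.screenshot({path: 'screenshot.png'});

  await browser.close();
})();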
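A fixed 5 second wait is simple, but it wastes time on fast loads and can still cut off slow ones. As one possible alternative, the sketch below polls until the HTML stops changing. `waitForStableHtml` and its options are hypothetical names, not part of Puppeteer; the helper accepts any async function that returns the current HTML, and the default timings are arbitrary.

```javascript
// Hypothetical helper (not a Puppeteer API): poll getHtml() until the result
// is unchanged for `stableChecks` consecutive polls, or until `timeoutMs` passes.
async function waitForStableHtml(
  getHtml,
  { intervalMs = 500, stableChecks = 3, timeoutMs = 15000 } = {}
) {
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  const deadline = Date.now() + timeoutMs;

  let previous = await getHtml();
  let unchanged = 0;

  while (unchanged < stableChecks && Date.now() < deadline) {
    await sleep(intervalMs);
    const current = await getHtml();
    // Count consecutive identical snapshots; any change resets the count.
    unchanged = current === previous ? unchanged + 1 : 0;
    previous = current;
  }

  // Returns the latest snapshot, even if the timeout was reached first.
  return previous;
}
```

With Puppeteer, this would stand in for the fixed wait as `const html = await waitForStableHtml(() => page.content());`. Identical snapshots only show that nothing changed during the polling window, not that the page is truly finished. Puppeteer's `page.goto` also accepts a `waitUntil` option (for example `'networkidle0'`), which may be enough on its own for some sites.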
