
HTML not returned in Node.js Puppeteer

Error

Cannot read property 'querySelectorAll' of null

I am scraping this site. When I run the line below in the browser console, it gives me the HTML, but when I scrape the HTML with Puppeteer, it gives me an error.

document.querySelectorAll('#stroke-play-container > .stroke-play-leaderboard > .the-leaderboard.with-rolex > table.leaderboard.leaderboard-table.large')[0].nextSibling;

Code

'use strict';

 const puppeteer = require('puppeteer');
 function run() {
 return new Promise(async (resolve, reject) => {
    try {


        const browser = await puppeteer.launch({
        headless : false
        });

        const page = await browser.newPage();

        await page.goto("https://www.pgatour.com/leaderboard.html");

        await page.evaluate(`window.scrollTo(0, document.body.scrollHeight)`);
        await page.waitFor(5000);
    
        let urls = await page.evaluateHandle(() => {
            let results = [];
            var parser = new DOMParser();
            
            var node = document.querySelectorAll('#stroke-play-container > .stroke-play-leaderboard > .the-leaderboard.with-rolex > table.leaderboard.leaderboard-table.large')[0].nextSibling;
           
            if(node){

            var $ = parser.parseFromString(node, 'text/html');
            
          
            return {
                name: $.querySelectorAll('table > tbody:nth-child(1) > tr > td.player-name > div > div.player-name-col').innerText
            };
            }
            else{
                return 'error';
            }

        })
        browser.close();
        return resolve(urls);
    } catch (e) {
        return reject(e);
    }
})
}
 run().then(console.log).catch(console.error);

Try it like this:

let names = await page.evaluate(() => {
  let css = '.the-leaderboard.with-rolex > table.leaderboard.leaderboard-table.large + div div.player-name-col'
  let divs = [...document.querySelectorAll(css)]
  return divs.map(div => div.innerText)
})

I'm not sure what you were trying to accomplish with DOMParser; you shouldn't ever need to use that.
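A minimal sketch of the same idea, assuming the leaderboard rows are rendered asynchronously: wait for the selector first, then read the text with `page.$$eval`. The helper name `getInnerTexts` and the 10-second default timeout are illustrative assumptions, not part of the original code.

```javascript
// Hypothetical helper: waits until at least one element matches `css`,
// then returns the trimmed innerText of every match.
async function getInnerTexts(page, css, timeout = 10000) {
  // waitForSelector rejects with a timeout error if nothing matches in time,
  // which is easier to debug than a null dereference inside evaluate().
  await page.waitForSelector(css, { timeout });
  // $$eval runs document.querySelectorAll(css) in the page and passes the
  // matched elements to the callback as an array.
  return page.$$eval(css, els => els.map(el => el.innerText.trim()));
}
```

Inside `run` it could be called as `const names = await getInnerTexts(page, css)`, with `css` set to the selector shown above.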

EDIT: As pointed out in the comments, please be mindful of the Terms of Service of pgatour.com, which do not allow scraping, crawling, etc. The solution below is only intended to illustrate how to solve the generic technical point behind your question.

I think this might be due to the default viewport size Puppeteer uses (800×600). The website hides the content you are looking for at smaller resolutions, hence the problem.

What made this work for me was specifying the viewport size explicitly, like so:

await page.setViewport({ width: 1200, height: 1000 });

So your code would become:

'use strict';

 const puppeteer = require('puppeteer');
 function run() {
 return new Promise(async (resolve, reject) => {
    try {


        const browser = await puppeteer.launch({
        headless : false
        });

        const page = await browser.newPage();
        // setViewport returns a promise, so wait for it before navigating
        await page.setViewport({ width: 1200, height: 1000 });


        await page.goto("https://www.pgatour.com/leaderboard.html");

        await page.evaluate(`window.scrollTo(0, document.body.scrollHeight)`);
        await page.waitFor(5000);
    
        let urls = await page.evaluateHandle(() => {
            let results = [];
            var parser = new DOMParser();
            
            var node = document.querySelectorAll('#stroke-play-container > .stroke-play-leaderboard > .the-leaderboard.with-rolex > table.leaderboard.leaderboard-table.large')[0].nextSibling;
           
            if(node){

            var $ = parser.parseFromString(node, 'text/html');
            
          
            return {
                name: $.querySelectorAll('table > tbody:nth-child(1) > tr > td.player-name > div > div.player-name-col').innerText
            };
            }
            else{
                return 'error';
            }

        })
        browser.close();
        return resolve(urls);
    } catch (e) {
        return reject(e);
    }
})
}
run().then(console.log).catch(console.error);
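For comparison, here is a minimal sketch of the viewport fix written as a plain async function, which avoids wrapping async code in `new Promise`. The helper name `openPage` and its default size are assumptions for illustration; `browser` is whatever `puppeteer.launch()` returned.

```javascript
// Hypothetical helper: opens a page with an explicit viewport, then navigates.
async function openPage(browser, url, viewport = { width: 1200, height: 1000 }) {
  const page = await browser.newPage();
  // Set the size before navigating so the page is first rendered at this size.
  await page.setViewport(viewport);
  await page.goto(url);
  return page;
}
```

It could be used as `const page = await openPage(browser, "https://www.pgatour.com/leaderboard.html");` in place of the separate `newPage`, `setViewport` and `goto` calls.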

