How to return an element that isn't in the page source using Puppeteer

I'm trying to return some information from a page, using the following code to select a page element and return some values within it:

const puppeteer = require('puppeteer');

function run (numberOfPages) {
    return new Promise(async (resolve, reject) => {
        try {
            if (!numberOfPages) {
                numberOfPages = 1;
            }
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.setRequestInterception(true);
            page.on('request', (request) => {
                if (request.resourceType() === 'document') {
                    request.continue();
                } else {
                    request.abort();
                }
            });
            await page.goto('https://careers.google.com/jobs/results/');
            let currentPage = 1;
            let urls=[];
            while (currentPage <= numberOfPages) {
                await page.waitForSelector('a.gc-card');
                let newUrls = await page.evaluate(() => {
                    let results = [];
                    let items = document.querySelectorAll('a.gc-card');
                    items.forEach((item) => {
                        results.push({
                            jobTitle: item.innerText,
                            url: item.getAttribute('href')
                        });
                    });
                    return results;
                });
                urls = urls.concat(newUrls);
                if (currentPage < numberOfPages) {
                    // Classes on the same element are chained with dots; a space would mean "descendant".
                    const nextLink = 'a.gc-link.gc-link--on-grey.gc-action-group__item.gc-h-larger-tap-target';
                    await page.waitForSelector(nextLink);
                    await page.click(nextLink);
                    await page.waitForSelector(nextLink);
                }
                currentPage++;
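                // page.waitFor is deprecated in newer Puppeteer releases; a plain timer does the same job.
                await new Promise((done) => setTimeout(done, 500));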
            }
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run(1).then(console.log).catch(console.error);

Using Inspect in dev tools, I can see that the class gc-card is present in the DOM once the page has loaded, but for some reason await page.waitForSelector('a.gc-card'); times out every time I run the code. I'm not sure why, but I think it could be because most of the page body is loaded through a script.

The desired outcome is to return an array with all the job titles and URLs on the page.

Your request handler aborts every request that is not a document, including all the JavaScript files the site needs to run.

page.on('request', (request) => {
   if (request.resourceType() === 'document') {
      request.continue();
   } else {
      request.abort();
   }
});

Instead of allowing only document requests, invert the logic: let everything through by default and abort only the requests you are sure you won't need.
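As a rough sketch of that blocklist approach: the set of blocked types below (image, media, font) is my own guess at what this page can live without, and scrapeJobCards is a hypothetical helper name, so check both against the real site before relying on them.

```javascript
// Resource types this sketch assumes the page can do without (an assumption;
// adjust after checking which requests the site really needs).
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font']);

// Abort only blocklisted requests; let everything else (documents, scripts,
// XHR/fetch calls) through so the page can build its content.
function handleRequest(request) {
    if (BLOCKED_RESOURCE_TYPES.has(request.resourceType())) {
        request.abort();
    } else {
        request.continue();
    }
}

// Hypothetical helper showing where the handler plugs in.
async function scrapeJobCards(url) {
    const puppeteer = require('puppeteer'); // loaded lazily, only when scraping
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.setRequestInterception(true);
        page.on('request', handleRequest);
        await page.goto(url);
        await page.waitForSelector('a.gc-card');
        // Collect the title and link of every job card currently in the DOM.
        return await page.evaluate(() =>
            Array.from(document.querySelectorAll('a.gc-card'), (item) => ({
                jobTitle: item.innerText,
                url: item.getAttribute('href')
            }))
        );
    } finally {
        await browser.close();
    }
}
```

It could then be called the same way as in the question, e.g. scrapeJobCards('https://careers.google.com/jobs/results/').then(console.log).catch(console.error);. If the selector still times out, shrink the blocklist further before looking elsewhere.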
