Getting DOM node text with Puppeteer and headless Chrome

I'm trying to use headless Chrome and Puppeteer to run our JavaScript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need.

const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();

As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.$('h2.user-card-name').then(function(heading_handle) {
                page.evaluate(function(heading) {
                    return heading.innerText;
                }, heading_handle).then(function(result) {
                    console.info(result);
                    browser.close();
                }, function(error) {
                    console.error(error);
                    browser.close();
                });
            });
        });
    });
});

When I run that, I get this error:

$ node get_user.js 
TypeError: Converting circular structure to JSON
    at Object.stringify (native)
    at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
    at Array.map (native)
    at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
    at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
    at next (native)
    at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
    at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
    at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
    at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)

The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?
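The stack trace above shows JSON.stringify being called on each argument inside evaluationString, which suggests that this build of Puppeteer serializes arguments to JSON before sending them to the page. Below is a minimal sketch of that failure mode in plain Node, with no Puppeteer involved. The names fakeHandle, client, and trySerialize are hypothetical, and the cycle is only an assumption about what a real element handle holds internally.

```javascript
// Hypothetical stand-in for an element handle: it refers to a client
// object that refers back to it, which forms a cycle.
const client = { name: 'client' };
const fakeHandle = { client: client };
client.handle = fakeHandle;

// Return the JSON text, or the error name if serialization fails.
function trySerialize(value) {
    try {
        return JSON.stringify(value);
    } catch (error) {
        return error.name;
    }
}

console.info(trySerialize('h2.user-card-name')); // "h2.user-card-name"
console.info(trySerialize(42));                  // 42
console.info(trySerialize(fakeHandle));          // TypeError
```

Strings and numbers survive the round trip, which matches the behavior described above, while an object that refers back to itself does not.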

I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.$eval('h2.user-card-name', function(heading) {
                return heading.innerText;
            }).then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives me the expected result:

$ node get_user.js 
Don Kirkby top 2% overall

I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.evaluate("$('h2.user-card-name').text()").then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives me a result with the whitespace intact:

$ node get_user.js 

                            Don Kirkby

                                top 2% overall
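The jQuery result keeps the page's indentation and line breaks, unlike the innerText result earlier. If a single-line string is wanted, one option is to collapse the whitespace after the value comes back to Node. This is only a minimal sketch: collapseWhitespace is a hypothetical helper name, and the sample string only approximates the output shown above.

```javascript
// Replace every run of whitespace with one space, then trim the ends.
function collapseWhitespace(text) {
    return text.replace(/\s+/g, ' ').trim();
}

// Approximation of the jQuery .text() output shown above.
const raw = '\n                            Don Kirkby\n\n' +
    '                                top 2% overall\n';

console.info(collapseWhitespace(raw)); // Don Kirkby top 2% overall
```

In the script above, the helper could be applied inside the .then() callback before logging the result.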

In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:

const puppeteer = require('puppeteer');

puppeteer.launch().then(function(browser) {
    browser.newPage().then(function(page) {
        page.goto('https://stackoverflow.com/users/4794').then(function() {
            page.evaluate(function() {
                return $('h2.user-card-name').text();
            }).then(function(result) {
                console.info(result);
                browser.close();
            });
        });
    });
});

That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.
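One way to do both, sketched under the assumption that $eval is enough for the extraction: return each promise so the chain stays flat, and close the browser on both the success path and the failure path. The helper name extractText is hypothetical. It takes the puppeteer module as a parameter only so that the sketch can be exercised without launching Chrome.

```javascript
// Hypothetical helper: resolves with the innerText of the first element
// matching selector, and always closes the browser it launched.
function extractText(puppeteer, url, selector) {
    let browser = null;

    // Close the browser if it was launched; otherwise do nothing.
    function closeBrowser() {
        return browser ? browser.close() : Promise.resolve();
    }

    return puppeteer.launch().then(function(launched) {
        browser = launched;
        return browser.newPage();
    }).then(function(page) {
        return page.goto(url).then(function() {
            return page.$eval(selector, function(element) {
                return element.innerText;
            });
        });
    }).then(function(text) {
        // Success: close first, then pass the text along.
        return closeBrowser().then(function() {
            return text;
        });
    }, function(error) {
        // Failure: close first, then rethrow for the caller to handle.
        return closeBrowser().then(function() {
            throw error;
        });
    });
}

// Usage (assumes puppeteer is installed):
// extractText(require('puppeteer'),
//             'https://stackoverflow.com/users/4794',
//             'h2.user-card-name').then(console.info, console.error);
```

The sketch avoids Promise.prototype.finally and async functions because the project in the question runs on Node 6.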

Using async/await and $eval, the syntax looks like the following:

await page.goto('https://stackoverflow.com/users/4794')
// The callback receives a DOM element, so read innerText; text() is a jQuery method.
const nameElement = await page.$eval('h2.user-card-name', el => el.innerText)
console.log(nameElement)

I use page.$eval:

const text = await page.$eval('h2.user-card-name', el => el.innerText );
console.log(text);
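For the case mentioned earlier, where the text of several nodes is needed, Puppeteer also provides page.$$eval(), which passes an array of all matching elements to the function. A minimal sketch follows. The name collectText is hypothetical, and because the function runs inside the page it should not refer to variables from the Node script.

```javascript
// Runs inside the page when passed to page.$$eval(). It receives an array
// of matching elements and must return something serializable.
function collectText(elements) {
    return elements.map(function(element) {
        return element.innerText.trim();
    });
}

// Local check with stand-in objects instead of real DOM elements.
const fakeElements = [
    { innerText: ' Don Kirkby ' },
    { innerText: 'top 2% overall' }
];
console.info(collectText(fakeElements)); // logs the two trimmed strings

// Usage inside an async function (assumes a Puppeteer page object):
// const texts = await page.$$eval('h2.user-card-name', collectText);
```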

I had success using the following:

const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  await page.goto(url);
  await page.waitFor(2000);
  let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
  console.log(html_content);
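} catch (err) {
  console.log(err);
} finally {
  // Close the browser on success and on failure so the Node process can exit.
  await browser.close();
}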

Hope it helps.
