
Run JavaScript in clean chrome/puppeteer context

I am attempting to run JavaScript on a page context with content scraping as the goal. With puppeteer, I can easily call evaluate() and run a piece of JavaScript inside the context of the page. So I basically just evaluate a document.querySelector on the page:

const puppeteer = require('puppeteer');
const url = 'file:///C:/Users/roel/puppettest/index.html';

puppeteer.launch({headless: false}).then(async browser => {
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: 'domcontentloaded'});
    const value = await page.evaluate(() => document.querySelector('div').textContent);
    if (value === 'Hello') {
        console.log('Works');
    } else {
        console.log('Nope :-(');
    }
});

And this is the page that I referred to:

<html>
    <body>
        <div>Hello</div>
        <script>
            var div = document.createElement('div');
            div.textContent = 'Whooh!';
            document.body.appendChild(div);
            document.querySelector = null;
        </script>
    </body>
</html>

So here's the problem: the code I evaluated calls document.querySelector, but the page that I loaded set that to null. Chaos ensues. So... I want to ensure that the JavaScript I run is run in a clean context.

First Idea:

I can just retrieve the generated HTML and create a new JavaScript context around the DOM. Run a page.content() to retrieve the HTML and... Oh, it's not the current HTML, it's the initial HTML (e.g. the document.createElement() didn't execute). Running a page.evaluate(() => document.body.innerHTML) would work, assuming the page didn't use Object.defineProperty to redefine the body property of document. But there is no such guarantee. Is there a way to retrieve the current HTML without touching the JS context?

Second Idea:

Chrome extensions run in their own JavaScript context with access to the DOM, and only the DOM, which is exactly what I'm after. Looking over the puppeteer documentation, I see no indication of a way to create such a context within puppeteer itself. Or is there one and I missed it?

...

So how do I go about getting a clean JS context to run queries against?

EDIT: I read the output of .content() wrong. The HTML generated by the script is included, so the first idea does work. I'm still curious whether the second idea is achievable, as I would much prefer it.
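For reference, one way the second idea might be approached (this is not from the original thread) is the Chrome DevTools Protocol method Page.createIsolatedWorld, which creates a separate JavaScript world over the same DOM, much like the one extension content scripts get. The sketch below is untested against the sample page and assumes a puppeteer version where page.target().createCDPSession() is available. The helper name evaluateInIsolatedWorld and the world name 'scraper' are made up for illustration.

```javascript
// Sketch only: evaluate an expression in a fresh isolated world of the main frame.
// `evaluateInIsolatedWorld` and the world name 'scraper' are hypothetical.
async function evaluateInIsolatedWorld(page, expression) {
    // Open a raw DevTools Protocol session for this page.
    const client = await page.target().createCDPSession();
    try {
        // The isolated world has to be attached to a frame; use the main frame.
        const { frameTree } = await client.send('Page.getFrameTree');
        const { executionContextId } = await client.send('Page.createIsolatedWorld', {
            frameId: frameTree.frame.id,
            worldName: 'scraper',
        });
        // Evaluate in that world, so built-ins overwritten by the page's own
        // scripts (e.g. document.querySelector = null) should not be visible.
        const { result, exceptionDetails } = await client.send('Runtime.evaluate', {
            expression,
            contextId: executionContextId,
            returnByValue: true,
        });
        if (exceptionDetails) {
            throw new Error(exceptionDetails.text);
        }
        return result.value;
    } finally {
        await client.detach();
    }
}
```

With the script from the question, it could be called as: const value = await evaluateInIsolatedWorld(page, "document.querySelector('div').textContent");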

You can use .content() to retrieve the HTML at the current point in time. The pre-edit question wrongly assumed that .content() returned the original HTML. Loading the resulting HTML into jsdom lets you run JS against the DOM without being affected by the original page's scripts.
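A minimal sketch of that approach, assuming the puppeteer and jsdom packages are installed. The helper names firstDivText and scrape are hypothetical, and JSDOM is passed in as a parameter only to keep the parsing step easy to try on its own. jsdom does not execute script elements unless asked to, so the page's code never runs inside the snapshot.

```javascript
// Sketch: snapshot the live DOM with page.content(), then query the snapshot
// in jsdom, whose JavaScript environment is separate from the scraped page.
// `firstDivText` and `scrape` are hypothetical helper names.

// Parse an HTML snapshot and read the text of the first <div>.
// jsdom leaves <script> elements unexecuted by default, so
// `document.querySelector = null` in the page source has no effect here.
function firstDivText(html, JSDOM) {
    const dom = new JSDOM(html);
    const div = dom.window.document.querySelector('div');
    return div ? div.textContent : null;
}

// Load the page in Chrome, take the HTML snapshot, and query it in jsdom.
async function scrape(url) {
    const puppeteer = require('puppeteer');
    const { JSDOM } = require('jsdom');
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url, {waitUntil: 'domcontentloaded'});
        return firstDivText(await page.content(), JSDOM);
    } finally {
        await browser.close();
    }
}
```

It could be called as scrape('file:///C:/Users/roel/puppettest/index.html').then(console.log). The trade-off is that the snapshot is static: anything the page renders after the call to .content() is not reflected in the jsdom copy.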

