
JSDOM: dom.window.document.innerHTML is undefined

I am creating a Node.js script to parse content from a website. Before I work with the returned HTML, I want to remove a few elements and attributes. However, when I try to retrieve the HTML from jsdom, I only get undefined. This happens even before I make any modifications to the HTML. How can I use jsdom to modify the HTML and return it?

const jsdom = require('jsdom');
...
var htmlString = `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html lang=en><head>...`
parseHTML(htmlString);

function parseHTML(htmlString) {
    const dom = new jsdom.JSDOM(htmlString);

    console.log(dom.window.document.innerHTML); // This returns undefined

    dom.window.document.querySelectorAll('script').forEach(element => element.remove());
    dom.window.document.querySelectorAll('head').forEach(element => element.remove());
    dom.window.document.querySelectorAll('link').forEach(element => element.remove());
    dom.window.document.querySelectorAll('style').forEach(element => element.remove());
    dom.window.document.querySelectorAll('iframe').forEach(element => element.remove());

    dom.window.document.querySelectorAll('noscript').forEach((element) => {
        var replacement = dom.window.document.createElement('div');
        replacement.setAttribute('class', 'noscript');
        replacement.innerHTML = element.innerHTML;
        element.parentNode.replaceChild(replacement, element);
    });

    dom.window.document.querySelectorAll('img[src]').forEach((element) => {
        const src = element.getAttribute('src');
        element.setAttribute('data-src', src);
        element.removeAttribute('src');
    });

    dom.window.document.querySelectorAll('[style]').forEach((element) => {
        element.removeAttribute('style');
    });

    return dom.window.document.innerHTML; // This also returns undefined
}

Just like on the front-end, the document has no innerHTML property:

 console.log(document.innerHTML);

However, the document.documentElement does have it:

 console.log(document.documentElement.innerHTML);

jsdom works the same way. Add .documentElement to the document access, e.g.

console.log(dom.window.document.documentElement.innerHTML);

results in:

<head></head><body>...</body>


 