
JSDOM: Access divs inside iframe

I am trying to scrape some information from the site http://www.example.com, which has the following HTML:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>My site</title>
</head>
<body>
<div id="one">
    <div>
        <iframe>
           <!DOCTYPE html>
           <html>
           <head>
             <meta charset="utf-8">
             <title>My site</title>
           </head>
           <body>
             <div id="hello">
               <a href="http://example.net/somepage"><img src="http://example.net/dokuro_chan.jpg"></a>
             </div>
           </body>
           </html>
        </iframe>
    </div>
</div>
<div id="two">
    <div>
        <iframe>
           <!DOCTYPE html>
           <html>
           <head>
             <meta charset="utf-8">
             <title>My site</title>
           </head>
           <body>
             <div id="hello">
               <a href="http://example.net/somepage2"><img src="http://example.net/dokuro_chan2.jpg"></a>
             </div>
           </body>
           </html>
        </iframe>
    </div>
</div>
</body>
</html>

I then try to scrape the iframe content from Node.js using jsdom:

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

JSDOM.fromURL("http://www.example.com",{
        resources: "usable",
        runScripts: "dangerously"
}).then(dom =>{

        const divIds=["#one","#two"]

        divIds.forEach((divId)=> {
            const selector=googleAdSelector(divId)
            const iframe=dom.window.document.querySelector(selector)
            console.log("Iframe Object", iframe)
        })
        // callback(null,dom)
})

const googleAdSelector=function(divId){
        return divId+" > div > iframe";
}

What I want to achieve is to get the href and src values that are inside the iframes.

But for some reason the output is:

Iframe Object null

Iframe Object null

Do you have any idea how to access the HTML inside the iframe?

You need to approach it differently. Use a headless browser, capture the data from the network requests made during page load, and process it separately.
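
For cases where the frame's document is reachable from the parent page, a minimal sketch of reading the href and src values through the iframe's contentDocument could look like the following. The helper name collectFrameLinks is hypothetical and not part of jsdom. It assumes the iframes are same-origin and have already been inserted and loaded, and it reuses the "divId > div > iframe" selector from the question.

```javascript
// Hypothetical helper: collect href/src pairs from the document inside each
// iframe matched by "<divId> > div > iframe". Works on any DOM-like document.
function collectFrameLinks(document, divIds) {
  const results = [];
  for (const divId of divIds) {
    const iframe = document.querySelector(`${divId} > div > iframe`);
    // iframe is null if the element is not in the DOM (yet); contentDocument
    // is null when no frame document is accessible to the parent page.
    const frameDoc = iframe ? iframe.contentDocument : null;
    if (!frameDoc) {
      results.push({ divId, links: [] });
      continue;
    }
    const links = Array.from(frameDoc.querySelectorAll("a[href]")).map((a) => {
      const img = a.querySelector("img");
      return {
        href: a.getAttribute("href"),
        src: img ? img.getAttribute("src") : null,
      };
    });
    results.push({ divId, links });
  }
  return results;
}
```

With jsdom, this would probably need to run after the window's load event rather than directly in the fromURL callback, for example dom.window.addEventListener("load", () => console.log(collectFrameLinks(dom.window.document, ["#one", "#two"]))). If the iframes are injected later by scripts, or are cross-origin, this sketch returns empty link lists and the network-capture approach above is the more reliable route.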

