How to scrape <html>...</html> INSIDE another <html>...</html> with puppeteer

Alright, so the page I'm trying to scrape with Node.js and Puppeteer is structured like this:

    <html lang = "en">
    ....
       <html xmlns="https://www.w3.org/1999/xhtml" lang="en">
            <a href = "link I'm trying to go to">Go to link</a>
       </html>
    </html>

I tried to click the link by selector and by XPath. Neither worked, and I triple-checked that both were right. I suspect it has something to do with the embedded html, but I don't know how to handle it. Can anyone help?

Other comments pointed out that content inside an iframe is not accessible from the parent document. I checked the code again, and it turns out it was actually structured like this:

    <html lang = "en">
    ....
       <iframe src = "url">
           <html xmlns="https://www.w3.org/1999/xhtml" lang="en">
               <a href = "link I'm trying to go to">Go to link</a>
           </html>
       </iframe>
    </html>

So all I had to do was call page.goto(url) with the iframe's src URL, and then I could scrape as normal. Thanks everyone!
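
For anyone landing here later, here is a rough sketch of that fix, plus an alternative that stays on the parent page and works through the iframe's own frame. The helper names (gotoIframeSrc, getLinkHrefInsideIframe, main), the default selectors 'iframe' and 'a', and the startUrl parameter are all made up for illustration; adjust them to the real page. It assumes the puppeteer package is installed.

```javascript
// Approach from the post: read the iframe's src, then load that URL directly,
// so the inner document becomes the top-level page.
async function gotoIframeSrc(page, iframeSelector = 'iframe') {
  const src = await page.$eval(iframeSelector, (el) => el.src);
  await page.goto(src, { waitUntil: 'domcontentloaded' });
  return src;
}

// Alternative: stay on the parent page and query inside the iframe's Frame.
async function getLinkHrefInsideIframe(page, iframeSelector = 'iframe', linkSelector = 'a') {
  const iframeHandle = await page.waitForSelector(iframeSelector);
  const frame = await iframeHandle.contentFrame();
  return frame.$eval(linkSelector, (a) => a.href);
}

// Hypothetical driver, not called automatically. startUrl is a placeholder
// for the outer page's address.
async function main(startUrl) {
  const puppeteer = require('puppeteer'); // loaded lazily; must be installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(startUrl);
    await gotoIframeSrc(page);
    // The link is now in the top-level document, so a normal click works.
    await Promise.all([page.waitForNavigation(), page.click('a')]);
    console.log('Now at:', page.url());
  } finally {
    await browser.close();
  }
}
```

Calling main('https://...') with the outer page's URL runs the first approach end to end. The second helper may be the better fit when the iframe only renders correctly inside its parent page (cookies, referrer checks, and so on).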
