How to scrape <html>...</html> INSIDE another <html>...</html> with puppeteer

Question

Alright, so the page I'm trying to scrape with Node.js and Puppeteer is structured like this:

    <html lang = "en">
    ....
       <html xmlns="https://www.w3.org/1999/xhtml" lang="en">
            <a href = "link I'm trying to go to">Go to link</a>
       </html>
    </html>

I tried clicking the link by CSS selector and by XPath. Neither worked, even though I triple-checked that both were correct. I suspect it has something to do with the embedded html, and I don't know how to handle it. Can anyone help?

Answer 1

Other comments pointed out that content inside an iframe is not accessible from the parent document, so a selector or XPath run against the outer page never reaches it. I checked the code again, and it turns out it was actually structured like this:

    <html lang = "en">
    ....
       <iframe src = "url">
           <html xmlns="https://www.w3.org/1999/xhtml" lang="en">
               <a href = "link I'm trying to go to">Go to link</a>
           </html>
       </iframe>
    </html>

So all I had to do was call page.goto(url) with the iframe's src URL, and then I could scrape as normal. Thanks everyone!
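For anyone landing here later, this is roughly what that looks like in code. It is only a sketch: the function names, the default selectors ('iframe' and 'a') and the example URL are placeholders I made up, not values from the real page. The first function follows the approach above (read the iframe's src, then page.goto it). The second is an alternative that was not part of my fix: stay on the outer page and work through the frame handle that Puppeteer exposes via contentFrame().

```javascript
// Approach from the answer: read the iframe's src on the outer page,
// navigate straight to that URL, then scrape it like any normal page.
// The selectors are hypothetical defaults; adjust them to the real markup.
async function scrapeIframeLink(page, parentUrl, iframeSelector = 'iframe', linkSelector = 'a') {
  await page.goto(parentUrl);
  // el.src is resolved by the browser to an absolute URL.
  const frameUrl = await page.$eval(iframeSelector, (el) => el.src);
  await page.goto(frameUrl);
  // The link now lives in the top-level document, so a plain selector works.
  return page.$eval(linkSelector, (a) => a.href);
}

// Alternative (not what the answer did): stay on the outer page and
// click inside the iframe through its Frame object.
async function clickLinkInsideIframe(page, iframeSelector = 'iframe', linkSelector = 'a') {
  const handle = await page.$(iframeSelector);
  if (!handle) throw new Error(`No element matches ${iframeSelector}`);
  const frame = await handle.contentFrame();
  if (!frame) throw new Error('Matched element has no content frame');
  await frame.waitForSelector(linkSelector);
  await frame.click(linkSelector);
}

// Example wiring. puppeteer is required lazily so the helpers above can be
// exercised with a stub page object. The URL is a placeholder.
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const href = await scrapeIframeLink(page, 'https://example.com/');
    console.log(href);
  } finally {
    await browser.close();
  }
}
```

Call main() to run it against a real page. Going to the iframe URL directly is the simpler option when you only need the inner content; the contentFrame() route is useful when the inner page only works while embedded in the outer one.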
