
Scrape nested pages with Puppeteer

I would like to know how to scrape data located in nested pages. Here's an example I tried to build, but I couldn't make it work. The idea is to go to https://dev.to/ , click a post, and grab its title, then go back and repeat the process for the next post.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://dev.to/");

  try {
    const selectors = await page.$$(".crayons-story > a");

    for (const post of selectors) {
      await Promise.all([
        page.waitForNavigation(),
        post.click(),
        page.goBack(),
      ]);
    }
  } catch (error) {
    console.log(error);
  } finally {
    browser.close();
  }
})();

When I run this code, I get: Error: Node is either not visible or not an HTMLElement

Edit: The code is missing the piece that grabs the title, but it is enough for the purpose of this question.

What is happening is that the website doesn't have that node yet when the page first loads. Puppeteer fetches the page contents immediately after navigating, before the site's script tags have run and injected the stories. What you need is to wait until those scripts have inserted the story elements.

To wait, use the following command:

await page.waitForSelector(".crayons-story > a")

This makes sure Puppeteer waits for that selector to appear in the DOM before it starts scraping the contents.

So your final code should look like this:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://dev.to/");
  
  await page.waitForSelector(".crayons-story > a");
  try {
    const selectors = await page.$$(".crayons-story > a");

    for (const post of selectors) {
      await Promise.all([
        page.waitForNavigation(),
        post.click(), // ElementHandle.click() takes no selector argument
        page.goBack(),
      ]);
    }
  } catch (error) {
    console.log(error);
  } finally {
    browser.close();
  }
})();

The problem I'm facing here is very similar to this one: Puppeteer: Execution context was destroyed, most likely because of a navigation.

The best solution I could come up with is to avoid page.goBack() and instead use page.goto(), so the element references are not lost.

Solution 1 (this one uses map and resolves the scrapes concurrently, much quicker than the one below):

const puppeteer = require("puppeteer");

const SELECTOR_POSTS_LINK = ".article--post__title > a";
const SELECTOR_POST_TITLE = ".article-header--title";

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.smashingmagazine.com/articles/");

  try {
    const links = await page.$$eval(SELECTOR_POSTS_LINK, (links) => links.map((link) => link.href));

    const resolver = async (link) => {
      // Open a separate tab per link; sharing one page across
      // concurrent goto() calls would make the navigations race.
      const tab = await browser.newPage();
      await tab.goto(link);
      const title = await tab.$eval(SELECTOR_POST_TITLE, (el) => el.textContent);
      await tab.close();

      return { title };
    };

    // map() returns an array of promises; no await needed here
    const promises = links.map((link) => resolver(link));
    const articles = await Promise.all(promises);

    console.log(articles);
  } catch (error) {
    console.log(error);
  } finally {
    browser.close();
  }
}

scrape();

Solution 2 (uses for...of, so the links are scraped sequentially, which is much slower than the previous one):

const puppeteer = require("puppeteer");

const SELECTOR_POSTS_LINK = ".article--post__title > a";
const SELECTOR_POST_TITLE = ".article-header--title";

async function scrape() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.smashingmagazine.com/articles/");

  try {
    const links = await page.$$eval(SELECTOR_POSTS_LINK, (links) => links.map((link) => link.href));

    const articles = [];
    for (const link of links) {
      await page.goto(link);
      const title = await page.$eval(SELECTOR_POST_TITLE, (el) => el.textContent);
      articles.push({ title });
    }
    console.log(articles);
  } catch (error) {
    console.log(error);
  } finally {
    browser.close();
  }
}

scrape();
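A middle ground between the two solutions is to process the links in fixed-size batches: you still get concurrency within each batch, but you cap how many tabs are open at once. A minimal sketch of the batching helper (the chunk name and the batch size of 3 below are arbitrary choices, not part of the original answer):

```javascript
// Split an array into fixed-size batches so Promise.all can be
// applied one batch at a time, capping how many scrapes run at once.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

It would then plug into Solution 1 like this: for (const batch of chunk(links, 3)) { articles.push(...(await Promise.all(batch.map(resolver)))); }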
