
Puppeteer: get an array of hrefs, then iterate through each href and the hrefs on that page

I'm trying to scrape data with Puppeteer in Node.js.

Currently, I'm looking to write a script that scrapes all the data within a certain section of well.ca.

Here's the methodology/logic I'm trying to implement in Node.js:

1 - head to the Medicine & Health section of the site

2 - use the DOM selector .panel-body-content a[href] to get an array of sub-section hrefs from .panel-body-content

3 - iterate through each sub-section link with a for loop

4 - for each sub-section link, get another array of hrefs, one per product, by reading the href of each element with the class value col-lg-5ths col-md-3 col-sm-4 col-xs-6, via the selector .col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href]

5 - loop through each of the products within the sub-section

6 - scrape the data for each product

Currently, I've written most of the above code:

const puppeteer = require('puppeteer');
const chromeOptions = {
  headless: false,
  defaultViewport: null,
};
(async function main() {
  const browser = await puppeteer.launch(chromeOptions);
  try {
    const page = await browser.newPage();
    await page.goto("https://well.ca/categories/medicine-health_2.html");
    console.log("::::::: OPEN WELL   ::::::::::");

    // href attribute
    const hrefs1 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.panel-body-content a[href]'),
        a => a.getAttribute('href')
      )
    );

    console.log(hrefs1);

    const urls = hrefs1;

    for (let i = 0; i < urls.length; i++) {
      const url = urls[i];
      await page.goto(url);
    }

    const hrefs2 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href]'),
        a => a.getAttribute('href')
      )
    );

When I attempt to get an array of hrefs for the products, the array comes back empty.

How can I add a nested for loop to get an array of all the hrefs for every product in every sub-section, and then visit each product link?

What is the correct DOM selector for getting all the hrefs within elements that have the class value col-lg-5ths col-md-3 col-sm-4 col-xs-6 and the id product_grid_link?

And if I wanted to add a subsequent loop that grabs information from each product via its href, how could I embed that into the code?

Any help would be much appreciated.

First, the selector in the question matches nothing because the spaces in .col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href] are parsed as descendant combinators; to match a single element that carries several classes, chain the class names with dots and no spaces: .col-lg-5ths.col-md-3.col-sm-4.col-xs-6 a[href].

Also, it seems some links are duplicated, so it is better to collect all the links to the final pages, dedupe the link list, and only then scrape the final pages. (You can also save the links of the final pages in a file to use later.) This script collects 5395 links (deduped).

'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
    const [page] = await browser.pages();

    await page.goto('https://well.ca/categories/medicine-health_2.html');

    // Collect the sub-section links; the Set removes duplicate URLs.
    const hrefsCategoriesDeduped = new Set(await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.panel-body-content a[href]'),
        a => a.href
      )
    ));

    const hrefsPages = [];

    // Visit each sub-section and collect the product page links.
    for (const url of hrefsCategoriesDeduped) {
      await page.goto(url);
      hrefsPages.push(...await page.evaluate(
        () => Array.from(
          document.querySelectorAll('.col-lg-5ths.col-md-3.col-sm-4.col-xs-6 a[href]'),
          a => a.href
        )
      ));
    }

    const hrefsPagesDeduped = new Set(hrefsPages);

    // hrefsPagesDeduped can be converted back to an array
    // and saved in a JSON file now if needed.

    for (const url of hrefsPagesDeduped) {
      await page.goto(url);

      // Scrape the page.
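      // A minimal sketch of this step, as an assumption: the selectors
      // 'h1' (product name) and '.price' below are NOT verified against
      // well.ca's actual product-page markup and may need adjusting.
      const product = await page.evaluate(() => {
        const text = (sel) => {
          const el = document.querySelector(sel);
          return el ? el.textContent.trim() : null;
        };
        return { name: text('h1'), price: text('.price') };
      });
      console.log(url, product);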
    }

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
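If you want to save the deduped links to a file, as suggested above, here is a minimal sketch using Node's built-in fs module; it would go right after hrefsPagesDeduped is built (links.json is just an example filename):

const fs = require('fs');

// Write the deduped link list to disk as JSON...
fs.writeFileSync('links.json', JSON.stringify([...hrefsPagesDeduped], null, 2));

// ...and reload it later without re-crawling the category pages:
const savedLinks = JSON.parse(fs.readFileSync('links.json', 'utf8'));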
