
Puppeteer (JS): how to scrape all rows from a crypto table

I am creating this scraper to collect all the links from this crypto site. I am very new to Puppeteer, so I don't know much, but I decided to use it because it is faster than Selenium. So far I have been able to collect only the first 15 or so links, from a table of 100 rows and 9 pages. I don't understand why the scraper is missing so many rows, because more than 15 are showing when the page first loads.

const puppeteer = require("puppeteer");

async function run() {
  try {
    const browser = await puppeteer.launch({
      headless: false,
      defaultViewport: null,
      args: ["--start-maximized"],
    });
    const page = await browser.newPage();
    await page.goto("https://coinmarketcap.com/");
    // This selector only matches about 15 rows (the problem being asked about).
    const grabedTableLinks = await page.evaluate(() => {
      const aTags = Array.from(
        document.querySelectorAll(
          "table.cmc-table tbody tr td div.sc-16r8icm-0.escjiH a.cmc-link"
        )
      );
      return aTags.map((a) => ({ href: a.getAttribute("href") }));
    });
    return grabedTableLinks;
  } catch (e) {
    return e;
  }
}
run().then(console.log).catch(console.error);

To sum it all up, this is what I have so far, and it's only able to scrape the first 15 links in the table. I need to scrape this specific link from all table rows.

The problem seems to be that the data is not fully loaded on the first load; it is loaded gradually.
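One way to confirm that rows really do fill in gradually is to count the matching elements repeatedly until the number stops growing. The sketch below is a generic polling helper, not part of Puppeteer: the name `waitUntilStable` and its options (`intervalMs`, `stableRounds`, `maxRounds`) are made up for illustration, and it assumes the count only ever increases.

```javascript
// Hypothetical helper (not a Puppeteer API): calls `getCount` repeatedly and
// returns the highest count seen once it has not grown for `stableRounds`
// consecutive checks, or once `maxRounds` checks have been made.
async function waitUntilStable(
  getCount,
  { intervalMs = 500, stableRounds = 3, maxRounds = 40 } = {}
) {
  let last = -1; // highest count seen so far
  let stable = 0; // consecutive checks without growth
  for (let i = 0; i < maxRounds && stable < stableRounds; i++) {
    const current = await getCount();
    if (current > last) {
      last = current;
      stable = 0;
    } else {
      stable += 1;
    }
    if (stable < stableRounds) {
      // Pause before the next check.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return last;
}
```

With Puppeteer, `getCount` could be something like `() => page.$$eval(selector, (els) => els.length)`. If the site only renders more rows while the page is being scrolled, waiting alone will not change the count, so treat this as a diagnostic rather than a fix.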

Looking at the page structure, I noticed that the elements with the .sc-16r8icm-0.escjiH classes are the ones that are not fully loaded. If you dig deeper into the td elements, though, you will notice that the links are already there, no matter whether the row is fully loaded or not, so you can use a selector like this to get the link you need:

document.querySelectorAll("table.cmc-table tbody tr td:nth-child(3) a.cmc-link")

The td:nth-child(3) part selects the third cell of each row, which is where the link sits.

And your code will look like this:

const puppeteer = require("puppeteer");

async function run() {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null,
    args: ["--start-maximized"],
  });
  try {
    const page = await browser.newPage();
    await page.goto("https://coinmarketcap.com/");
    // The link sits in the third cell of every row, even in rows whose
    // other cells have not finished rendering yet.
    const grabbedTableLinks = await page.evaluate(() => {
      const aTags = Array.from(
        document.querySelectorAll(
          "table.cmc-table tbody tr td:nth-child(3) a.cmc-link"
        )
      );
      return aTags.map((a) => ({ href: a.getAttribute("href") }));
    });
    return grabbedTableLinks;
  } finally {
    // Close the browser whether or not scraping succeeded; any error
    // propagates to the .catch() below.
    await browser.close();
  }
}
run().then(console.log).catch(console.error);
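The question also mentions 9 pages, which the code above does not cover: it only reads the page it lands on. A rough sketch of looping over pages follows. It is built on two assumptions that are not confirmed anywhere in this thread: that page N of the table is reachable by adding `?page=N` to the URL, and that the `href` values are relative paths that need resolving against the site root. `scrapePages` and `toAbsoluteLinks` are hypothetical helper names; `page` is a Puppeteer page created as in the code above.

```javascript
// Selector taken from the answer above.
const LINK_SELECTOR = "table.cmc-table tbody tr td:nth-child(3) a.cmc-link";

// Resolve raw href values against a base URL, dropping empty values and
// duplicates while keeping first-seen order.
function toAbsoluteLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue;
    seen.add(new URL(href, baseUrl).href);
  }
  return [...seen];
}

// ASSUMPTION: page N of the table lives at `${baseUrl}?page=N`.
// Check this against the real site before relying on it.
async function scrapePages(page, pageCount, baseUrl = "https://coinmarketcap.com/") {
  const all = [];
  for (let n = 1; n <= pageCount; n++) {
    const url = new URL(baseUrl);
    url.searchParams.set("page", String(n));
    await page.goto(url.href);
    // Collect the raw href attribute of every matching link on this page.
    const hrefs = await page.$$eval(LINK_SELECTOR, (anchors) =>
      anchors.map((a) => a.getAttribute("href"))
    );
    all.push(...hrefs);
  }
  return toAbsoluteLinks(all, baseUrl);
}
```

Inside `run()`, this would be called as `await scrapePages(page, 9)` in place of the single `page.evaluate` call. Visiting the pages one after another in a single tab keeps the sketch simple; it is not the fastest option.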

