Web scrape pagination in a single URL (cheerio and axios)

Newbie here. I'm working on a web scraping project and would like some guidance on scraping paginated content. I'm scraping this site: https://www.imoney.my/unit-trust-investments. As you can see, I want to retrieve the different "Total return" percentages depending on the number of years selected. Right now I'm using cheerio and axios.

const http = require("http");
const axios = require("axios");
const cheerio = require("cheerio");

http
    .createServer(async function (_, res) {
        try {
            const response = await axios.get(
                "https://www.imoney.my/unit-trust-investments"
            );

            const $ = cheerio.load(response.data);

            const funds = [];
            $("[class='list-item']").each((_i, row) => {
                const $row = $(row);

                const fund = $row.find("[class*='product-title']").find("a").text();
                const price = $row.find("[class*='is-narrow product-profit']").find("b").text();
                const risk = $row.find("[class*='product-title']").find("[class*='font-xsm extra-info']").text().replace('/10','');
                const totalreturn = $row.find("[class*='product-return']").find("[class='font-lg']").find("b").text().replace('%','');

                funds.push({ fund, price, risk, totalreturn});
            });
            
            res.statusCode = 200;
            res.write(JSON.stringify(funds, null, 4));
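        } catch (err) {
            // A failed upstream fetch or parse is a server-side problem, not a bad client request.
            console.error(err);
            res.statusCode = 500;
            res.write("Unable to process request.");
        }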
        res.end();
    })
    .listen(8080);

Note that the URL does not change when a different year is selected; only the total return value changes.

This happens because the page uses JavaScript to generate the content, so the HTML that axios downloads never reflects the selected year. In this case, you need something like Puppeteer, which drives a real browser. Here is what that could look like:

const puppeteer = require("puppeteer");

const availableFunds = "10000"; // amount typed into the #amount field
// Number of ArrowUp presses in the #tenure dropdown:
// 3 for 0.5 years, 2 for 1 year, 1 for 2 years, 0 for 3 years.
const years = 2;

async function start() {
  const browser = await puppeteer.launch({
    headless: false,
  });

  const page = await browser.newPage();
  await page.goto("https://www.imoney.my/unit-trust-investments");
  await page.waitForSelector(".product-item");

  await page.focus("#amount");
  for (let i = 0; i < 5; i++) {
    await page.keyboard.press("Backspace");
  }
  await page.type("#amount", availableFunds);

  await page.click("#tenure");
  for (let i = 0; i < years; i++) {
    await page.keyboard.press("ArrowUp");
  }
  await page.keyboard.press("Enter");
  const funds = await page.evaluate(() => {
    const funds = [];
    Array.from(document.querySelectorAll(".product-item")).forEach((el) => {
      const fund = el.querySelector(".title")?.textContent.trim();
      const price = el.querySelector(".investmentReturnValue")?.textContent.trim();
      const risk = el.querySelector(".col-title .info-desc dd")?.textContent.trim();
      const totalreturn = el.querySelector(".col-rate.text-left .info-desc .ir-value")?.textContent.trim();
      if (fund && price && risk && totalreturn) funds.push({ fund, price, risk, totalreturn });
    });
    return funds;
  });

  console.log(funds);

  await browser.close();
}

start().catch(console.error);

Output:

[
  {
    fund: 'Aberdeen Standard Islamic World Equity Fund - Class A',
    price: 'RM 12,651.20',
    risk: 'Medium\n                                7/10',
    totalreturn: '26.51'
  },
  {
    fund: 'Affin Hwang Select Balanced Fund',
    price: 'RM 10,355.52',
    risk: 'Medium\n                                5/10',
    totalreturn: '3.56'
  },
... and others
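
The `risk` values in the output above still contain a newline and padding, and `price` and `totalreturn` are strings. If you want cleaner values, a post-processing step along these lines might help. This is only a sketch: `normalizeFund` and its output field names (`riskLevel`, `riskScore`, `totalReturn`) are hypothetical and not part of the original code, and it assumes every row has the same shape as the sample output (an "RM" price with comma separators, and a risk string ending in a score out of 10).

```javascript
// Hypothetical helper: cleans one row as returned by the Puppeteer script.
// Assumes the { fund, price, risk, totalreturn } shape shown in the output.
function normalizeFund(raw) {
  // "Medium\n        7/10" -> ["Medium", "7/10"]
  const riskParts = raw.risk.split(/\s+/).filter(Boolean);
  const scoreText = riskParts.find((part) => /^\d+\/10$/.test(part));
  const riskLevel = riskParts.filter((part) => part !== scoreText).join(" ");

  return {
    fund: raw.fund.trim(),
    // "RM 12,651.20" -> 12651.2 (drops everything except digits and the dot)
    price: Number(raw.price.replace(/[^\d.]/g, "")),
    riskLevel,
    // "7/10" -> 7, or null when no score was found
    riskScore: scoreText ? Number(scoreText.split("/")[0]) : null,
    // "26.51" or "26.51%" -> 26.51
    totalReturn: Number(raw.totalreturn.replace("%", "")),
  };
}

// Usage with the first row from the output above.
const sample = {
  fund: "Aberdeen Standard Islamic World Equity Fund - Class A",
  price: "RM 12,651.20",
  risk: "Medium\n                                7/10",
  totalreturn: "26.51",
};
console.log(normalizeFund(sample));
```

In the scraper you could then write `funds.map(normalizeFund)` before logging or serialising the result.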
