
Same selector for different types of data in web scraping using Puppeteer

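I'm a novice web developer and only recently started coding.

I'm only familiar with HTML/CSS/JS and Node.

I'm currently working on a page-scraping project using Puppeteer.

Problem: in markup like the following, different types of data share the same selector
(in this case, a[rel="tag"]).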

<span class="clip-link">

  <h4>Stars:</h4>
  <a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
  <a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>
  <h4>Singers:</h4>
  <a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
  <a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>

or

<span class="clip-link">

  <h4>Stars:</h4>
  <a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
  <a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>,
  <a href="https://www.media.com/ACTORS/amir-khan/" rel="tag">Amir Khan</a>
  <h4>Singers:</h4>
  <a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
  <a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>

The only consistent difference between these tags is in their URLs, in the path segment right after the domain name.


Question:

How can I select these tags, categorize them by the difference in their URLs (".com/ACTORS/" or ".com/SINGERS/"), and then store each element's innerText, like this:

actors = ["Darshan Raval","Priyanka Chopra"]
singers = ["Hardy Sandhu","Amit Trivedi"]

or

actors = ["Darshan Raval","Priyanka Chopra","Amir Khan"]
singers = ["Hardy Sandhu","Amit Trivedi"]

The number of "Stars" and "Singers" varies from page to page, so I can't rely on fixed array positions or counts.

You can try something like this:

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

const html = `
  <!doctype html>
  <html>
    <head><meta charset='UTF-8'><title>Test</title></head>
    <body>
      <span class="clip-link">
        <h4>Stars:</h4>
        <a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
        <a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>,
        <a href="https://www.media.com/ACTORS/amir-khan/" rel="tag">Amir Khan</a>
        <h4>Singers:</h4>
        <a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
        <a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
      </span>
    </body>
  </html>`;

try {
  const [page] = await browser.pages();

  await page.goto(`data:text/html,${html}`);

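  // Runs in the page context: collect every tag link and group its text
  // by the first path segment of its URL (e.g. "ACTORS" or "SINGERS").
  const data = await page.evaluate(() => {
    const tags = [...document.querySelectorAll('a[rel="tag"]')];
    return tags.reduce((persons, tag) => {
      // "/ACTORS/darshan-raval/".split('/') -> ["", "ACTORS", "darshan-raval", ""]
      const type = tag.pathname.split('/')[1];
      persons[type] ??= [];
      persons[type].push(tag.innerText);
      return persons;
    }, {});
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}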

Output:

{
  ACTORS: [ 'Darshan Raval', 'Priyanka Chopra', 'Amir Khan' ],
  SINGERS: [ 'Hardy Sandhu', 'Amit Trivedi' ]
}
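
If you want lowercase keys such as actors and singers, as in the question, the grouping step can also be written as a plain function and tried out in Node without launching a browser. The following is only a minimal sketch under a few assumptions: groupByCategory and the { href, text } input shape are hypothetical names introduced here for illustration, and the input is assumed to contain absolute URLs (for example, collected in the page with a.href and a.innerText).

```javascript
// Group link texts by the first path segment of each link's URL.
// `links` is a hypothetical array of { href, text } objects; it is not
// part of Puppeteer, just a plain data shape assumed for this sketch.
function groupByCategory(links) {
  const groups = {};
  for (const { href, text } of links) {
    // "/ACTORS/darshan-raval/" -> ["", "ACTORS", "darshan-raval", ""]
    const segment = new URL(href).pathname.split('/')[1];
    if (!segment) continue; // skip links that have no path segment
    const key = segment.toLowerCase(); // "ACTORS" -> "actors"
    if (!groups[key]) groups[key] = [];
    groups[key].push(text.trim());
  }
  return groups;
}

// Sample data mirroring the markup from the question.
const sample = [
  { href: 'https://www.media.com/ACTORS/darshan-raval/', text: 'Darshan Raval' },
  { href: 'https://www.media.com/ACTORS/priyanka-chopra/', text: 'Priyanka Chopra' },
  { href: 'https://www.media.com/ACTORS/amir-khan/', text: 'Amir Khan' },
  { href: 'https://www.media.com/SINGERS/hardy-sandhu/', text: 'Hardy Sandhu' },
  { href: 'https://www.media.com/SINGERS/amit-trivedi/', text: 'Amit Trivedi' },
];

// Default to empty arrays in case a page has no links of one category.
const { actors = [], singers = [] } = groupByCategory(sample);
console.log(actors);  // [ 'Darshan Raval', 'Priyanka Chopra', 'Amir Khan' ]
console.log(singers); // [ 'Hardy Sandhu', 'Amit Trivedi' ]
```

Because the categories are derived from the URLs rather than from positions in the list, this works regardless of how many links each category contains.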

