Same selector for different types of data in web scraping using Puppeteer
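I'm a novice web developer and only recently started coding.
I'm only familiar with HTML/CSS/JS and Node.
I'm currently working on a page-scraping project using Puppeteer.
Problem: in markup like the following, different types of data share the same selector
(in this case, a[rel="tag"]).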
<span class="clip-link">
<h4>Stars:</h4>
<a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
<a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>
<h4>Singers:</h4>
<a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
<a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>
or
<span class="clip-link">
<h4>Stars:</h4>
<a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
<a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>,
<a href="https://www.media.com/ACTORS/amir-khan/" rel="tag">Amir Khan</a>
<h4>Singers:</h4>
<a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
<a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>
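The only consistent difference between these tags is in their URLs, in the path segment right after the domain name.
Question:
How can I select these tags, classify them by that URL difference (".com/ACTORS/" or ".com/SINGERS/"), and then get each element's innerText so I can store the names like this?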
actors = ["Darshan Raval","Priyanka Chopra"]
singers = ["Hardy Sandhu","Amit Trivedi"]
or
actors = ["Darshan Raval","Priyanka Chopra","Amir Khan"]
singers = ["Hardy Sandhu","Amit Trivedi"]
The number of stars and singers differs from page to page, so I can't rely on fixed array positions or counts.
You can try something like this:
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch();
const html = `
<!doctype html>
<html>
<head><meta charset='UTF-8'><title>Test</title></head>
<body>
<span class="clip-link">
<h4>Stars:</h4>
<a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
<a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>,
<a href="https://www.media.com/ACTORS/amir-khan/" rel="tag">Amir Khan</a>
<h4>Singers:</h4>
<a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
<a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>
</body>
</html>`;
try {
  const [page] = await browser.pages();
  await page.goto(`data:text/html,${html}`);
  const data = await page.evaluate(() => {
    // Collect every tag link, whatever category it belongs to.
    const tags = [...document.querySelectorAll('a[rel="tag"]')];
    return tags.reduce((persons, tag) => {
      // pathname is e.g. "/ACTORS/darshan-raval/", so index 1 is the category.
      const type = tag.pathname.split('/')[1];
      // Create the array for this category the first time it is seen.
      persons[type] ??= [];
      persons[type].push(tag.innerText);
      return persons;
    }, {});
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}
Output:
{
ACTORS: [ 'Darshan Raval', 'Priyanka Chopra', 'Amir Khan' ],
SINGERS: [ 'Hardy Sandhu', 'Amit Trivedi' ]
}
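If you want the lowercase `actors` and `singers` arrays from the question, the same grouping can also be done on plain data outside the browser. The following is only a sketch under some assumptions: `groupByFirstPathSegment` is a hypothetical helper name, and the `links` array stands in for what you might return from `page.evaluate` by mapping each tag to `{ href: tag.href, text: tag.innerText }`. It uses the standard `URL` class, which is available globally in Node.

```javascript
// Hypothetical helper (not part of Puppeteer): groups link records by the
// first path segment of their URL, lowercased, e.g. "/ACTORS/..." -> "actors".
function groupByFirstPathSegment(links) {
  const groups = {};
  for (const { href, text } of links) {
    // "/ACTORS/darshan-raval/".split('/') gives ["", "ACTORS", "darshan-raval", ""]
    const type = new URL(href).pathname.split('/')[1].toLowerCase();
    // Create the array for this category on first use, then append the name.
    (groups[type] ??= []).push(text.trim());
  }
  return groups;
}

// Assumed shape of the data extracted from the page (sample values from the question).
const links = [
  { href: 'https://www.media.com/ACTORS/darshan-raval/', text: 'Darshan Raval' },
  { href: 'https://www.media.com/ACTORS/priyanka-chopra/', text: 'Priyanka Chopra' },
  { href: 'https://www.media.com/ACTORS/amir-khan/', text: 'Amir Khan' },
  { href: 'https://www.media.com/SINGERS/hardy-sandhu/', text: 'Hardy Sandhu' },
  { href: 'https://www.media.com/SINGERS/amit-trivedi/', text: 'Amit Trivedi' },
];

// Default to empty arrays in case a page has no actors or no singers.
const { actors = [], singers = [] } = groupByFirstPathSegment(links);
console.log(actors);  // [ 'Darshan Raval', 'Priyanka Chopra', 'Amir Khan' ]
console.log(singers); // [ 'Hardy Sandhu', 'Amit Trivedi' ]
```

Because the grouping key comes from the URL rather than from element positions, it does not matter how many links of each kind a page contains.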