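How to extract innerText of nested tags with the same class names
I want to get familiar with JavaScript and Puppeteer, so please treat this as a practice example. I managed to put together a Puppeteer script (for learning purposes) that grabs the innerText of all 4 class names given in my HTML block. For the most part, the script runs and works. The class names are: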
class="fc-item__kicker"
class="js-headline-text"
link a href
class="fc-item__standfirst"
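The problem is that there are multiple instances of the same selectors.
This means I can only extract the inner text of the first instance, but not of the second. How can I do this?
To train myself, I'm using The Guardian's front page, because it has deep and complex nested HTML tags and classes.
Here is a small part of the HTML block: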
<!DOCTYPE html>
<html>
<body>
<div class="l-side-margins">
<div class="facia-page">
<section id="headlines" class="fc-container fc-container--has-toggle">
<div class="fc-container__inner">
<div class="fc-container--rolled-up-hide fc-container__body" id="container-10f21d96-18f6-426f-821b-19df55dfb831">
<div class="fc-slice-wrapper">
<ul class="u-unstyled l-row l-row--cols-4 fc-slice fc-slice--qqq-q">
<li class="fc-slice__item l-row__item l-row__item--span-3 u-faux-block-link">
<div class="fc-item__container">
<div class="fc-item__content">
<div class="fc-item__header">
<h3 class="fc-item__title"><a class="fc-item__link" href="https://www.example.com"><span class="fc-item__kicker">Monterey Park shooting</span> <span class="u-faux-block-link__cta fc-item__headline"><span class="js-headline-text">Beloved dance hall manager named among victims</span></span></a></h3>
</div>
<div class="fc-item__standfirst-wrapper">
<div class="fc-item__standfirst">
California officials yet to identify eight others who died in Saturday attack, at least 36th mass shooting in US so far this year
</div>
</div>
<div class="fc-item__footer--vertical">
<ul class="fc-sublinks u-unstyled u-faux-block-link__promote">
<li class="fc-sublink fc-sublink--pillar-news fc-sublink--type-article">
<h4 class="fc-sublink__title"><a class="fc-sublink__link" href="https://www.example.com"><span class="fc-sublink__kicker">LA mass shooting</span> Man who disarmed California shooter tells of violent struggle for gun</a></h4>
</li>
</ul>
</div>
</div>
</div>
</li>
</ul>
</div>
<div class="fc-slice-wrapper">
<ul class="u-unstyled l-row l-row--cols-4 fc-slice fc-slice--q-q-ql-ql">
<li class="fc-slice__item l-row__item l-row__item--span-1 u-faux-block-link">
<div class="fc-item fc-item--has-image fc-item--pillar-news fc-item--type-article js-fc-item fc-item--list-media-mobile fc-item--standard-tablet js-snappable">
<div class="fc-item__container">
<div class="fc-item__media-wrapper">
<div class="fc-item__image-container u-responsive-ratio"></div>
</div>
<div class="fc-item__content">
<div class="fc-item__header">
<h3 class="fc-item__title"><a class="fc-item__link" href="https://www.example.com"><span class="fc-item__kicker">Germany</span> <span class="u-faux-block-link__cta fc-item__headline"><span class="js-headline-text">Five charged over second alleged far-right plot against government</span></span></a></h3>
</div>
<div class="fc-item__standfirst-wrapper">
<div class="fc-item__standfirst">
Four men and a woman accused of planning to abduct health minister and overthrow government
</div>
<div class="fc-item__meta js-item__meta"></div>
</div>
</div>
</div>
</div>
</li>
</ul>
</div>
</div>
</div>
</section>
</div>
</div>
</body>
</html>
Here is my script:
const fs = require('fs');
const puppeteer = require('puppeteer');
async function run() {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.theguardian.com/international/');
const headlines = await page.evaluate(() => Array.from(document.querySelectorAll('#headlines'), (e) => ({
kicker: e.querySelector('.fc-item__header .fc-item__kicker').innerText,
headline: e.querySelector('.fc-item__header .js-headline-text').innerText,
link: e.querySelector('.fc-item__header a').href,
standfirst: e.querySelector('.fc-item__standfirst-wrapper .fc-item__standfirst').textContent.replaceAll(" ", " ").trim(),
})));
console.log(headlines);
// Save data to JSON file
fs.writeFile('headlines.json', JSON.stringify(headlines), (err) => {
if (err) throw err;
console.log('File saved');
});
await browser.close();
}
run();
This is the desired result:
[
{
kicker: 'Monterey Park shooting',
headline: 'Beloved dance hall manager named among victims',
link: 'https://www.example.com',
standfirst: 'California officials yet to identify eight others who died in Saturday attack, at least 36th mass shooting in US so far this year'
},
{
kicker: 'Germany',
headline: 'Five charged over second alleged far-right plot against government',
link: 'https://www.example.com',
standfirst: 'Four men and a woman accused of planning to abduct health minister and overthrow government'
}
]
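When I visit the site, I don't see the #headlines element you show in your provided markup, but the basic problem is that you're looping over a single wrapper for all articles rather than over each article. Inside that one wrapper, querySelector returns only the first match, which is why you only get the first article. IDs are unique in almost all (valid) websites, so there's generally no point in iterating over an array that's pretty much guaranteed to have at most one item.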
Try adding .fc-item__container to your #headlines selector, giving #headlines .fc-item__container, or just use .fc-item__container on its own, as shown below.
const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^19.4.1
const url = "<Your URL>";
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setJavaScriptEnabled(false);
await page.setRequestInterception(true);
page.on("request", req => {
if (req.url() !== url) {
req.abort();
}
else {
req.continue();
}
});
await page.goto(url, {waitUntil: "domcontentloaded"});
const data = await page.$$eval(".fc-item__container", els =>
els.map(e => {
const text = s => e.querySelector(s)?.textContent.trim();
return {
kicker: text(".fc-item__header .fc-item__kicker"),
headline: text(".fc-item__header .js-headline-text"),
link: e.querySelector(".fc-item__header a")?.getAttribute("href"), // ?. avoids a throw when a container has no link
standfirst: text(".fc-item__standfirst-wrapper .fc-item__standfirst"),
};
})
);
await fs.writeFile("headlines.json", JSON.stringify(data, null, 2));
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
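The per-item callback in the Puppeteer script above can also be pulled out into a standalone function, which makes it easy to check how missing fields behave without launching a browser. The following is only a sketch: extractItems and fakeElement are hypothetical names that are not part of the original answer, and fakeElement is a bare stand-in for a DOM element that supports nothing beyond what the sketch needs.

```javascript
// Hypothetical helper (not from the original answer): maps a list of
// ".fc-item__container" elements to plain objects. It only relies on
// querySelector(), textContent and getAttribute().
function extractItems(els) {
  return els.map(e => {
    // Returns undefined instead of throwing when the selector matches nothing.
    const text = s => e.querySelector(s)?.textContent.trim();
    return {
      kicker: text(".fc-item__header .fc-item__kicker"),
      headline: text(".fc-item__header .js-headline-text"),
      link: e.querySelector(".fc-item__header a")?.getAttribute("href"),
      standfirst: text(".fc-item__standfirst-wrapper .fc-item__standfirst"),
    };
  });
}

// Minimal stand-in for a DOM element, for illustration only: it answers
// querySelector() from a selector-to-string lookup table.
const fakeElement = fields => ({
  querySelector: selector =>
    selector in fields
      ? {textContent: fields[selector], getAttribute: () => fields[selector]}
      : null,
});

// One fake article that has no standfirst.
const items = extractItems([
  fakeElement({
    ".fc-item__header .fc-item__kicker": " Germany ",
    ".fc-item__header .js-headline-text": "Five charged",
    ".fc-item__header a": "https://www.example.com",
  }),
]);
console.log(items);
```

Because the function does not close over any outer variables, it could presumably be passed straight to Puppeteer as page.$$eval(".fc-item__container", extractItems). Note that JSON.stringify omits properties whose value is undefined, so an article without a standfirst simply has no standfirst key in headlines.json.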
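Since the data is in the static HTML, the Puppeteer script can block all other requests, wait only for the DOM content to load, and disable JavaScript.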
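One thing to watch with a strict string comparison such as req.url() !== url is that the browser may request a normalized form of the URL (for example, a bare origin gains a trailing slash), in which case the main document request itself would be aborted. A minimal sketch of a more forgiving comparison follows; isSameUrl is a hypothetical helper name, not something from Puppeteer, and it does not handle server redirects, which produce a request with a genuinely different URL.

```javascript
// Hypothetical helper: true when two URL strings are identical after
// WHATWG URL normalization (lowercased host, trailing slash on a bare origin).
function isSameUrl(a, b) {
  return new URL(a).href === new URL(b).href;
}

const sameDocument = isSameUrl("https://www.example.com", "https://www.example.com/");
const otherResource = isSameUrl("https://www.example.com/", "https://www.example.com/style.css");

console.log(sameDocument); // true
console.log(otherResource); // false
```

In the request handler, the condition could then be written as !isSameUrl(req.url(), url), assuming url is an absolute URL.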
Better still, we can skip Puppeteer entirely and use a lightweight HTML parser and an HTTP request:
const cheerio = require("cheerio"); // 1.0.0-rc.12
const fs = require("node:fs/promises");
const url = "<Your URL>";
fetch(url) // Node 18 or install node-fetch, or use another library like axios
.then(res => {
if (!res.ok) {
throw Error(res.statusText);
}
return res.text();
})
.then(html => {
const $ = cheerio.load(html);
const data = [...$(".fc-item__container")].map(e => {
const text = s => $(e).find(s).first().text().trim();
return {
kicker: text(".fc-item__header .fc-item__kicker"),
headline: text(".fc-item__header .js-headline-text"),
link: $(e).find(".fc-item__header a").attr("href"), // attr() already returns undefined when nothing matches
standfirst: text(".fc-item__standfirst-wrapper .fc-item__standfirst"),
};
});
return fs.writeFile("headlines.json", JSON.stringify(data, null, 2));
})
.catch(err => console.error(err));
I'm using the promises fs API to avoid race conditions and callback ugliness.
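To illustrate the point about ordering: in the callback version from the question, fs.writeFile is started and the script immediately moves on to browser.close(), so the order of "File saved" and the browser shutdown is not guaranteed. With the promise API each step can be awaited before the next one begins. A minimal sketch, where saveJson and the temporary file name are hypothetical choices made for this example:

```javascript
const fs = require("node:fs/promises");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: writes data as pretty-printed JSON, then reads it back.
// The read only starts once the write has completed.
async function saveJson(file, data) {
  await fs.writeFile(file, JSON.stringify(data, null, 2));
  return JSON.parse(await fs.readFile(file, "utf8"));
}

// Write to the OS temp directory so the example leaves the project untouched.
const file = path.join(os.tmpdir(), "headlines-example.json");
const saved = saveJson(file, [{kicker: "Germany"}]);
saved.then(data => console.log(data));
```

A failed write rejects the returned promise, so it can be handled by the same .catch() as the rest of the chain instead of being thrown from inside a callback.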