How to extract the innerText of nested tags with the same class names

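I'm trying to get familiar with JavaScript and Puppeteer, so please treat this as a practice example. For learning purposes, I managed to put together a Puppeteer script that gets the innerText for all 4 targets given in my HTML code block. For the most part, the script runs and works. The targets (three class names plus the link) are: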

class="fc-item__kicker"
class="js-headline-text"
link a href
class="fc-item__standfirst"

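The problem is that there are multiple instances of the same selectors.

This means that I can only extract the innerText from the first instance, not from the second one. How can I get both?

To train myself, I'm using the front page of The Guardian, because it has deep and complex nested HTML tags and classes.

Here is a small part of the HTML code block: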

<!DOCTYPE html>
<html>
<body>
    <div class="l-side-margins">
        <div class="facia-page">
            <section id="headlines" class="fc-container fc-container--has-toggle">
                <div class="fc-container__inner">
                    <div class="fc-container--rolled-up-hide fc-container__body" id="container-10f21d96-18f6-426f-821b-19df55dfb831">
                        <div class="fc-slice-wrapper">
                            <ul class="u-unstyled l-row l-row--cols-4 fc-slice fc-slice--qqq-q">
                                <li class="fc-slice__item l-row__item l-row__item--span-3 u-faux-block-link">
                                    <div class="fc-item__container">
                                        <div class="fc-item__content">
                                            <div class="fc-item__header">
                                                <h3 class="fc-item__title"><a class="fc-item__link" href="https://www.example.com"><span class="fc-item__kicker">Monterey Park shooting</span> <span class="u-faux-block-link__cta fc-item__headline"><span class="js-headline-text">Beloved dance hall manager named among victims</span></span></a></h3>
                                            </div>
                                            <div class="fc-item__standfirst-wrapper">
                                                <div class="fc-item__standfirst">
                                                    California officials yet to identify eight others who died in Saturday attack, at least 36th mass shooting in US so far this year
                                                </div>
                                            </div>
                                            <div class="fc-item__footer--vertical">
                                                <ul class="fc-sublinks u-unstyled u-faux-block-link__promote">
                                                    <li class="fc-sublink fc-sublink--pillar-news fc-sublink--type-article">
                                                        <h4 class="fc-sublink__title"><a class="fc-sublink__link" href="https://www.example.com"><span class="fc-sublink__kicker">LA mass shooting</span> Man who disarmed California shooter tells of violent struggle for gun</a></h4>
                                                    </li>
                                                </ul>
                                            </div>
                                        </div>
                                    </div>
                                </li>
                            </ul>
                        </div>
                        <div class="fc-slice-wrapper">
                            <ul class="u-unstyled l-row l-row--cols-4 fc-slice fc-slice--q-q-ql-ql">
                                <li class="fc-slice__item l-row__item l-row__item--span-1 u-faux-block-link">
                                    <div class="fc-item fc-item--has-image fc-item--pillar-news fc-item--type-article js-fc-item fc-item--list-media-mobile fc-item--standard-tablet js-snappable">
                                        <div class="fc-item__container">
                                            <div class="fc-item__media-wrapper">
                                                <div class="fc-item__image-container u-responsive-ratio"></div>
                                            </div>
                                            <div class="fc-item__content">
                                                <div class="fc-item__header">
                                                    <h3 class="fc-item__title"><a class="fc-item__link" href="https://www.example.com"><span class="fc-item__kicker">Germany</span> <span class="u-faux-block-link__cta fc-item__headline"><span class="js-headline-text">Five charged over second alleged far-right plot against government</span></span></a></h3>
                                                </div>
                                                <div class="fc-item__standfirst-wrapper">
                                                    <div class="fc-item__standfirst">
                                                        Four men and a woman accused of planning to abduct health minister and overthrow government
                                                    </div>
                                                    <div class="fc-item__meta js-item__meta"></div>
                                                </div>
                                            </div>
                                        </div>
                                    </div>
                                </li>
                            </ul>
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </div>
</body>
</html>

Here is my script:

const fs = require('fs');
const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.theguardian.com/international/');

    const headlines = await page.evaluate(() => Array.from(document.querySelectorAll('#headlines'), (e) => ({
        kicker: e.querySelector('.fc-item__header .fc-item__kicker').innerText,
        headline: e.querySelector('.fc-item__header .js-headline-text').innerText,
        link: e.querySelector('.fc-item__header  a').href,
        standfirst: e.querySelector('.fc-item__standfirst-wrapper .fc-item__standfirst').textContent.replaceAll("  ", " ").trim(),
    })));

    console.log(headlines);
    
    // Save data to JSON file
    fs.writeFile('headlines.json', JSON.stringify(headlines), (err) => {
        if (err) throw err;
        console.log('File saved');
    });

    await browser.close();
}
run();

This is the desired result:

[
  {
    kicker: 'Monterey Park shooting',
    headline: 'Beloved dance hall manager named among victims',
    link: 'https://www.example.com',
    standfirst: 'California officials yet to identify eight others who died in Saturday attack, at least 36th mass shooting in US so far this year'
  },
  {
    kicker: 'Germany',
    headline: 'Five charged over second alleged far-right plot against government',
    link: 'https://www.example.com',
    standfirst: 'Four men and a woman accused of planning to abduct health minister and overthrow government'
  }
]

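When I visit the site, I don't see the #headlines element shown in your markup, but the underlying problem is that you're looping over a single wrapper around all of the articles rather than over each article. IDs are unique on nearly every (valid) website, so there is usually no point in iterating over an array that is all but guaranteed to contain at most one item.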
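To make the difference concrete, here is a toy sketch that runs in plain Node without a browser. `makeNode` is a hypothetical stand-in for a DOM element (it only understands single-class selectors such as `.fc-item__kicker`), so treat it as an illustration of the looping pattern rather than of real DOM behavior in full.

```javascript
// Hypothetical stand-in for a DOM element, for illustration only.
// Like the real DOM, querySelector returns the first matching descendant
// and querySelectorAll returns every matching descendant.
const makeNode = (cls, textContent = "", children = []) => ({
  cls,
  textContent,
  children,
  querySelectorAll(selector) {
    const wanted = selector.replace(/^\./, "");
    const found = [];
    const walk = node => {
      for (const child of node.children) {
        if (child.cls === wanted) found.push(child);
        walk(child);
      }
    };
    walk(this);
    return found;
  },
  querySelector(selector) {
    return this.querySelectorAll(selector)[0] ?? null;
  },
});

// One wrapper containing two article containers.
const wrapper = makeNode("headlines", "", [
  makeNode("fc-item__container", "", [
    makeNode("fc-item__kicker", "Monterey Park shooting"),
  ]),
  makeNode("fc-item__container", "", [
    makeNode("fc-item__kicker", "Germany"),
  ]),
]);

// Looping over the single wrapper: one record, first kicker only.
const fromWrapper = [wrapper].map(
  e => e.querySelector(".fc-item__kicker").textContent
);

// Looping over each container: one record per article.
const fromContainers = wrapper
  .querySelectorAll(".fc-item__container")
  .map(e => e.querySelector(".fc-item__kicker").textContent);

console.log(fromWrapper);
console.log(fromContainers);
```

The first array has a single entry because the outer loop runs once; the second has one entry per container.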

Try adding .fc-item__container to your #headlines selector: #headlines .fc-item__container, or just use .fc-item__container, as shown below.

const fs = require("node:fs/promises");
const puppeteer = require("puppeteer"); // ^19.4.1

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setJavaScriptEnabled(false);
  await page.setRequestInterception(true);
  page.on("request", req => {
    if (req.url() !== url) {
      req.abort();
    }
    else {
      req.continue();
    }
  });
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const data = await page.$$eval(".fc-item__container", els =>
    els.map(e => {
      const text = s => e.querySelector(s)?.textContent.trim();
      return {
        kicker: text(".fc-item__header .fc-item__kicker"),
        headline: text(".fc-item__header .js-headline-text"),
        link: e.querySelector(".fc-item__header a")?.getAttribute("href"),
        standfirst: text(".fc-item__standfirst-wrapper .fc-item__standfirst"),
      };
    })
  );
  await fs.writeFile("headlines.json", JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Since the data is in the static HTML, we can block all other requests, wait only for the DOM content to load, and disable JS.
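One thing to watch with the request filter: it compares raw strings, and as far as I know the browser reports the navigation request with a normalized URL (for example, a trailing slash added to a bare origin). If `url` is written without that slash, the page request itself might be aborted. A minimal sketch of a more forgiving comparison, where `sameUrl` is a hypothetical helper name and not part of Puppeteer:

```javascript
// Hypothetical helper: compare two URLs after WHATWG normalization,
// using the URL class that is global in Node.
const sameUrl = (a, b) => {
  try {
    return new URL(a).href === new URL(b).href;
  } catch {
    return false; // treat unparsable URLs as non-matching
  }
};

console.log(sameUrl("https://www.example.com", "https://www.example.com/")); // true
console.log(sameUrl("https://www.example.com/", "https://www.example.com/app.js")); // false
```

Inside the handler, the condition would then read `if (!sameUrl(req.url(), url))`. This still aborts redirects to a different URL, so it assumes the page is served directly from `url`.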

Better still, we can skip Puppeteer entirely and use a lightweight HTML parser plus a plain HTTP request:

const cheerio = require("cheerio"); // 1.0.0-rc.12
const fs = require("node:fs/promises");

const url = "<Your URL>";

fetch(url) // Node 18 or install node-fetch, or use another library like axios
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    const data = [...$(".fc-item__container")].map(e => {
      const text = s => $(e).find(s).first().text().trim();
      return {
        kicker: text(".fc-item__header .fc-item__kicker"),
        headline: text(".fc-item__header .js-headline-text"),
        link: $(e).find(".fc-item__header a")?.attr("href"),
        standfirst: text(".fc-item__standfirst-wrapper .fc-item__standfirst"),
      };
    });
    return fs.writeFile("headlines.json", JSON.stringify(data, null, 2));
  });

I'm using the promises fs API to avoid race conditions and callback ugliness.
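For context, the callback form `fs.writeFile(path, data, callback)` in the question's script is not awaited, so `browser.close()` can run before the file is written, and an error thrown inside the callback is not caught by any surrounding `try`/`catch`. The sketch below shows the promise-based form in isolation. `saveJson` and the temp-file name are hypothetical, chosen only for this example.

```javascript
const fs = require("node:fs/promises");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: the returned promise settles only after the file
// has been written, so callers can await it before moving on.
const saveJson = async (file, data) => {
  await fs.writeFile(file, JSON.stringify(data, null, 2));
  return file;
};

// Write to a temp file, then read it back once the write has finished.
const demoFile = path.join(os.tmpdir(), "headlines-demo.json");
const demo = saveJson(demoFile, [{kicker: "Germany"}])
  .then(file => fs.readFile(file, "utf8"))
  .then(text => JSON.parse(text));

demo
  .then(data => console.log(data))
  .catch(err => console.error(err));
```

Because each step returns a promise, the ordering is explicit and a failed write surfaces in the single `.catch` at the end of the chain.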
