How do I extract href values from links in HTML data based on the element's text?

Question

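My task is to code a web crawler that goes through a number of URLs (around 400, but the list may grow), each with a completely different HTML structure, and extracts the links that contain certain information. The only thing the program knows beforehand is which keywords it should search for; the HTML structure, and any semantic clue about where to look for those keywords, is unknown.

So far, I have used the request-promise module for Node.js to send a request to each URL in which the keywords should be searched for: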

const htmlResult = await request.get(url);

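htmlResult stores the response as a string, which I can save as a .txt or .html file if needed.

The problem I am having is that I don't know how to instruct the program to extract a URL based on words that are not necessarily present in the URL string itself. An example might help clarify: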

<a href="site/with/no/keywords-just-a-random-string" title="Keywords might be here, but title attribute might be absent"><span class="img"><img data-cfsrc="/thumbpdf/618a8nb4.jpg" alt="" style="display:none;visibility:hidden;"><noscript><img src="/thumbpdf/8bfa84.jpg" alt=""></noscript></span>
<h2>KEYWORDS ARE IN THIS TAG, WHICH IN TURN IS INSIDE THE <a> TAG</h2>
<span class="date--type">2 Nov 2021 </span>
<span class="tag">
other stuff with no keywords in it</span>
</a>

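As you can see, this tag has a complex structure. The keywords I need to match are inside the h2 tag, which in turn is inside the a tag. But the a tag could also look like this: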

 <a href="string/with/no-keywords-to-parse">KEYWORDS TO PARSE</a>

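Here the keywords are directly inside the a tag.

Therefore, my question is: how can I parse htmlResult (either as a string or saved as a .txt/.html file) and, once I get a match, instruct the program to extract the URL of the enclosing a tag in which the keyword match occurred?

Since I am working with Node.js, I am open to using any tool available for it.

Could someone offer some advice on how to approach this challenge?

Thank you very much in advance.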

Answer 1

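This is very quick and dirty, and I'm sure it can be simplified further, but it should at least get you closer to where you need to be.

This assumes there are a bunch of <div> elements, each containing one of your <a> elements, all in one document (see the link below). It uses XPath to locate the data: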

// Evaluate an XPath expression relative to a context node and return an ordered snapshot.
function xpathEval(xpath, context) {
  return document.evaluate(xpath, context, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
}

const desiredHrefs = [];

// Each container <div> is expected to wrap one <a> element.
const targets = xpathEval("//div[@class='container']", document);

for (let i = 0; i < targets.snapshotLength; i++) {
  const target = targets.snapshotItem(i);
  const attribs = xpathEval('.//*/@*', target);
  const texts = xpathEval('.//*/text()', target);

  // Check every attribute value inside the container (e.g. title).
  for (let k = 0; k < attribs.snapshotLength; k++) {
    const attribData = attribs.snapshotItem(k).textContent.toLowerCase();
    if (attribData.includes("trainer") && attribData.includes("dog")) {
      // either
      // console.log(target.querySelector('a').getAttribute('href'))
      // or
      const href = xpathEval('.//a/@href', target);
      desiredHrefs.push(href.snapshotItem(0).textContent);
    }
  }

  // Check every text node inside the container.
  for (let j = 0; j < texts.snapshotLength; j++) {
    const data = texts.snapshotItem(j).nodeValue.trim().toLowerCase();
    if (data.includes("trainer") && data.includes("dog")) {
      const href = xpathEval('.//a/@href', target);
      desiredHrefs.push(href.snapshotItem(0).textContent);
    }
  }
}

// Print each matching href once.
for (const href of new Set(desiredHrefs)) {
  console.log(href);
}

You can see it in action here.
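
Note that document.evaluate is a browser API, so the snippet above needs a DOM (a browser page, or a DOM implementation such as jsdom) to run. Since the question works with the raw htmlResult string in Node.js, here is a rough, dependency-free sketch of the same idea applied directly to a string. The function name extractHrefsByKeywords and the sample markup are made up for illustration, and the regular expressions assume reasonably well-formed markup with quoted attribute values that contain no ">" character. A real HTML parser (cheerio, for example) would likely be more robust across 400 differently structured sites.

```javascript
// Hypothetical helper (not from the original answer): return the href of every
// <a> element whose text content or attribute values contain ALL keywords.
function extractHrefsByKeywords(html, keywords) {
  const wanted = keywords.map((k) => k.toLowerCase());
  // Capture the attribute string and the inner HTML of each <a>...</a> pair.
  const anchorPattern = /<a\b([^>]*)>([\s\S]*?)<\/a\s*>/gi;
  const hrefs = new Set();

  for (const [, attrs, inner] of html.matchAll(anchorPattern)) {
    const hrefMatch = /(?:^|\s)href\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(attrs);
    if (!hrefMatch) continue; // anchor without a quoted href
    const href = hrefMatch[1] ?? hrefMatch[2];

    // Quoted attribute values of the <a> tag itself (title, href, ...).
    const attrValues = [...attrs.matchAll(/=\s*(?:"([^"]*)"|'([^']*)')/g)]
      .map((m) => m[1] ?? m[2]);
    // Strip nested tags (h2, span, img, ...) so only text remains.
    const text = inner.replace(/<[^>]*>/g, " ");

    const haystack = [...attrValues, text].join(" ").toLowerCase();
    if (wanted.every((k) => haystack.includes(k))) hrefs.add(href);
  }
  return [...hrefs];
}

// Usage with made-up markup resembling the two shapes from the question.
const sampleHtml = `
<a href="site/one" title="no match here"><span class="img"><img alt=""></span>
<h2>Certified DOG Trainer course</h2>
<span class="tag">other stuff</span>
</a>
<a href="site/two">cat groomer</a>
`;
console.log(extractHrefsByKeywords(sampleHtml, ["dog", "trainer"]));
```

Unlike the XPath version, this checks the combined text of the whole anchor, so the keywords do not have to sit in the same text node.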
