How can I extract text from an HTML element containing a mix of `p` tags and inner text?
I'm using Reaver, a Clojure wrapper around jsoup, to scrape a website with poorly structured HTML. Here is an example of the HTML structure:
<div id="article">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div>
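This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags and some are contained directly in the div. I also need the headlines and the anchor tag text.
I know how to parse and extract the text from all the p, a, and h tags, and I can select the div and extract its inner text, but then I end up with two selections of text that I somehow need to merge.
How can I extract the text from this div so that all the text from the p, a, and h tags, as well as the inner text of the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.
Here is what I'm currently using to extract, but the div's inner text is missing from the results: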
(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))
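Note also that other unwanted elements appear in this div, e.g. aside, figure, etc. Those elements contain text, as well as nested elements with text, that should not be included in the result.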
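You could extract the entire article as a JSoup object (probably an Element), then convert it to an EDN representation using reaver/to-edn. Then you go through the :content of that and handle both strings (the result of TextNodes) and elements that have a :tag that interests you.
(Code written by vaer-k)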
(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

(defn text-elem?
  [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))

(defn extract-text
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
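The same traversal can also be written so that it skips unwanted subtrees and keeps headings, returning the fragments as a vector in document order. This is only a minimal sketch in plain Clojure, not part of the answer above: it assumes each element is a map with :tag and :content keys and that text nodes appear as plain strings, so the actual shape returned by reaver/to-edn should be checked first. The names unwanted-tags, collect-text and sample-article are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical set of tags whose whole subtree should be ignored.
(def unwanted-tags #{:aside :nav :table :figure :dl})

(defn collect-text
  "Returns a vector of trimmed, non-blank text fragments in document order,
  skipping any subtree whose :tag is in unwanted-tags.
  Assumes elements look like {:tag ... :content [...]} and text nodes are strings."
  [node]
  (cond
    ;; A text node: keep it unless it is only whitespace.
    (string? node) (let [s (str/trim node)]
                     (if (str/blank? s) [] [s]))
    ;; An unwanted element: drop it together with everything inside it.
    (contains? unwanted-tags (:tag node)) []
    ;; Any other element: recurse into its children, preserving order.
    :else (vec (mapcat collect-text (:content node)))))

;; Hand-written sample data in the assumed shape (not real Reaver output).
(def sample-article
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             "\nRaw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n"
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}]})

(collect-text sample-article)
```

Because every element that is not explicitly unwanted is descended into, headings such as h2 are picked up without having to list them.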
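You can solve this problem using the tupelo.forest library, which was presented in an "Unsession" at Clojure/Conj 2019 last week.
Below is the solution, written as a unit test. First, some declarations and the sample data: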
(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))
(def html-src
"<div id=\"article\">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div> ")
We begin by adding the HTML data (a tree) to the forest after collapsing all newlines and other whitespace. This uses the Java "TagSoup" parser internally:
(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid (add-tree-html
                     (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))
spyx-pretty shows the data in "bush" format (similar to Hiccup format):
:html-orig (hid->bush root-hid) =>
[{:tag :html}
[{:tag :body}
[{:id "article", :tag :div}
[{:tag :aside, :value "unwanted text"}]
[{:tag :p, :value "Some text"}]
[{:tag :nav}
[{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
[{:tag :p, :value "More text"}]
[{:tag :h2, :value "A headline"}]
[{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
[{:tag :p, :value "More text"}]
[{:tag :tupelo.forest/raw,
:value
" Here is a paragraph made of some raw text directly in the div "}]
[{:tag :p, :value "Another paragraph of text"}]
[{:tag :tupelo.forest/raw,
:value " More raw text and this one has an "}]
[{:tag :a, :value "anchor tag"}]
[{:tag :tupelo.forest/raw, :value " inside "}]
[{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
[{:tag :p, :value "Etc etc"}]]]]
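So we can see that the data has been loaded correctly. Now we want to remove all of the unwanted nodes identified by find-paths-with, and then print the modified tree: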
      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))
:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
[{:tag :body}
[{:id "article", :tag :div}
[{:tag :p, :value "Some text"}]
[{:tag :p, :value "More text"}]
[{:tag :h2, :value "A headline"}]
[{:tag :p, :value "More text"}]
[{:tag :tupelo.forest/raw,
:value
" Here is a paragraph made of some raw text directly in the div "}]
[{:tag :p, :value "Another paragraph of text"}]
[{:tag :tupelo.forest/raw,
:value " More raw text and this one has an "}]
[{:tag :a, :value "anchor tag"}]
[{:tag :tupelo.forest/raw, :value " inside "}]
[{:tag :p, :value "Etc etc"}]]]]
At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:
      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})
To verify, we compare the text nodes found (ignoring whitespace) against the desired result:
(is-nonblank= (str/join \space @txt-accum)
"Some text
More text
A headline
More text
Here is a paragraph made of some raw text directly in the div
Another paragraph of text
More raw text and this one has an
anchor tag
inside
Etc etc")))))
For more details, please see the README and the API docs. Be sure to also view the Lightning Talk for an overview.
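The two-phase idea used in this answer (first prune the unwanted subtrees, then walk what is left) can also be tried without tupelo.forest. The following is a rough sketch using only clojure.walk and tree-seq from the standard library; it assumes a tree of maps with :tag and :content keys and strings for text nodes, which is not the forest's internal representation. The names prune-unwanted, surviving-text and sample-tree are hypothetical.

```clojure
(require '[clojure.walk :as walk])

;; Same four tags that the find-paths-with predicate checks for.
(def unwanted-tag? #{:aside :nav :figure :dl})

(defn prune-unwanted
  "Phase 1: removes every child element whose :tag is unwanted,
  at any depth. The root node itself is never removed."
  [tree]
  (walk/postwalk
    (fn [x]
      (if (and (map? x) (contains? x :content))
        (update x :content
                (fn [children]
                  (vec (remove #(unwanted-tag? (:tag %)) children))))
        x))
    tree))

(defn surviving-text
  "Phase 2: depth-first, pre-order walk that keeps only the strings,
  so fragments come out in document order."
  [tree]
  (vec (filter string? (tree-seq map? :content tree))))

;; Hand-written sample in the assumed shape.
(def sample-tree
  {:tag :div
   :content [{:tag :nav
              :content [{:tag :ol
                         :content [{:tag :h2 :content ["unwanted text"]}]}]}
             {:tag :h2 :content ["A headline"]}
             "raw text"
             {:tag :figure :content ["unwanted text"]}
             {:tag :p :content ["Etc etc"]}]})

(surviving-text (prune-unwanted sample-tree))
```

Keeping the two phases separate makes it easy to print the pruned tree for inspection before collecting the text, as the answer does with spyx-pretty.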
Try this if you want to do it the JavaScript way :)
var element = document.getElementById('article');
var clnElem = element.cloneNode(true);
Array.prototype.forEach.call(clnElem.children, function (elem) {
  if (elem.tagName === 'ASIDE' || elem.tagName === 'NAV' || elem.tagName === 'FIGURE') {
    elem.innerText = '';
  }
});
console.log(clnElem.textContent || clnElem.innerText);
<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>
You can achieve this with JavaScript's textContent or innerText property. See the code snippet below:
var element = document.getElementById('article');
console.log(element.textContent || element.innerText);
<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>