How can I extract text from an HTML element containing a mix of `p` tags and inner text?

I'm using Reaver, a Clojure wrapper around jsoup, to scrape a website with poorly structured HTML. Here is an example of some of the HTML structure:

<div id="article">
  <aside>unwanted text</aside>
  <p>Some text</p>
  <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
  <p>More text</p>
  <h2>A headline</h2>
  <figure><figcaption>unwanted text</figcaption></figure>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <dl>
    <dd>unwanted text</dd>
  </dl>
  <p>Etc etc</p>
</div>

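This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags and some are contained directly within the div. I also need the headlines and the anchor tag text.

I know how to parse and extract the text from all of the p, a, and h tags, and I can select the div and extract its inner text, but the problem is that I end up with two selections of text that I then need to merge somehow.

How can I extract the text from this div so that all of the text from the p, a, and h tags, as well as the inner text of the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.

Here is what I'm currently using to extract, but the inner div text is missing from the results: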

(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))

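Note also that other unwanted elements appear in this div, e.g. aside, figure, etc. These elements contain text, as well as nested elements with text, none of which should be included in the result.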

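You can extract the entire article as a JSoup object (likely an Element) and then convert it to an EDN representation using reaver/to-edn. Then you go through its :content and handle both the strings (the result of TextNodes) and the elements that have a :tag that interests you.

(Code by vaer-k)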

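;; The namespace name is illustrative; the unqualified calls below assume
;; these Reaver functions are referred.
(ns example.scrape
  (:require [reaver :refer [parse extract edn]]))

(defn get-article
  "Fetches url and returns the #article element as EDN."
  [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

(defn text-elem?
  "True for raw strings (text nodes) and for elements whose tag is kept.
  Add further tags (e.g. :h2) to the set if their text is wanted too."
  [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))

(defn extract-text
  "Concatenates, in document order, the text of all kept children,
  recursing into kept elements."
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))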

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
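
The code above returns one concatenated string. If separate paragraphs are wanted, as the question asks, the same walk over the EDN data can be adapted. The following is only a sketch under stated assumptions: it assumes the EDN shape described above (text nodes as strings, elements as maps with :tag and :content), it runs on hand-written sample data rather than real Reaver output, and the names `unwanted-tags`, `block-tags`, `node-text`, and `article-paragraphs` are made up for illustration.

```clojure
(ns example.article-text
  (:require [clojure.string :as str]))

;; Hand-written stand-in for the EDN of the #article div (assumed shape).
(def sample-article
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "\n  Here is raw text directly in the div\n  "
             {:tag :p :content ["Another paragraph"]}
             "\n  More raw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n  "
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             {:tag :p :content ["Etc etc"]}]})

(def unwanted-tags #{:aside :nav :table :figure :dl})
(def block-tags #{:p :h1 :h2 :h3 :h4 :h5 :h6})

(defn node-text
  "All text under node in document order. Unwanted subtrees and anything
  that is neither a string nor a tagged element contribute nothing."
  [node]
  (cond
    (string? node) node
    (or (not (map? node))
        (nil? (:tag node))
        (unwanted-tags (:tag node))) ""
    :else (apply str (map node-text (:content node)))))

(defn- clean [s]
  (str/trim (str/replace s #"\s+" " ")))

(defn article-paragraphs
  "Returns a vector of paragraphs. Each block element becomes its own
  paragraph; runs of raw text and inline elements between blocks are
  joined into one paragraph."
  [{:keys [content]}]
  (let [step (fn [[paras pending] child]
               (if (and (map? child) (block-tags (:tag child)))
                 ;; a block closes the pending inline run, then adds itself
                 [(conj paras pending (node-text child)) ""]
                 [paras (str pending (node-text child))]))
        [paras pending] (reduce step [[] ""] content)]
    (->> (conj paras pending)
         (map clean)
         (remove str/blank?)
         vec)))

(article-paragraphs sample-article)
;; => ["Some text" "A headline" "Here is raw text directly in the div"
;;     "Another paragraph" "More raw text with an anchor tag inside" "Etc etc"]
```

One consequence of this design is that an unwanted element sitting between two runs of raw text does not split them into separate paragraphs; whether that matters depends on the pages being scraped.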

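You may be able to solve this using the tupelo.forest library, which was presented in an "Unsession" at Clojure/Conj 2019 just last week.

Below is a solution written as a unit test. First some declarations and sample data: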

(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))

(def html-src
  "<div id=\"article\">
    <aside>unwanted text</aside>
    <p>Some text</p>
    <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
    <p>More text</p>
    <h2>A headline</h2>
    <figure><figcaption>unwanted text</figcaption></figure>
    <p>More text</p>
    Here is a paragraph made of some raw text directly in the div
    <p>Another paragraph of text</p>
    More raw text and this one has an <a>anchor tag</a> inside
    <dl>
    <dd>unwanted text</dd>
    </dl>
    <p>Etc etc</p>
  </div> ")

To start off, we add the HTML data (a tree) to the forest after removing all newlines, etc. This uses the Java "TagSoup" parser internally:

(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html
                                (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))

spyx-pretty shows the "bush" format of the data (similar to Hiccup format):

:html-orig (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
   [{:tag :p, :value "Etc etc"}]]]]

So we can see the data has been loaded correctly. Now, we want to remove all of the unwanted nodes identified by find-paths-with. Then, print the modified tree:

      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))

:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :p, :value "Some text"}]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :p, :value "Etc etc"}]]]]
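
To make the pruning step concrete without the library, here is a plain-Clojure sketch of the same idea applied to a bush-shaped vector like the one printed above. It is not how tupelo.forest works internally; `prune-bush` is a hypothetical helper, and the sample data is a shortened, hand-written copy of the output shown earlier.

```clojure
(ns example.prune-bush)

;; A bush node is a vector: an attribute map followed by child nodes.
(def bush
  [{:tag :div :id "article"}
   [{:tag :aside :value "unwanted text"}]
   [{:tag :p :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2 :value "unwanted text"}]]]]
   [{:tag :h2 :value "A headline"}]])

(defn prune-bush
  "Returns the bush with every subtree whose root tag is in `unwanted`
  removed, keeping the remaining nodes in their original order."
  [unwanted [attrs & kids]]
  (into [attrs]
        (comp (remove #(unwanted (:tag (first %))))
              (map #(prune-bush unwanted %)))
        kids))

(prune-bush #{:aside :nav :figure :dl} bush)
;; => [{:tag :div, :id "article"}
;;     [{:tag :p, :value "Some text"}]
;;     [{:tag :h2, :value "A headline"}]]
```

Note that the h2 nested inside nav disappears with its parent, while the top-level h2 survives, because the check is made on the root of each subtree.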

At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:

      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})

For verification, we compare the text nodes found (ignoring whitespace) to the desired result:

        (is-nonblank=  (str/join \space @txt-accum)
          "Some text
           More text
           A headline
           More text
           Here is a paragraph made of some raw text directly in the div
           Another paragraph of text
           More raw text and this one has an
           anchor tag
            inside
           Etc etc")))))
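
The walk-and-compare steps can also be illustrated in plain Clojure. The sketch below collects the :value strings from a bush-shaped vector with tree-seq instead of an atom, and compares strings after collapsing whitespace, in the spirit of is-nonblank=. The helper names `bush-values`, `collapse-ws`, and `same-ignoring-whitespace?` are hypothetical and not part of tupelo, and the data is a shortened hand-written sample.

```clojure
(ns example.compare-text
  (:require [clojure.string :as str]))

(def cleaned-bush
  [{:tag :div}
   [{:tag :p :value "Some text"}]
   [{:tag :tupelo.forest/raw :value " More raw text and this one has an "}]
   [{:tag :a :value "anchor tag"}]
   [{:tag :tupelo.forest/raw :value " inside "}]])

(defn bush-values
  "Depth-first, in-order vector of every :value found in the bush."
  [bush]
  (->> (tree-seq vector? rest bush)   ; children are everything after the attrs map
       (keep (comp :value first))     ; skip nodes that carry no :value
       vec))

(defn collapse-ws
  "Replaces each run of whitespace with one space and trims the ends."
  [s]
  (str/trim (str/replace s #"\s+" " ")))

(defn same-ignoring-whitespace?
  [a b]
  (= (collapse-ws a) (collapse-ws b)))

(same-ignoring-whitespace?
  (str/join \space (bush-values cleaned-bush))
  "Some text
   More raw text and this one has an
   anchor tag
    inside")
;; => true
```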

For more details, see the README file and the API docs. Be sure to also view the Lightning Talk for an overview.

Try this if you want to do it the JavaScript way :):

var element = document.getElementById('article');
var clnElem = element.cloneNode(true);
Array.prototype.forEach.call(clnElem.children, function (elem) {
  if (elem.tagName === 'ASIDE' || elem.tagName === 'NAV' || elem.tagName === 'FIGURE') {
    elem.innerText = '';
  }
});
console.log(clnElem.textContent || clnElem.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>

You can achieve this using JavaScript's textContent or innerText property; see the code snippet below:

var element = document.getElementById('article');
console.log(element.textContent || element.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>

