
How can I extract text from an HTML element containing a mix of `p` tags and inner text?

I'm using Reaver, a Clojure wrapper around jsoup, to scrape a poorly structured HTML website. Here is an example of some of the HTML structure:

<div id="article">
  <aside>unwanted text</aside>
  <p>Some text</p>
  <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
  <p>More text</p>
  <h2>A headline</h2>
  <figure><figcaption>unwanted text</figcaption></figure>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <dl>
    <dd>unwanted text</dd>
  </dl>
  <p>Etc etc</p>
</div>

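This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags and some are contained directly in the div. I also need the headlines and the anchor tag text.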

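I know how to parse and extract the text from all of the p, a, and h tags, and I can select the div and extract its inner text, but then I end up with two separate selections of text that I need to merge somehow.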

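How can I extract the text from this div so that all of the text from the p, a, and h tags, as well as the inner text directly on the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.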

Here is what I'm currently using to extract, but the inner div text is missing from the results:

(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))

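Also note that other unwanted elements appear in this div, e.g. aside, figure, etc. These elements contain text, as well as nested elements with text, that should not be included in the result.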

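You can extract the entire article as a JSoup object (probably an Element), then convert it to an EDN representation using reaver/to-edn. Then you go through its :content and handle both the strings (the result of TextNodes) and the elements whose :tag interests you.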

(Code written by vaer-k)

(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

(defn text-elem?
  [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))

(defn extract-text
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
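
The code above joins everything into a single string and does not pick up headings. As a minimal sketch of one way to get separate paragraphs in document order, assuming the EDN has the shape described above (plain strings for text nodes, maps with :tag and :content for elements): the names block-tags, inline-tags, node-text, paragraphs, and sample-article are hypothetical, and the sample data is hand-written rather than actual Reaver output.

```clojure
(require '[clojure.string :as str])

;; Assumption: these tag sets are illustrative; adjust them for the real site.
(def block-tags  #{:p :h1 :h2 :h3 :h4 :h5 :h6})
(def inline-tags #{:a :b :i})

(defn node-text
  "All text under a node, in document order."
  [node]
  (if (string? node)
    node
    (apply str (map node-text (:content node)))))

(defn inline?
  "True for raw strings and for inline elements such as <a>."
  [node]
  (or (string? node) (contains? inline-tags (:tag node))))

(defn paragraphs
  "Vector of paragraph strings for the direct children of an element.
  Consecutive raw strings and inline elements are merged into one paragraph;
  each wanted block element becomes its own paragraph; everything else
  (aside, nav, figure, dl, ...) is dropped."
  [{:keys [content]}]
  (->> content
       (partition-by inline?)
       (mapcat (fn [run]
                 (if (inline? (first run))
                   [(apply str (map node-text run))]
                   (->> run
                        (filter #(contains? block-tags (:tag %)))
                        (map node-text)))))
       (map str/trim)
       (remove str/blank?)
       vec))

;; Hand-written sample in the assumed shape (not real Reaver output).
(def sample-article
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "\n  Here is raw text directly in the div\n  "
             {:tag :p :content ["Another paragraph of text"]}
             "\n  More raw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n  "
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             {:tag :p :content ["Etc etc"]}]})

(paragraphs sample-article)
;; => ["Some text" "A headline" "Here is raw text directly in the div"
;;     "Another paragraph of text" "More raw text with an anchor tag inside"
;;     "Etc etc"]
```

With real data, something like (paragraphs (get-article url)) should work if the EDN really has this shape. Whitespace-only text nodes between block elements form empty runs and are filtered out by the final blank check.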

You can solve this problem using the tupelo.forest library, which was presented in an "Unsession" at Clojure/Conj 2019 last week.

Below is the solution written as a unit test. First, some declarations and the sample data:

(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))

(def html-src
  "<div id=\"article\">
    <aside>unwanted text</aside>
    <p>Some text</p>
    <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
    <p>More text</p>
    <h2>A headline</h2>
    <figure><figcaption>unwanted text</figcaption></figure>
    <p>More text</p>
    Here is a paragraph made of some raw text directly in the div
    <p>Another paragraph of text</p>
    More raw text and this one has an <a>anchor tag</a> inside
    <dl>
    <dd>unwanted text</dd>
    </dl>
    <p>Etc etc</p>
  </div> ")

First, we add the HTML data (a tree) to the forest after removing all newlines, etc. Internally, this uses the Java "TagSoup" parser:

(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html
                                (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))

spyx-pretty shows the "bush" format of the data (similar to Hiccup format):

:html-orig (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
   [{:tag :p, :value "Etc etc"}]]]]

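So we can see that the data has been loaded correctly. Now, we want to remove all of the unwanted nodes identified by find-paths-with, and then print the modified tree: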

      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))

:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :p, :value "Some text"}]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :p, :value "Etc etc"}]]]]

At this point, we simply walk the tree and accumulate all surviving text nodes into a vector:

      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})

To verify, we compare the text nodes that were found (ignoring whitespace) to the desired result:

        (is-nonblank=  (str/join \space @txt-accum)
          "Some text
           More text
           A headline
           More text
           Here is a paragraph made of some raw text directly in the div
           Another paragraph of text
           More raw text and this one has an
           anchor tag
            inside
           Etc etc")))))
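
The same two steps (prune the unwanted subtrees, then collect the surviving text in order) can also be sketched in plain Clojure without any library. This is only an illustration and does not use the tupelo.forest API: it assumes a tree of plain strings and maps with :tag and :content, and the names unwanted-tags, prune, text-nodes, and sample-tree are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Same unwanted tags as in the unit test above.
(def unwanted-tags #{:aside :nav :figure :dl})

(defn prune
  "Returns the node with every subtree whose :tag is unwanted removed."
  [node]
  (if (string? node)
    node
    (update node :content
            (fn [children]
              (->> children
                   (remove #(contains? unwanted-tags (:tag %)))
                   (mapv prune))))))

(defn text-nodes
  "Surviving text nodes in document order, trimmed, with blanks dropped."
  [node]
  (->> (tree-seq map? :content node) ; depth-first, pre-order walk
       (filter string?)
       (map str/trim)
       (remove str/blank?)
       vec))

;; Hand-written sample in the assumed shape.
(def sample-tree
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :nav
              :content [{:tag :ol
                         :content [{:tag :li
                                    :content [{:tag :h2
                                               :content ["unwanted text"]}]}]}]}
             {:tag :h2 :content ["A headline"]}
             " More raw text and this one has an "
             {:tag :a :content ["anchor tag"]}
             " inside "
             {:tag :p :content ["Etc etc"]}]})

(text-nodes (prune sample-tree))
;; => ["Some text" "A headline" "More raw text and this one has an"
;;     "anchor tag" "inside" "Etc etc"]
```

As in the forest version, the anchor text comes out as its own entry rather than being merged into the surrounding raw text.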

For more details, see the README file and the API docs. Be sure to also watch the Lightning Talk for an overview.

Try this if you want to do it the JavaScript way :)

var element = document.getElementById('article');
var clnElem = element.cloneNode(true);
Array.prototype.forEach.call(clnElem.children, function (elem) {
  if (elem.tagName === 'ASIDE' || elem.tagName === 'NAV' || elem.tagName === 'FIGURE') {
    elem.innerText = '';
  }
});
console.log(clnElem.textContent || clnElem.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>

You can use JavaScript's textContent or innerText property to achieve this; see the following snippet:

var element = document.getElementById('article');
console.log(element.textContent || element.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>
