How can I extract text from an HTML element containing a mix of `p` tags and inner text?

I'm scraping a website with some poorly structured HTML using Reaver, a Clojure wrapper around jsoup. Here is an example of some of the HTML structure:

<div id="article">
  <aside>unwanted text</aside>
  <p>Some text</p>
  <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
  <p>More text</p>
  <h2>A headline</h2>
  <figure><figcaption>unwanted text</figcaption></figure>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <dl>
    <dd>unwanted text</dd>
  </dl>
  <p>Etc etc</p>
</div>

This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags, and some are contained directly within the div. I also need the headlines and anchor tag text.

I know how to parse and extract the text from all of the p, a, and h tags, and I can select the div and extract the inner text from it, but the problem is that I end up with two selections of text that I need to merge somehow.

How can I extract the text from this div, so that all of the text from the p, a, and h tags, as well as the inner text of the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.

Here is what I am currently using to extract the text, but the div's inner text is missing from the results:

(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))

Note also that additional unwanted elements appear in this div, e.g., aside, figure, etc. These elements contain text, as well as nested elements with text, that should not be included in the result.

You could extract the entire article as a jsoup object (likely an Element), then convert it to an EDN representation using reaver/to-edn. Then you go through the :content of that and handle both strings (the result of TextNodes) and elements that have a :tag that interests you.

(Code by vaer-k)

(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

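(defn text-elem?
  [element]
  (or (string? element)
      ;; :h1 to :h6 are added here (they were not in the original answer)
      ;; because the question also asks for headline text.
      (contains? #{:p :a :b :i :h1 :h2 :h3 :h4 :h5 :h6} (:tag element))))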

(defn extract-text
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
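
The code above concatenates everything into a single string. If separate paragraphs are needed, as the question asks, a rough sketch along the following lines may help. It assumes the EDN has the shape used above (elements as maps with :tag and :content, text nodes as plain strings). Here sample-article is hand-written stand-in data, not real Reaver output, and unwanted-tags, block-tags, node-text, and paragraphs are hypothetical names chosen for illustration.

```clojure
(require '[clojure.string :as str])

;; Hand-written stand-in for the EDN that reaver/to-edn is ASSUMED to
;; produce: elements are maps with :tag and :content, text nodes are strings.
(def sample-article
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "\n  Here is raw text directly in the div\n  "
             {:tag :p :content ["Another paragraph of text"]}
             "\n  More raw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n  "
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             {:tag :p :content ["Etc etc"]}]})

(def unwanted-tags #{:aside :nav :table :figure :dl})
(def block-tags #{:p :h1 :h2 :h3 :h4 :h5 :h6})

(defn node-text
  "Concatenates all text under node, skipping unwanted subtrees."
  [node]
  (cond
    (string? node) node
    (unwanted-tags (:tag node)) ""
    :else (apply str (map node-text (:content node)))))

(defn paragraphs
  "Returns the article text as a vector of paragraphs in document order.
  Each block-level child becomes its own paragraph; a run of raw text and
  inline elements between two blocks is merged into one paragraph."
  [article]
  (->> (:content article)
       ;; drop unwanted children first so they do not split a text run
       (remove #(unwanted-tags (:tag %)))
       ;; group consecutive block children and consecutive non-block children
       (partition-by #(boolean (block-tags (:tag %))))
       (mapcat (fn [group]
                 (if (block-tags (:tag (first group)))
                   (map node-text group)
                   [(apply str (map node-text group))])))
       ;; normalise whitespace and drop whitespace-only paragraphs
       (map #(str/trim (str/replace % #"\s+" " ")))
       (remove str/blank?)
       vec))
```

Calling (paragraphs sample-article) keeps the anchor text inside the raw-text paragraph it belongs to, rather than emitting it as a separate fragment. Which tags count as block-level or unwanted is a judgment call and would need adjusting for the real wiki.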

You can solve this using the tupelo.forest library, which was presented in an "Unsession" at Clojure/Conj 2019 just last week.

Below is the solution written as a unit test. First some declarations and the sample data:

(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))

(def html-src
  "<div id=\"article\">
    <aside>unwanted text</aside>
    <p>Some text</p>
    <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
    <p>More text</p>
    <h2>A headline</h2>
    <figure><figcaption>unwanted text</figcaption></figure>
    <p>More text</p>
    Here is a paragraph made of some raw text directly in the div
    <p>Another paragraph of text</p>
    More raw text and this one has an <a>anchor tag</a> inside
    <dl>
    <dd>unwanted text</dd>
    </dl>
    <p>Etc etc</p>
  </div> ")

To start off, we add the HTML data (a tree) to the forest after removing all newlines, etc. This uses the Java "TagSoup" parser internally:

(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html
                                (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))

The spyx-pretty shows the "bush" format of the data (similar to Hiccup format):

:html-orig (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
   [{:tag :p, :value "Etc etc"}]]]]

So we can see the data has been loaded correctly. Now, we want to remove all of the unwanted nodes as identified by find-paths-with. Then, print the modified tree:

      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))

:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :p, :value "Some text"}]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :p, :value "Etc etc"}]]]]

At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:

      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})

To verify, we compare the found text nodes (ignoring whitespace) to the desired result:

        (is-nonblank=  (str/join \space @txt-accum)
          "Some text
           More text
           A headline
           More text
           Here is a paragraph made of some raw text directly in the div
           Another paragraph of text
           More raw text and this one has an
           anchor tag
            inside
           Etc etc")))))

For more details, see the README file and the API docs. Be sure to also view the Lightning Talk for an overview.
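
For readers who only want the idea, the same two steps (prune the unwanted subtrees, then collect the remaining :value strings in order) can be sketched in plain Clojure over bush-shaped data. This is not the tupelo.forest API. The names bush, tags-to-drop, prune, and collect-values are hypothetical, and the data is a shortened, hand-written copy of the bush shown earlier.

```clojure
;; Bush-shaped data: each node is a vector [attrs-map & child-nodes].
(def bush
  [{:tag :div, :id "article"}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :h2, :value "A headline"}]
   [{:tag :tupelo.forest/raw, :value " raw text in the div "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :p, :value "Etc etc"}]])

(def tags-to-drop #{:aside :nav :figure :dl})

(defn prune
  "Step 1: returns the tree with every unwanted subtree removed."
  [[attrs & children]]
  (into [attrs]
        (comp (remove (fn [[child-attrs]] (tags-to-drop (:tag child-attrs))))
              (map prune))
        children))

(defn collect-values
  "Step 2: depth-first, pre-order collection of every string :value."
  [[attrs & children]]
  (into (if (string? (:value attrs)) [(:value attrs)] [])
        (mapcat collect-values)
        children))
```

Evaluating (collect-values (prune bush)) yields the surviving text in document order. Unlike the forest version, this works on immutable data, so nothing needs to be reset between runs.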

Try this if you wish to do it the JavaScript way:

var element = document.getElementById('article');
var clnElem = element.cloneNode(true);
Array.prototype.forEach.call(clnElem.children, function (elem) {
  if (elem.tagName === 'ASIDE' || elem.tagName === 'NAV' || elem.tagName === 'FIGURE') {
    elem.innerText = '';
  }
});
console.log(clnElem.textContent || clnElem.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>

You can achieve this using the textContent or innerText property in JavaScript. Please find the code snippet below:

var element = document.getElementById('article');
console.log(element.textContent || element.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>
