How can I extract text from an HTML element containing a mix of `p` tags and inner text?

I'm scraping a website with some poorly structured HTML using Reaver, a Clojure wrapper around jsoup. Here is an example of the HTML structure:

<div id="article">
  <aside>unwanted text</aside>
  <p>Some text</p>
  <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
  <p>More text</p>
  <h2>A headline</h2>
  <figure><figcaption>unwanted text</figcaption></figure>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <dl>
    <dd>unwanted text</dd>
  </dl>
  <p>Etc etc</p>
</div>

This div represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in p tags, and some are contained directly within the div. I also need the headlines and anchor tag text.

I know how to parse and extract the text from all of the p, a, and h tags, and I can select the div and extract its inner text, but then I end up with two separate selections of text that I need to merge somehow.

How can I extract the text from this div so that all of the text from the p, a, and h tags, as well as the raw text directly inside the div, is extracted in order? The result should be paragraphs of text in the same order as they appear in the HTML.

Here is what I am currently using to extract, but the inner div text is missing from the results:

(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))

Note also that additional unwanted elements appear in this div, e.g. aside, figure, etc. These elements contain text, as well as nested elements with text, that should not be included in the result.

You could extract the entire article as a JSoup object (likely an Element), then convert it to an EDN representation using reaver/to-edn. Then you walk the :content of that and handle both strings (which come from TextNodes) and elements whose :tag interests you.

(Code by vaer-k)

(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

(defn text-elem?
  [element]
  (or (string? element)
      (contains? #{:p :a :b :i} (:tag element))))

(defn extract-text
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))
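
The code above concatenates everything into a single string, and its tag set covers only p, a, b, and i. As a minimal sketch of the same idea that also keeps headlines and returns separate paragraphs, the following walks an EDN-like tree in document order. It assumes nodes shaped like {:tag ... :content [...]} with text nodes as plain strings, as described in this answer. The names block-tags, inline-tags, node-text, and paragraphs are made up for illustration and are not part of Reaver, and the sample map is hand-written rather than real to-edn output.

```clojure
(ns example.article-text
  (:require [clojure.string :as str]))

;; Assumption: block-level tags each form their own paragraph,
;; inline tags are merged into the surrounding raw text.
(def block-tags  #{:p :h1 :h2 :h3 :h4 :h5 :h6})
(def inline-tags #{:a :b :i :em :strong :span})

(defn node-text
  "Concatenates all text beneath a node, in document order."
  [node]
  (if (string? node)
    node
    (apply str (map node-text (:content node)))))

(defn clean
  "Collapses runs of whitespace and trims both ends."
  [s]
  (str/trim (str/replace s #"\s+" " ")))

(defn add-para
  "Appends s to out unless it is blank."
  [out s]
  (if (str/blank? s) out (conj out s)))

(defn flush-buf
  "Moves any buffered raw text into the output as one paragraph."
  [{:keys [out buf]}]
  {:out (add-para out (clean buf)) :buf ""})

(defn paragraphs
  "Returns the paragraphs of an article node in document order.
  Elements that are neither block nor inline (aside, nav, dl, ...)
  are skipped and treated as paragraph breaks."
  [{:keys [content]}]
  (let [step (fn [acc child]
               (cond
                 (string? child)
                 (update acc :buf str child)

                 (contains? inline-tags (:tag child))
                 (update acc :buf str (node-text child))

                 (contains? block-tags (:tag child))
                 (-> (flush-buf acc)
                     (update :out add-para (clean (node-text child))))

                 :else
                 (flush-buf acc)))]
    (:out (flush-buf (reduce step {:out [] :buf ""} content)))))

;; Hand-written sample in the assumed shape.
(def sample
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "\n  Raw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n  "
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             {:tag :p :content ["Etc etc"]}]})

(paragraphs sample)
;; => ["Some text" "A headline" "Raw text with an anchor tag inside" "Etc etc"]
```

Buffering the raw strings and inline elements until the next block element is what keeps "Raw text with an anchor tag inside" together as one paragraph instead of three fragments.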

You can solve this using the tupelo.forest library, which was presented in an "Unsession" at Clojure/Conj 2019 just last week.

Below is the solution written as a unit test. First some declarations and the sample data:

(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))

(def html-src
  "<div id=\"article\">
    <aside>unwanted text</aside>
    <p>Some text</p>
    <nav><ol><li><h2>unwanted text</h2></li></ol></nav>
    <p>More text</p>
    <h2>A headline</h2>
    <figure><figcaption>unwanted text</figcaption></figure>
    <p>More text</p>
    Here is a paragraph made of some raw text directly in the div
    <p>Another paragraph of text</p>
    More raw text and this one has an <a>anchor tag</a> inside
    <dl>
    <dd>unwanted text</dd>
    </dl>
    <p>Etc etc</p>
  </div> ")

To start off, we add the HTML data (a tree) to the forest after collapsing newlines and other runs of whitespace. This uses the Java "TagSoup" parser internally:

(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html
                                (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))

The spyx-pretty shows the "bush" format of the data (similar to hiccup format):

:html-orig (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :aside, :value "unwanted text"}]
   [{:tag :p, :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
   [{:tag :p, :value "Etc etc"}]]]]

So we can see the data has been loaded correctly. Now we remove all of the unwanted nodes identified by find-paths-with, then print the modified tree:

      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))

:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
 [{:tag :body}
  [{:id "article", :tag :div}
   [{:tag :p, :value "Some text"}]
   [{:tag :p, :value "More text"}]
   [{:tag :h2, :value "A headline"}]
   [{:tag :p, :value "More text"}]
   [{:tag :tupelo.forest/raw,
     :value
     " Here is a paragraph made of some raw text directly in the div "}]
   [{:tag :p, :value "Another paragraph of text"}]
   [{:tag :tupelo.forest/raw,
     :value " More raw text and this one has an "}]
   [{:tag :a, :value "anchor tag"}]
   [{:tag :tupelo.forest/raw, :value " inside "}]
   [{:tag :p, :value "Etc etc"}]]]]
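
For readers who want to see the pruning step without the library, here is a small sketch that does the equivalent on plain data in the printed "bush" shape, where each node is a vector whose first element is an attribute map and whose remaining elements are child nodes. This is only an illustration of the idea, not how tupelo.forest implements remove-path-subtree. The names unwanted-tags and prune-bush are hypothetical.

```clojure
;; Tags whose whole subtree should be dropped (taken from the question).
(def unwanted-tags #{:aside :nav :figure :dl})

(defn prune-bush
  "Returns a copy of a bush-shaped tree with every subtree whose
  root tag is in unwanted-tags removed. Sibling order is preserved."
  [[attrs & children]]
  (into [attrs]
        (comp (remove #(contains? unwanted-tags (:tag (first %))))
              (map prune-bush))
        children))

;; Hand-written sample in the same shape as the printed output above.
(def bush
  [{:tag :div :id "article"}
   [{:tag :aside :value "unwanted text"}]
   [{:tag :p :value "Some text"}]
   [{:tag :nav}
    [{:tag :ol} [{:tag :li} [{:tag :h2 :value "unwanted text"}]]]]
   [{:tag :h2 :value "A headline"}]])

(prune-bush bush)
;; => [{:tag :div, :id "article"}
;;     [{:tag :p, :value "Some text"}]
;;     [{:tag :h2, :value "A headline"}]]
```

Because the nav subtree is dropped as a whole, the h2 nested inside it disappears too, while the top-level h2 survives.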

At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:

      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})

To verify, we compare the found text nodes (ignoring whitespace) to the desired result:

        (is-nonblank=  (str/join \space @txt-accum)
          "Some text
           More text
           A headline
           More text
           Here is a paragraph made of some raw text directly in the div
           Another paragraph of text
           More raw text and this one has an
           anchor tag
            inside
           Etc etc")))))
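
The walk-and-accumulate step can also be sketched on plain bush-shaped data without an atom. The example below assumes the same vector shape as the printed output (attribute map first, child vectors after) and uses clojure.core's tree-seq, which visits nodes depth-first in pre-order, so values come out in document order. The name bush-values is hypothetical and the sample data is hand-written.

```clojure
(require '[clojure.string :as str])

(defn bush-values
  "Returns the trimmed, non-blank :value strings of a bush-shaped
  tree in document order."
  [bush]
  (->> (tree-seq vector? rest bush)   ; every node, parents before children
       (keep #(:value (first %)))     ; nodes without a :value are skipped
       (map str/trim)
       (remove str/blank?)
       vec))

;; A shortened copy of the cleaned tree shown earlier.
(def cleaned
  [{:tag :div :id "article"}
   [{:tag :p :value "More text"}]
   [{:tag :tupelo.forest/raw :value " More raw text and this one has an "}]
   [{:tag :a :value "anchor tag"}]
   [{:tag :tupelo.forest/raw :value " inside "}]
   [{:tag :p :value "Etc etc"}]])

(bush-values cleaned)
;; => ["More text" "More raw text and this one has an" "anchor tag" "inside" "Etc etc"]
```

As in the test above, the raw text around the anchor arrives as three separate entries; joining them with a space gives back the original sentence.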

For more details, see the README file and the API docs. Be sure to also view the Lightning Talk for an overview.

Try this if you want to do it in JavaScript:

var element = document.getElementById('article');
var clnElem = element.cloneNode(true);

Array.prototype.forEach.call(clnElem.children, function (elem) {
  if (elem.tagName === 'ASIDE' || elem.tagName === 'NAV' || elem.tagName === 'FIGURE') {
    elem.innerText = '';
  }
});

console.log(clnElem.textContent || clnElem.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>

You can achieve this with the element's textContent or innerText property in JavaScript. Note that both return the text of every descendant, so the unwanted elements are included as well. Here is a snippet:

var element = document.getElementById('article');
console.log(element.textContent || element.innerText);

<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>


 