How can I extract text from an HTML element containing a mix of `p` tags and inner text?
I'm scraping a website with some poorly structured HTML using a Clojure wrapper around jsoup called Reaver. Here is an example of some of the HTML structure:
<div id="article">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div>
This `div` represents an article on a wiki. I want to extract the text from it, but as you can see, some paragraphs are in `p` tags, and some are contained directly within the `div`. I also need the headlines and anchor tag text.
I know how to parse and extract the text from all of the `p`, `a`, and `h` tags, and I can select the `div` and extract the inner text from it, but the problem is that I end up with two selections of text that I need to merge somehow.
How can I extract the text from this `div` so that all of the text from the `p`, `a`, and `h` tags, as well as the inner text of the `div`, is extracted in order? The result should be paragraphs of text in the same order as in the HTML.
Here is what I am currently using to extract, but the inner `div` text is missing from the results:
(defn get-texts [url]
  (:paragraphs (extract (parse (slurp url))
                        [:paragraphs]
                        "#article > *:not(aside, nav, table, figure, dl)" text)))
Note also that additional unwanted elements appear in this `div`, e.g., `aside`, `figure`, etc. These elements contain text, as well as nested elements with text, that should not be included in the result.
You could extract the entire article as a JSoup object (likely an `Element`), then convert it to an EDN representation using `reaver/to-edn`. Then you go through the `:content` of that and handle both strings (the result of `TextNode`s) and elements that have a `:tag` that interests you.
(Code by vaer-k)
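;; Reaver functions used below
(require '[reaver :refer [parse extract edn]])

(defn get-article [url]
  (:article (extract (parse (slurp url))
                     [:article]
                     "#article"
                     edn)))

(defn text-elem?
  "True for raw strings (text nodes) and for elements whose text we keep."
  [element]
  (or (string? element)
      ;; heading tags are included so that headlines are kept, as the question asks
      (contains? #{:p :a :b :i :h1 :h2 :h3 :h4 :h5 :h6} (:tag element))))

(defn extract-text
  "Concatenates, in document order, the text of the wanted children."
  [{content :content}]
  (let [text-children (filter text-elem? content)]
    (reduce #(if (string? %2)
               (str %1 %2)
               (str %1 (extract-text %2)))
            ""
            text-children)))

(defn extract-article [url]
  (-> url
      get-article
      extract-text))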
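The code above returns one concatenated string, while the question asks for paragraphs in order. A minimal sketch of one way to get there follows. It assumes the EDN nodes are either strings or maps shaped like `{:tag ... :content [...]}`; the sample data is hand-written for illustration (real Reaver output may carry more keys), and the names `article`, `unwanted-tags`, `block-tags`, `node-text`, and `paragraphs` are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, hand-written data in the assumed {:tag ... :content [...]} shape.
(def article
  {:tag :div
   :content [{:tag :aside :content ["unwanted text"]}
             {:tag :p :content ["Some text"]}
             {:tag :h2 :content ["A headline"]}
             "\nHere is raw text directly in the div\n"
             {:tag :p :content ["Another paragraph"]}
             "\nRaw text with an "
             {:tag :a :content ["anchor tag"]}
             " inside\n"
             {:tag :dl :content [{:tag :dd :content ["unwanted text"]}]}
             {:tag :p :content ["Etc etc"]}]})

(def unwanted-tags #{:aside :nav :table :figure :dl})
(def block-tags #{:p :h1 :h2 :h3 :h4 :h5 :h6})

(defn node-text
  "All text beneath node in document order, skipping unwanted subtrees."
  [node]
  (cond
    (string? node) node
    (contains? unwanted-tags (:tag node)) ""
    :else (apply str (map node-text (:content node)))))

(defn block? [node]
  (contains? block-tags (:tag node)))

(defn clean
  "Collapses runs of whitespace and trims the ends."
  [s]
  (str/trim (str/replace s #"\s+" " ")))

(defn paragraphs
  "Vector of paragraph strings in document order. Each block element is its
  own paragraph; a run of raw strings and inline elements forms one paragraph."
  [{:keys [content]}]
  (->> content
       (remove #(contains? unwanted-tags (:tag %)))
       (partition-by block?)
       (mapcat (fn [run]
                 (if (block? (first run))
                   (map node-text run)
                   [(apply str (map node-text run))])))
       (map clean)
       (remove str/blank?)
       vec))

(paragraphs article)
```

Grouping with `partition-by` keeps the raw text and the `a` element that sits inside it together as a single paragraph, instead of splitting the sentence at the anchor.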
You can solve this using the `tupelo.forest` library, which was presented in an "Unsession" of the Clojure/Conj 2019 just last week.
Below is the solution written as a unit test. First some declarations and the sample data:
(ns tst.demo.core
  (:use tupelo.forest tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [schema.core :as s]
    [tupelo.string :as ts]))
(def html-src
"<div id=\"article\">
<aside>unwanted text</aside>
<p>Some text</p>
<nav><ol><li><h2>unwanted text</h2></li></ol></nav>
<p>More text</p>
<h2>A headline</h2>
<figure><figcaption>unwanted text</figcaption></figure>
<p>More text</p>
Here is a paragraph made of some raw text directly in the div
<p>Another paragraph of text</p>
More raw text and this one has an <a>anchor tag</a> inside
<dl>
<dd>unwanted text</dd>
</dl>
<p>Etc etc</p>
</div> ")
To start off, we add the HTML data (a tree) to the forest after removing all newlines, etc. This uses the Java "TagSoup" parser internally:
(dotest
  (hid-count-reset)
  (with-forest (new-forest)
    (let [root-hid            (add-tree-html
                                (ts/collapse-whitespace html-src))
          unwanted-node-paths (find-paths-with root-hid [:** :*]
                                (s/fn [path :- [HID]]
                                  (let [hid  (last path)
                                        node (hid->node hid)
                                        tag  (grab :tag node)]
                                    (or
                                      (= tag :aside)
                                      (= tag :nav)
                                      (= tag :figure)
                                      (= tag :dl)))))]
      (newline) (spyx-pretty :html-orig (hid->bush root-hid))
The `spyx-pretty` shows the "bush" format of the data (similar to hiccup format):
:html-orig (hid->bush root-hid) =>
[{:tag :html}
[{:tag :body}
[{:id "article", :tag :div}
[{:tag :aside, :value "unwanted text"}]
[{:tag :p, :value "Some text"}]
[{:tag :nav}
[{:tag :ol} [{:tag :li} [{:tag :h2, :value "unwanted text"}]]]]
[{:tag :p, :value "More text"}]
[{:tag :h2, :value "A headline"}]
[{:tag :figure} [{:tag :figcaption, :value "unwanted text"}]]
[{:tag :p, :value "More text"}]
[{:tag :tupelo.forest/raw,
:value
" Here is a paragraph made of some raw text directly in the div "}]
[{:tag :p, :value "Another paragraph of text"}]
[{:tag :tupelo.forest/raw,
:value " More raw text and this one has an "}]
[{:tag :a, :value "anchor tag"}]
[{:tag :tupelo.forest/raw, :value " inside "}]
[{:tag :dl} [{:tag :dd, :value "unwanted text"}]]
[{:tag :p, :value "Etc etc"}]]]]
So we can see the data has been loaded correctly. Now, we want to remove all of the unwanted nodes as identified by the `find-paths-with`. Then, print the modified tree:
      (doseq [path unwanted-node-paths]
        (remove-path-subtree path))
      (newline) (spyx-pretty :html-cleaned (hid->bush root-hid))
:html-cleaned (hid->bush root-hid) =>
[{:tag :html}
[{:tag :body}
[{:id "article", :tag :div}
[{:tag :p, :value "Some text"}]
[{:tag :p, :value "More text"}]
[{:tag :h2, :value "A headline"}]
[{:tag :p, :value "More text"}]
[{:tag :tupelo.forest/raw,
:value
" Here is a paragraph made of some raw text directly in the div "}]
[{:tag :p, :value "Another paragraph of text"}]
[{:tag :tupelo.forest/raw,
:value " More raw text and this one has an "}]
[{:tag :a, :value "anchor tag"}]
[{:tag :tupelo.forest/raw, :value " inside "}]
[{:tag :p, :value "Etc etc"}]]]]
At this point, we simply walk the tree and accumulate any surviving text nodes into a vector:
      (let [txt-accum (atom [])]
        (walk-tree root-hid
          {:enter (fn [path]
                    (let [hid   (last path)
                          node  (hid->node hid)
                          value (:value node)] ; may not be present
                      (when (string? value)
                        (swap! txt-accum append value))))})
To verify, we compare the found text nodes (ignoring whitespace) to the desired result:
(is-nonblank= (str/join \space @txt-accum)
"Some text
More text
A headline
More text
Here is a paragraph made of some raw text directly in the div
Another paragraph of text
More raw text and this one has an
anchor tag
inside
Etc etc")))))
For more details, see the README file and the API docs. Be sure to also view the Lightning Talk for an overview.
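The prune-then-walk idea in this answer does not depend on the library itself. The following is a rough sketch of the same two steps in plain Clojure, assuming data shaped like the "bush" printout above (a vector whose first item is a node map and whose remaining items are child vectors). The sample data is abridged by hand, with `:raw` standing in for `:tupelo.forest/raw`, and the names `bush`, `unwanted-tags`, `prune`, and `texts` are hypothetical, not part of `tupelo.forest`.

```clojure
;; Hypothetical, abridged data in the bush-like shape described above.
(def bush
  [{:tag :div :id "article"}
   [{:tag :aside :value "unwanted text"}]
   [{:tag :p :value "Some text"}]
   [{:tag :nav} [{:tag :ol} [{:tag :li} [{:tag :h2 :value "unwanted text"}]]]]
   [{:tag :h2 :value "A headline"}]
   [{:tag :raw :value " Raw text with an "}]
   [{:tag :a :value "anchor tag"}]
   [{:tag :raw :value " inside "}]
   [{:tag :dl} [{:tag :dd :value "unwanted text"}]]
   [{:tag :p :value "Etc etc"}]])

(def unwanted-tags #{:aside :nav :figure :dl})

(defn prune
  "Returns the bush without any subtree whose root tag is unwanted."
  [[node & children]]
  (into [node]
        (comp (remove #(contains? unwanted-tags (:tag (first %))))
              (map prune))
        children))

(defn texts
  "Collects every string :value in depth-first (document) order."
  [bush]
  (->> (tree-seq vector? rest bush)   ; visit each subtree, parents first
       (keep (comp :value first))     ; node map is the first item of a subtree
       (filterv string?)))

(texts (prune bush))
```

Pruning first means the walk never has to know which ancestors a text node had, which is why the nested `h2` inside `nav` is dropped while the top-level `h2` survives.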
Try this if you wish to do it the JavaScript way :)
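var element = document.getElementById('article');
// Work on a deep copy so the page itself is left untouched
var clnElem = element.cloneNode(true);
// Tags whose text should be excluded (same list as the selector in the question)
var unwanted = ['ASIDE', 'NAV', 'TABLE', 'FIGURE', 'DL'];
Array.prototype.forEach.call(clnElem.children, function (elem) {
  if (unwanted.indexOf(elem.tagName) !== -1) {
    elem.textContent = '';
  }
});
console.log(clnElem.textContent || clnElem.innerText);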
<div id="article">
  <aside>...</aside>
  <p>Some text</p>
  <nav>...</nav>
  <p>More text</p>
  <h2>A headline</h2>
  <p>More text</p>
  Here is a paragraph made of some raw text directly in the div
  <p>Another paragraph of text</p>
  More raw text and this one has an <a>anchor tag</a> inside
  <p>Etc etc</p>
</div>
You can achieve this using the `textContent` or `innerText` property in JavaScript. Note that both return the text of every descendant, so text inside `aside`, `nav`, and similar elements is included as well. Please find the code snippet below:
var element = document.getElementById('article');
console.log(element.textContent || element.innerText);