How to select text without the HTML markup

Question

I am working on a web scraper (in Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I want to extract the text from this paragraph. Right now, I can use something along the lines of

//p[@class='something']//text()

But this results in each chunk of text ending up as a separate result element, like this:

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all of the text in one element, like this:

This class has some text and a few links in it.

Is there an easy or elegant way to achieve this?

Edit: Here is the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

Answer 1

You can use normalize-space() in XPath. Then

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

will produce

This class has some text and a few links in it.
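
One caveat worth keeping in mind: in XPath 1.0, passing a node-set to normalize-space() converts only the first node in document order, so the query above yields text for just the first matching paragraph. Below is a minimal sketch of applying it to each match separately. The two-paragraph snippet is a made-up example for illustration, not part of the question.

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs (an assumption for illustration).
html_snippet = (
    '<div>'
    '<p class="something">First   <strong>bold</strong> one.</p>'
    '<p class="something">Second <a href="http://www.example.com">link</a> one.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Select the elements first, then evaluate normalize-space(.) relative to each one.
paragraphs = tree.xpath("//p[@class='something']")
texts = [p.xpath("normalize-space(.)") for p in paragraphs]

for text in texts:
    print(text)
```

Each entry in texts is a single string with runs of whitespace collapsed to one space.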

Answer 2

Instead of using XPath to get the text, you can call .text_content() on the lxml element.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
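
Note that text_content() returns the text as it appears in the source, including line breaks and indentation. If the scraped HTML is formatted across several lines, the whitespace can be collapsed in Python afterwards. This is a small sketch under that assumption. The multi-line snippet is hypothetical and only loosely based on the one in the question.

```python
from lxml import html

# Hypothetical multi-line variant of the snippet (an assumption for illustration).
html_snippet = (
    '<p class="something">This class\n'
    '    has some <strong>text</strong>\n'
    '    in it.</p>'
)

tree = html.fromstring(html_snippet)
element = tree.xpath("//p[@class='something']")[0]

# text_content() keeps the original whitespace, newlines included.
raw = element.text_content()

# str.split() with no arguments splits on any whitespace run,
# so joining with a single space collapses the runs.
clean = " ".join(raw.split())

print(repr(raw))
print(clean)
```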

Answer 3

Another one-liner option for your original code: use join with an empty string as the separator:

print("".join(query_results))
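
For completeness, here is a self-contained sketch of this approach, reusing the snippet and query from the question. The variable name joined is introduced here for illustration.

```python
from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)

# The query returns one string per text node, in document order.
query_results = tree.xpath(xpath_query)

# Joining with an empty separator stitches the pieces back together unchanged.
joined = "".join(query_results)
print(joined)
```

Because the pieces are joined without a separator, this relies on the original text nodes already carrying their own spaces, as they do in this snippet.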

3 answers

Answer 1
3 2015-04-01 19:49:01

Answer 2
1 Accepted 2015-04-01 19:49:07

Answer 3
0 2015-04-01 19:50:39
