How to select text without the HTML markup

I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks like this:

<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>

I want to extract the text from this paragraph. Right now, I can use an XPath expression along these lines:

//p[@class='something']//text()

However, this results in each chunk of text ending up as a separate result element, like this:

(This class has some ,text, and a few ,links, in it.)

The desired output would contain all of the text in a single element, like this:

This class has some text and a few links in it.

Is there a simple or elegant way to achieve this?

Edit: Here is the code that produces the result given above.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']//text()"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))

You can use normalize-space() in the XPath expression:

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"

tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))

This produces:

This class has some text and a few links in it.
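One caveat: in XPath 1.0, a string function such as normalize-space() applied to a node-set uses only the first node in document order, so the query above returns the text of the first matching paragraph only. If a page has several matching paragraphs, one option is to select the elements first and evaluate normalize-space(.) relative to each. The following is a minimal sketch; the two-paragraph input is a made-up example, not taken from the question.

```python
from lxml import html

# Hypothetical input with two matching paragraphs (not from the original question).
html_snippet = (
    '<div>'
    '<p class="something">First <strong>bold</strong> paragraph.</p>'
    '<p class="something">Second   <a href="http://www.example.com">linked</a>\n paragraph.</p>'
    '</div>'
)

tree = html.fromstring(html_snippet)

# Select the elements first, then evaluate normalize-space(.) relative to
# each one, so every paragraph yields its own whitespace-normalized string.
paragraph_texts = [
    p.xpath("normalize-space(.)")
    for p in tree.xpath("//p[@class='something']")
]

for text in paragraph_texts:
    print(text)
```

Each entry in paragraph_texts is a single string with leading and trailing whitespace removed and internal runs of whitespace collapsed to one space.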

Instead of using XPath to get the text, you can call .text_content() on the lxml element.

from lxml import html

html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'

xpath_query = "//p[@class='something']"

tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))

An alternative one-liner for your original code: use join with an empty string as the separator:

print("".join(query_results))
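As a self-contained illustration of the join approach, the sketch below starts from the list of text chunks shown in the question rather than from a live query. The helper name join_text is hypothetical, and the whitespace-collapsing step is an optional extra that the answer above does not include.

```python
def join_text(chunks):
    """Concatenate text chunks and collapse runs of whitespace.

    Hypothetical helper for illustration; not part of lxml.
    """
    # "".join glues the chunks together exactly as they appear in the markup;
    # split() followed by " ".join() then normalizes any whitespace runs.
    return " ".join("".join(chunks).split())


# The chunks that //text() returns for the snippet in the question.
chunks = ["This class has some ", "text", " and a few ", "links", " in it."]

result = join_text(chunks)
print(result)
```

Joining with an empty string works here because each chunk keeps the spaces around the inline tags; joining with a space instead would produce doubled spaces.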
