How to select text without the HTML markup
I am working on a web scraper (using Python), so I have a chunk of HTML from which I am trying to extract text. One of the snippets looks something like this:
<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>
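I want to extract the text from this paragraph. Right now I can use something along the lines of
//p[@class='something']//text()
but this results in each chunk of text ending up as a separate result element, like this: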
(This class has some ,text, and a few ,links, in it.)
The desired output would contain all the text in one element, like this:
This class has some text and a few links in it.
Is there an easy or elegant way to achieve this?
Edit: Here is the code that produces the result given above.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']//text()"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item))
You can use normalize-space() in XPath, which takes the string value of the element and collapses its whitespace. Then
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "normalize-space(//p[@class='something'])"
tree = html.fromstring(html_snippet)
print(tree.xpath(xpath_query))
will produce
This class has some text and a few links in it.
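One thing to watch for if the page has more than one matching paragraph: when normalize-space() is handed a whole node-set, XPath 1.0 converts only the first node to a string. A minimal sketch of the difference, assuming a made-up two-paragraph snippet (not from the question):

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs (not from the original question).
html_snippet = (
    '<div>'
    '<p class="something">First <strong>bold</strong> paragraph.</p>'
    '<p class="something">Second <a href="http://www.example.com">linked</a> paragraph.</p>'
    '</div>'
)
tree = html.fromstring(html_snippet)

# Applied to the whole node-set, normalize-space() uses only the first node.
first_only = tree.xpath("normalize-space(//p[@class='something'])")

# Evaluating it relative to each element yields one string per paragraph.
per_paragraph = [
    p.xpath("normalize-space(.)")
    for p in tree.xpath("//p[@class='something']")
]

print(first_only)
print(per_paragraph)
```

So if you need the text of every match, select the elements first and run normalize-space(.) on each one.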
Instead of using XPath to get the text, you can call .text_content() on the lxml element.
from lxml import html
html_snippet = '<p class="something">This class has some <strong>text</strong> and a few <a href="http://www.example.com">links</a> in it.</p>'
xpath_query = "//p[@class='something']"
tree = html.fromstring(html_snippet)
query_results = tree.xpath(xpath_query)
for item in query_results:
    print("'{0}'".format(item.text_content()))
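Note that .text_content() keeps whitespace exactly as it appears in the source, so line breaks and indentation inside the element come through. A rough sketch of tidying that up with plain string methods, assuming a made-up snippet with line breaks in it (not from the question):

```python
from lxml import html

# Hypothetical snippet with line breaks and indentation inside the paragraph.
html_snippet = (
    '<p class="something">This class has\n'
    '    some <strong>text</strong>\n'
    '    in it.</p>'
)
tree = html.fromstring(html_snippet)
paragraph = tree.xpath("//p[@class='something']")[0]

# text_content() returns the text with whitespace as found in the source.
raw_text = paragraph.text_content()

# split() with no arguments splits on any run of whitespace, so re-joining
# with single spaces collapses newlines and indentation.
clean_text = " ".join(raw_text.split())

print(repr(raw_text))
print(clean_text)
```

Whether you want the raw or the collapsed form depends on what you do with the text afterwards.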
Another one-liner that works with your original code: use join with an empty-string separator:
print("".join(query_results))
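If the query matches several paragraphs, joining the flat result list would merge all of them into one string. A possible variation, assuming a made-up two-paragraph snippet and a hypothetical helper named joined_text, is to select the elements first and join the text nodes of each one separately:

```python
from lxml import html

# Hypothetical snippet with two matching paragraphs (not from the original question).
html_snippet = (
    '<div>'
    '<p class="something">This class has some <strong>text</strong>.</p>'
    '<p class="something">And a few <a href="http://www.example.com">links</a> in it.</p>'
    '</div>'
)
tree = html.fromstring(html_snippet)


def joined_text(element):
    """Concatenate every text node under element into one string (hypothetical helper)."""
    return "".join(element.xpath(".//text()"))


# One joined string per matching paragraph.
texts = [joined_text(p) for p in tree.xpath("//p[@class='something']")]

print(texts)
```

The relative path .//text() keeps each query scoped to its own paragraph.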