Extract text with lxml.html
I have an HTML file:
<html>
<p>somestr
<sup>1</sup>
anotherstr
</p>
</html>
I would like to extract the text as:
somestr 1 anotherstr
but I can't figure out how to do it. I have written a to_sup() function that converts numeric strings to superscript, so the closest I get is something like:
for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,
but ElementStringResult doesn't seem to have a method to get the tag name, so I am a bit lost. Any ideas how to solve it?
First solution (concatenates the text with no separator; see also "python [lxml] - cleaning out html tags"):
import lxml.html
document = lxml.html.document_fromstring(html_string)
# internally does: etree.XPath("string()")(document)
print(document.text_content())
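As a rough illustration of the first solution, here is a minimal sketch that runs text_content() on the HTML from the question. The html_string value and the whitespace-collapsing step are assumptions added for the example (the answer above does not define html_string), and the code assumes Python 3.

```python
import lxml.html

# The HTML from the question, held in a string (assumed for this example).
html_string = """<html>
<p>somestr
<sup>1</sup>
anotherstr
</p>
</html>"""

document = lxml.html.document_fromstring(html_string)

# text_content() concatenates every text node, newlines included.
raw = document.text_content()

# Collapse runs of whitespace (including newlines) into single spaces.
normalized = " ".join(raw.split())
print(normalized)  # somestr 1 anotherstr
```

Here the only separators come from the newlines already present in the markup; if the tags were written on one line, text_content() would return the pieces run together.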
This one helped me, because it concatenates the text the way I needed:
from lxml import etree
print("\n".join(etree.XPath("//text()")(document)))
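To show the difference a separator makes, the sketch below compares text_content() with joining the results of //text(). The compact html_string, the space separator, and the stripping of empty pieces are assumptions chosen for the example; Python 3 is assumed.

```python
import lxml.html
from lxml import etree

# Same content as the question, but without whitespace between the tags.
html_string = "<html><p>somestr<sup>1</sup>anotherstr</p></html>"
document = lxml.html.document_fromstring(html_string)

# No separator: the text nodes are simply run together.
print(document.text_content())  # somestr1anotherstr

# //text() returns each text node separately, in document order.
find_text = etree.XPath("//text()")
pieces = [t.strip() for t in find_text(document)]
pieces = [t for t in pieces if t]  # drop whitespace-only nodes

joined = " ".join(pieces)
print(joined)  # somestr 1 anotherstr
```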
Just don't call text() on the sup nodes in the XPath.
for x in doc.xpath("//p/text()|//sup"):
    try:
        print(to_sup(x.text))
    except AttributeError:
        print(x)
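An alternative sketch, in case it helps: the string results that lxml returns from XPath are "smart strings" that carry a getparent() method plus is_text and is_tail flags, so the tag can be recovered from the text node itself. The to_sup() below is a hypothetical stand-in for the asker's function (the original was not posted), and the code assumes Python 3.

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(): maps ASCII digits
# to their Unicode superscript counterparts.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(text):
    return text.translate(SUPERSCRIPTS)


html_string = "<html><p>somestr\n<sup>1</sup>\nanotherstr\n</p></html>"
doc = lxml.html.document_fromstring(html_string)

parts = []
for node in doc.xpath("//p//text()"):
    parent = node.getparent()
    # A tail string also reports the preceding element as its parent,
    # so check is_text to make sure the string sits inside <sup>.
    if node.is_text and parent.tag == "sup":
        parts.append(to_sup(node.strip()))
    else:
        parts.append(node.strip())

result = "".join(p for p in parts if p)
print(result)  # somestr¹anotherstr
```

The is_text check matters because "anotherstr" is the tail of the sup element: getparent() returns sup for it as well, even though the text sits outside the tag.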