Extracting text using lxml.html

Question

I have an HTML file:

<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>

I would like to extract the text as:

somestr ¹ anotherstr

but I can't figure out how to do it. I have written a to_sup() function that converts numeric strings to superscript, so the closest I get is something like:

for i in doc.xpath('.//p/text()|.//sup/text()'):
    if i.tag == 'sup':
        print to_sup(i),
    else:
        print i,

but ElementStringResult doesn't seem to have a method to get the tag name, so I am a bit lost. Any ideas how to solve it?

Answer 1

First solution (concatenates text with no separator; see also "python [lxml] - cleaning out html tags"):

   import lxml.html
   # html_string holds the HTML source shown in the question
   document = lxml.html.document_fromstring(html_string)
   # internally does: etree.XPath("string()")(document)
   print(document.text_content())

This one helped me, since it concatenates the text the way I needed:

   from lxml import etree
   # one text node per line; document is the parsed tree from above
   print("\n".join(etree.XPath("//text()")(document)))
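Both approaches keep the whitespace from the source markup, so the output contains the original line breaks and indentation. A minimal sketch of one way to normalize it, using the HTML from the question (the names `flat` and `pieces` are illustrative only and are not part of the original answer):

```python
import lxml.html

html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

document = lxml.html.document_fromstring(html_string)

# text_content() concatenates every text node, whitespace included.
raw = document.text_content()

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with a single space collapses newlines and indentation.
flat = " ".join(raw.split())
print(flat)

# //text() returns each text node separately; skip whitespace-only nodes.
pieces = [t.strip() for t in document.xpath("//text()") if t.strip()]
print(pieces)
```

Note that neither variant tells you which piece came from the sup element, so this alone does not produce the superscript the question asks for.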

Answer 2

Just don't call text() on the sup nodes in the XPath.

for x in doc.xpath("//p/text()|//sup"):
    try:
        print(to_sup(x.text))
    except AttributeError:
        print(x)
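For a version that runs end to end, here is a rough sketch of the same idea. The question does not show to_sup(), so the one below is a hypothetical stand-in that maps ASCII digits to Unicode superscript digits; `extract` is likewise just an illustrative name. It checks the node type explicitly instead of catching AttributeError, and strips the whitespace carried over from the markup:

```python
import lxml.html

# Hypothetical stand-in for the asker's to_sup(), which is not shown
# in the question: maps ASCII digits to Unicode superscript digits.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_sup(s):
    return s.translate(SUPERSCRIPTS)


def extract(html_string):
    doc = lxml.html.document_fromstring(html_string)
    parts = []
    # The union yields the text nodes of <p> and the <sup> elements
    # together, in document order.
    for node in doc.xpath("//p/text()|//sup"):
        if isinstance(node, str):
            # Text results are str subclasses.
            text = node.strip()
        else:
            # <sup> comes back as an element; convert its text.
            text = to_sup((node.text or "").strip())
        if text:
            parts.append(text)
    return " ".join(parts)


html_string = """<html>
    <p>somestr
        <sup>1</sup>
       anotherstr
    </p>
</html>"""

print(extract(html_string))
```

If you prefer to keep text() on both sides of the union as in the question, the string results lxml returns from XPath also provide a getparent() method, so the parent's tag can be checked that way.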

2 answers

Answer 1: 8, 2014-05-29 08:48:10

Answer 2: 3, accepted, 2012-12-17 10:43:27