
Python + lxml: how to find the namespace of a tag?

I am processing some HTML files with Python + lxml. Some of them have been edited with MS Word, and we have <p> tags written as <o:p>&nbsp;</o:p> for instance. IE and Firefox do not interpret these MS tags as real <p> tags and do not display line breaks before and after the <o:p> tags, and that is how the original editors formatted the files, e.g. no spaces around the nbsp's.

lxml on the other hand is tidy, and after processing the HTML files, we see that all the <o:p> tags have been changed to proper <p> tags. Unfortunately, after this tidying up both browsers now display line breaks around all nbsp's, which breaks the original formatting.

So, my idea was to browse through all those <o:p> tags and either remove them or add their .text attribute to the parent's .text attribute, i.e. remove the <o:p> tag markers.

from io import StringIO
import lxml.html

s = '<p>somepara</p> <o:p>msoffice_para</o:p>'

parser = lxml.html.HTMLParser()
html = lxml.html.parse(StringIO(s), parser)

for t in html.xpath("//p"):
    print("tag: " + t.tag + ",  text: '" + t.text + "'")

The result is:

tag: p,  text: 'somepara'
tag: p,  text: 'msoffice_para'

So, lxml removes the namespace prefix from the tag name. Is there a way to know which <p> tag came from which namespace, so that I only remove the ones that were written as <o:p>?

Thanks.

From the HTML specs: "The HTML syntax does not support namespace declarations". So I think lxml.html.HTMLParser removes/ignores the namespace.

However, BeautifulSoup parses HTML differently, so I thought it might be worth a shot. If you also have BeautifulSoup installed, you can use the BeautifulSoup parser with lxml like this:

import lxml.html.soupparser as soupparser

s = '<p>somepara</p> <o:p>msoffice_para</o:p>'
html = soupparser.fromstring(s)

BeautifulSoup does not remove the namespace, but neither does it recognize the namespace as such. Instead, it is just part of the name of the tag.

That is to say,

html.xpath('//o:p',namespaces={'o':'foo'})

does not work. But this workaround/hack

for t in html.xpath('//*[name()="o:p"]'):
    print("tag: " + t.tag + ",  text: '" + t.text + "'")

yields

tag: o:p,  text: 'msoffice_para'

If the HTML is actually well-formed XML, you could use etree.XMLParser instead. Otherwise, try unutbu's answer.
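To illustrate the well-formed case: if the document declares the Office namespace (the URI below is Microsoft's usual urn:schemas-microsoft-com:office:office, assumed here for the sketch), etree keeps <o:p> distinct from plain <p>, and etree.strip_tags() can then remove just the Office paragraphs while keeping their text:

```python
from lxml import etree

# Assumed namespace URI for Word's o: prefix.
OFFICE_NS = 'urn:schemas-microsoft-com:office:office'
s = ('<html xmlns:o="%s"><body>'
     '<p>somepara</p> <o:p>msoffice_para</o:p>'
     '</body></html>' % OFFICE_NS)

root = etree.fromstring(s, etree.XMLParser())
for t in root.xpath('//o:p', namespaces={'o': OFFICE_NS}):
    print(t.tag, repr(t.text))   # fully qualified tag name, then the text

# Remove only the namespaced paragraphs, splicing their text into the parent:
etree.strip_tags(root, '{%s}p' % OFFICE_NS)
print(etree.tostring(root, encoding='unicode'))
```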


 