lxml iterparse in python can't handle namespaces
from lxml import etree
import StringIO
data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>')
docs = etree.iterparse(data,tag='a')
a,b = docs.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:95348)
File "iterparse.pxi", line 534, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:95938)
StopIteration
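This works fine until I add a namespace to the root node. Any ideas on a workaround, or on the correct way to do this? The files are very large, so I need an event-driven approach.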
When a namespace is attached, the tag is not a but {http://some.random.schema}a. Try this (Python 3):
from lxml import etree
from io import BytesIO
xml = '''\
<root xmlns="http://some.random.schema">
<a>One</a>
<a>Two</a>
<a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print(f'{event}: {elem}')
Or, in Python 2:
from lxml import etree
from StringIO import StringIO
xml = '''\
<root xmlns="http://some.random.schema">
<a>One</a>
<a>Two</a>
<a>Three</a>
</root>'''
data = StringIO(xml)
docs = etree.iterparse(data, tag='{http://some.random.schema}a')
for event, elem in docs:
    print event, elem
This prints something like:
end: <Element {http://some.random.schema}a at 0x10941e730>
end: <Element {http://some.random.schema}a at 0x10941e8c0>
end: <Element {http://some.random.schema}a at 0x10941e960>
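Since the question mentions very large files, it may also help to release each element once it has been handled; otherwise the tree still grows as parsing proceeds. The following is a minimal sketch of that common idiom, not something taken from the original answer. The helper name iter_texts is hypothetical, and the sample document is the one from the question.

```python
from io import BytesIO
from lxml import etree

NS = "http://some.random.schema"  # namespace URI from the example above


def iter_texts(source, tag):
    """Hypothetical helper: yield each matching element's text, then free it."""
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem.text
        # Drop the element's own content once it has been processed.
        elem.clear()
        # Also drop already-processed siblings still attached to the parent.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


xml = b'<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three</a></root>'
texts = list(iter_texts(BytesIO(xml), "{%s}a" % NS))
print(texts)
```

Whether the extra cleanup pays off depends on the document's shape; for a flat list of many small elements it usually keeps memory use roughly constant.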
As @mihail-shcheglov pointed out, the wildcard * can also be used; it matches any namespace, or no namespace at all:
from lxml import etree
from io import BytesIO
xml = '''\
<root xmlns="http://some.random.schema">
<a>One</a>
<a>Two</a>
<a>Three</a>
</root>'''
data = BytesIO(xml.encode())
docs = etree.iterparse(data, tag='{*}a')
for event, elem in docs:
    print(f'{event}: {elem}')
For more information, see the lxml.etree documentation.
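When matching with the wildcard, the namespace that was actually found can be recovered from each element. A small sketch, assuming lxml's etree.QName helper is used to split the qualified tag; the sample document here is a variation of the one above with an extra b element so that the filter is visible.

```python
from io import BytesIO
from lxml import etree

xml = b'<root xmlns="http://some.random.schema"><a>One</a><b>Two</b><a>Three</a></root>'

seen = []
for _event, elem in etree.iterparse(BytesIO(xml), tag="{*}a"):
    qname = etree.QName(elem)  # splits "{uri}local" into its two parts
    seen.append((qname.namespace, qname.localname, elem.text))

print(seen)
```

Only the two a elements are collected; the b element is skipped by the tag filter.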
Why not use a regular expression?
In the measurement below, lxml is slower than a regular expression.
from time import clock
import StringIO
from lxml import etree
times1 = []
for i in xrange(1000):
    data= StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    docs = etree.iterparse(data,tag='a')
    tf = clock()
    times1.append(tf-te)
print min(times1)
print [etree.tostring(y) for x,y in docs]
import re
regx = re.compile('<a>[\s\S]*?</a>')
times2 = []
for i in xrange(1000):
    data= StringIO.StringIO('<root ><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
    te = clock()
    li = regx.findall(data.read())
    tf = clock()
    times2.append(tf-te)
print min(times2)
print li
Result
0.000150298431784
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
2.40253998762e-05
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
0.000150298431784 / 2.40253998762e-05 is about 6.26
In this measurement, lxml is about 6.26 times slower than the regular expression.
If the namespace is not a concern:
import StringIO
import re
regx = re.compile('<a>[\s\S]*?</a>')
data= StringIO.StringIO('<root xmlns="http://some.random.schema"><a>One</a><a>Two</a><a>Three\nlittle pigs</a><b>Four</b><a>another</a></root>')
print regx.findall(data.read())
Result
['<a>One</a>', '<a>Two</a>', '<a>Three\nlittle pigs</a>', '<a>another</a>']
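One caveat worth checking before relying on this: the pattern matches the literal text <a>, so it depends on how the document happens to be serialized. The sketch below, with made-up sample documents, shows the same pattern against a default namespace, a prefixed namespace, and an element carrying an attribute.

```python
import re

pattern = re.compile(r'<a>[\s\S]*?</a>')

# Hypothetical sample documents, all describing the same kind of element.
samples = {
    'plain': '<root xmlns="http://some.random.schema"><a>One</a></root>',
    'prefixed': '<r:root xmlns:r="http://some.random.schema"><r:a>One</r:a></r:root>',
    'with_attr': '<root xmlns="http://some.random.schema"><a id="1">One</a></root>',
}

results = {name: pattern.findall(doc) for name, doc in samples.items()}
print(results)
```

Only the first form is matched; the prefixed and attribute-carrying variants yield empty lists, whereas a namespace-aware tag such as {http://some.random.schema}a does not depend on the prefix or attributes.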