Python XML Pull Parser
I am trying to parse an XML file using Python. Due to the size of the XML, I want to use a pull parser. I found this one.
My code starts with
from xml.dom import pulldom
doc = pulldom.parse("myfile.xml")
for event, node in doc:
    pass  # code here...
I am using
if (node.localName == "b"):
to get the XML tag name, and it works fine.
What I can't work out is how to get the text from between the tags. Using node.nodeValue returns None. I can use node.toxml() to get the full XML for the node, but I only want the text between the tags. Is there a way to do this other than using a regex replace to strip the tags out of node.toxml()?
You get two nodes with local name "b" for every tag with text: a START_ELEMENT and an END_ELEMENT. Normally you should receive something like this:
START_ELEMENT
CHARACTERS
END_ELEMENT
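To see this sequence for yourself, here is a minimal sketch that records every event pulldom emits for a small document. The helper name trace_events and the inline sample string are illustrative assumptions, not part of the original code. The text nodes are the only ones that carry the content in .data.

```python
from xml.dom.pulldom import CHARACTERS, parseString

def trace_events(xml_text):
    """Return (event, detail) pairs for every event pulldom emits."""
    trace = []
    for event, node in parseString(xml_text):
        # Text nodes hold their content in .data; other nodes are
        # identified here by their nodeName.
        detail = node.data if event == CHARACTERS else node.nodeName
        trace.append((event, detail))
    return trace

for event, detail in trace_events("<a><b>c1</b></a>"):
    print(event, repr(detail))
```

The element node delivered with START_ELEMENT has no children yet at that point, which is why node.nodeValue and similar lookups on it come back empty.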
So you are looking for the characters that follow a matching start element. You may want to try something like this:
from xml.dom.pulldom import CHARACTERS, START_ELEMENT, parse

doc = parse("myfile.xml")
text_expected = False  # True right after a <b> start tag
for event, node in doc:
    # print(event, node)  # uncomment to trace every event
    if text_expected:
        text_expected = False
        if event != CHARACTERS:
            # strange .. there should be some text here
            continue
        print(node.data)
    else:
        text_expected = (event == START_ELEMENT) and (node.localName == "b")
With this myfile.xml:
<a>
<b>c1</b>
<b>c2</b>
</a>
I get the output:
c1
c2
Note that you might need to strip() each string, and you must ignore all the other CHARACTERS events. Every line break and run of whitespace between two elements generates a CHARACTERS event.
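If tracking a flag feels fragile, an alternative is the expandNode() method of the pulldom event stream, which reads ahead to the matching end tag and attaches the children to the node. The following is a minimal sketch, not the answer's original method. The helper name element_texts and the inline sample string are assumptions for illustration, and it only collects text that sits directly inside the element, not text in nested child elements.

```python
from xml.dom.pulldom import START_ELEMENT, parseString

def element_texts(xml_text, tag):
    """Yield the stripped text content of every <tag> element."""
    doc = parseString(xml_text)
    for event, node in doc:
        if event == START_ELEMENT and node.localName == tag:
            # expandNode() consumes events up to the matching end tag
            # and appends the resulting children to this node.
            doc.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            yield text.strip()

sample = """<a>
<b>c1</b>
<b>c2</b>
</a>"""
print(list(element_texts(sample, "b")))
```

For the sample above this should print ['c1', 'c2']. Joining all direct text children also covers the case where the parser delivers one run of text in several pieces, and the whitespace between elements never reaches the result because it is not inside a "b" element. For a file on disk, parse("myfile.xml") can be used in place of parseString().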