How to obtain a list of strings from an XML file using Python's xml.dom.minidom?
All -
I am trying to parse the following, very simple XML document structure using
from xml.dom.minidom import parse
The XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<list>
<file name="..." url="...">
<words>
word_1
word_2
...
The problem I am having is that the XML contains a list of words that I would like to access as a list of strings ... and I simply can't seem to get it right. Here is what I have in terms of code so far:
import sys
from xml.dom.minidom import parse
for file in sys.argv[1:]:
    dom = parse( file )
    title = dom.getElementsByTagName( 'job_ad' )[0].getAttribute( 'title' )
    # This works
    words = dom.getElementsByTagName( 'unigrams' )[0].childNodes[0]
    # This is NOT a list of strings ...
I would like to iterate over the data structure 'words' in this code. I know there are much more powerful XML modules available ... but for now I would like to solve this with the module shown.
Any help with this would be much appreciated.
Thanks in advance and kind regards -
Pat
I assume the words are listed under the words node as plain text. In that case you just need to grab the text from the words node and split it, e.g.:
s="""<?xml version="1.0" encoding="utf-8"?>
<list>
<file name="..." url="...">
<words>
word_1
word_2
</words>
</file>
</list>"""
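from xml.dom.minidom import parseString
dom = parseString(s)
# The content of <words> is one text node; split() breaks it on any whitespace
words_text = dom.getElementsByTagName('words')[0].firstChild.nodeValue
words = words_text.split()
print(words)
output (from Python 2; Python 3 prints the same list without the u prefixes):
[u'word_1', u'word_2']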
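The same idea can be extended to a document with several file elements. What follows is only a minimal sketch under stated assumptions: the sample document, its attribute values, and the helper name words_per_file are made up for illustration, and the structure follows the list/file/words layout from the question. It joins all text children of each words node before splitting, in case the text is not held in a single node.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample document, modelled on the structure in the question.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<list>
  <file name="a" url="http://example.invalid/a">
    <words>
      word_1
      word_2
    </words>
  </file>
  <file name="b" url="http://example.invalid/b">
    <words>word_3 word_4</words>
  </file>
</list>"""


def words_per_file(dom):
    """Map each <file> element's name attribute to its list of words."""
    result = {}
    for file_node in dom.getElementsByTagName("file"):
        words = []
        for words_node in file_node.getElementsByTagName("words"):
            # Join every text child (skipping comments and elements),
            # then split the combined text on any run of whitespace.
            text = "".join(
                child.data
                for child in words_node.childNodes
                if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
            )
            words.extend(text.split())
        result[file_node.getAttribute("name")] = words
    return result


word_lists = words_per_file(parseString(SAMPLE))
for name, words in word_lists.items():
    print(name, words)
```

Keying the result by the name attribute is just one choice; a list of (name, words) pairs would work as well if names can repeat.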
If you're not married to 'xml.dom.minidom', you might want to check out lxml (http://lxml.de/)
The code would be:
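import lxml.etree
doc = lxml.etree.parse(file)  # accepts a filename or an open file object
# <words> is nested inside <file>, so search the whole tree, not just the root's children
words = doc.findtext('.//words').split()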
WHOOPS -- I see now the poster specifically requested the answer use 'xml.dom.minidom'. Sorry, we use lxml. You can disregard.
It seems that in your XML doc, multiple word_X words are grouped inside a single XML element. Since they are not separate XML elements, you cannot query for them individually. Instead, you can use a regular expression to split the text of that single element.
For example: assume you have wordListAsSingleString (if you can query that), which contains:
word_1
word_2
Then re.split('\\s+', wordListAsSingleString) will give you the list of words.
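One detail worth checking with the regular-expression approach: if the element text starts or ends with whitespace (as it does when each word sits on its own line), re.split returns empty strings at the ends of the list. A small sketch, assuming a made-up sample string in place of the real element text:

```python
import re

# Hypothetical sample: text content of the element, including the
# newlines that surround the words in the file.
word_list_as_single_string = "\nword_1\nword_2\n"

# Splitting the raw text leaves an empty string at each end.
naive = re.split(r"\s+", word_list_as_single_string)

# Stripping first removes the surrounding whitespace, so no empties remain.
stripped = re.split(r"\s+", word_list_as_single_string.strip())

# str.split() with no arguments does both steps in one call.
plain = word_list_as_single_string.split()

print(naive)
print(stripped)
print(plain)
```

For plain whitespace-separated words, str.split() is usually enough; re.split becomes useful when the separator is more than whitespace.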
If you want the words as a string, add .data at the end:
words = dom.getElementsByTagName( 'unigrams' )[0].childNodes[0].data
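Note that .data gives one string holding all the words, so it still needs to be split before it can be iterated word by word. A minimal sketch, assuming a made-up document that uses the job_ad and unigrams names from the question's code (the real file layout is not shown in full, so the sample is hypothetical):

```python
from xml.dom.minidom import parseString

# Hypothetical document using the tag names from the question's code.
SAMPLE = (
    '<job_ad title="Example title">'
    "<unigrams>\nword_1\nword_2\n</unigrams>"
    "</job_ad>"
)

dom = parseString(SAMPLE)
title = dom.getElementsByTagName("job_ad")[0].getAttribute("title")

# childNodes[0] is a Text node, not a string ...
node = dom.getElementsByTagName("unigrams")[0].childNodes[0]
# ... its .data attribute holds the text as a single string ...
text = node.data
# ... and split() turns that string into a list of words.
words = text.split()

for word in words:
    print(title, word)
```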