How do I get a list of strings from an XML file using Python's xml.dom.minidom?

Question

All -

I am trying to parse the following very simple XML document structure using

from xml.dom.minidom import parse

The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
    <list>
       <file name="..." url="...">
       <words>
           word_1
           word_2
           ...

The problem I am having is that the XML contains a list of words that I would like to access as a list of strings, and I simply can't seem to get it right. Here is the code I have so far:

import sys
from xml.dom.minidom import parse

for file in sys.argv[1:]:

    dom = parse( file )

    title = dom.getElementsByTagName( 'job_ad' )[0].getAttribute( 'title' )
    # This works 

    words = dom.getElementsByTagName( 'unigrams' )[0].childNodes[0]

    # This is NOT a list of strings ...

I would like to iterate over the data structure 'words' in this code. I know there are much more powerful XML modules available, but for now I would like to solve this with the module shown.

Any help with this would be much appreciated.

Thanks in advance and kind regards -

Pat

Answer 1

I assume the words are listed under the words node as plain text. In that case you just need to grab the text from the words node and split it, e.g.

s="""<?xml version="1.0" encoding="utf-8"?>
    <list>
       <file name="..." url="...">
       <words>
           word_1
           word_2
        </words>
       </file>
    </list>"""

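from xml.dom.minidom import parseString

dom = parseString(s)
# The first child of <words> is the text node holding all the words.
words_text = dom.getElementsByTagName('words')[0].firstChild.nodeValue
# split() with no arguments splits on any run of whitespace.
words = words_text.split()
print(words)

output:

['word_1', 'word_2']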
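
The approach above reads only the first child of the words node. As a minimal sketch, assuming an element's text may be split across several text nodes (for example by a comment) and that the file may contain more than one words element, the text nodes can be joined before splitting. The sample document, the comment inside it, and the helper name element_words are made up for illustration and are not part of the original question.

```python
from xml.dom.minidom import parseString

# Hypothetical sample modelled on the question's structure; the comment
# inside <words> is added here only to force two separate text nodes.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<list>
   <file name="..." url="...">
      <words>
          word_1
          <!-- a comment splits the text into two nodes -->
          word_2
      </words>
   </file>
</list>"""


def element_words(element):
    """Return the whitespace-separated words from all direct text children."""
    text = "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )
    return text.split()


dom = parseString(SAMPLE)
# Collect the words of every <words> element in document order.
words = [w for el in dom.getElementsByTagName("words") for w in element_words(el)]
print(words)
```

With this sample, firstChild.nodeValue alone would stop at the comment, whereas joining the text nodes keeps every word.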

Answer 2

If you're not married to 'xml.dom.minidom', you might want to check out lxml (http://lxml.de/)

The code would be:

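import lxml.etree
doc = lxml.etree.parse(file)  # parse() accepts a filename directly
# './/words' searches the whole tree; a bare 'words' only matches children of the root.
words = doc.findtext('.//words').split()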

WHOOPS -- I see now that the poster specifically requested an answer using 'xml.dom.minidom'. Sorry, we use lxml. You can disregard this.

Answer 3

It seems that in your XML document, multiple word_X words are grouped inside a single XML element. Since they are not separate XML elements, you cannot query for them individually. Instead, you can use a regular expression to split the element's text.
For example, assume you have wordListAsSingleString (if you can query it), which contains:

       word_1
       word_2

re.split('\\s+', wordListAsSingleString.strip()) will give you the list of words. The strip() call keeps the leading and trailing whitespace from producing empty strings at the ends of the list.
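
To make the whitespace behaviour concrete, here is a small sketch comparing re.split with and without stripping, and with plain str.split(). The string value is a hypothetical stand-in for the element's text content, padded the way it would be inside the XML.

```python
import re

# Hypothetical text content, padded with whitespace as it would be
# inside the XML element.
wordListAsSingleString = "\n       word_1\n       word_2\n"

# Splitting the raw string leaves empty strings at both ends.
naive = re.split(r"\s+", wordListAsSingleString)
# Stripping first removes them.
cleaned = re.split(r"\s+", wordListAsSingleString.strip())
# str.split() with no arguments gives the same result without a regex.
simple = wordListAsSingleString.split()

print(naive)    # ['', 'word_1', 'word_2', '']
print(cleaned)  # ['word_1', 'word_2']
print(simple)   # ['word_1', 'word_2']
```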

Answer 4

If you want the words as a string, add .data at the end:

words = dom.getElementsByTagName( 'unigrams' )[0].childNodes[0].data
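
Note that .data returns one string, not a list. A minimal sketch of going from that string to something you can iterate over, assuming a document that uses the tag names from the question's code (the sample XML and its title value are invented for illustration):

```python
from xml.dom.minidom import parseString

# Hypothetical document using the tag names from the question's code.
SAMPLE = """<job_ad title="Example title">
    <unigrams>
        word_1
        word_2
    </unigrams>
</job_ad>"""

dom = parseString(SAMPLE)
# childNodes[0] is a Text node; its .data attribute is the raw string.
text_node = dom.getElementsByTagName("unigrams")[0].childNodes[0]
# Splitting on whitespace turns the string into a list of words.
words = text_node.data.split()

for word in words:
    print(word)
```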

4 answers

Answer 1
1 2012-03-30 20:55:23

Answer 2
0 2012-03-30 20:35:07

Answer 3
0 2012-03-30 20:45:32

Answer 4
0 2012-03-30 20:53:12
