
How to obtain a list of strings from an XML file using Python's xml.dom.minidom?

All -

I am trying to parse the following, very simple XML document structure using:

from xml.dom.minidom import parse

The XML looks like this:

<?xml version="1.0" encoding="utf-8"?>
    <list>
       <file name="..." url="...">
       <words>
           word_1
           word_2
           ...

The problem I am having is that the XML contains a list of words that I would like to access as a list of strings, and I simply can't seem to get it right. Here is what I have in terms of code so far:

import sys
from xml.dom.minidom import parse

for file in sys.argv[1:]:

    dom = parse( file )

    title = dom.getElementsByTagName( 'job_ad' )[0].getAttribute( 'title' )
    # This works 

    words = dom.getElementsByTagName( 'unigrams' )[0].childNodes[0]

    # This is NOT a list of strings ... 

I would like to iterate over the data structure 'words' in this code. I know there are much more powerful XML modules available, but for now I would like to solve this with the module shown.

Any help with this would be much appreciated.

Thanks in advance and kind regards -

Pat

I assume the words are listed under the words node as plain text. In that case you just need to grab the text from the words node and split it, e.g.

s="""<?xml version="1.0" encoding="utf-8"?>
    <list>
       <file name="..." url="...">
       <words>
           word_1
           word_2
        </words>
       </file>
    </list>"""

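from xml.dom.minidom import parseString

dom = parseString(s)
# The first child of <words> is the text node holding all the words.
words_text = dom.getElementsByTagName('words')[0].firstChild.nodeValue
# split() with no arguments splits on any run of whitespace.
words = words_text.split()
print(words)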

output (as printed by Python 2; Python 3 prints the same list without the u prefixes):

[u'word_1', u'word_2']
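
The same idea can be extended to documents with several file entries. The following is only a sketch under assumptions: the sample document, its attribute values, and the helper name words_by_file are made up for illustration. It joins every text child of each words element before splitting, so a comment or CDATA section in the middle of the list does not cut it short.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document; the file names, URLs and words are made up.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<list>
  <file name="a.txt" url="http://example.com/a">
    <words>
      alpha
      beta <!-- a comment splits the text into two nodes -->
      gamma
    </words>
  </file>
  <file name="b.txt" url="http://example.com/b">
    <words>delta</words>
  </file>
</list>"""


def words_by_file(dom):
    """Map each <file> name attribute to the words found in its <words> children."""
    result = {}
    for file_node in dom.getElementsByTagName('file'):
        words = []
        for words_node in file_node.getElementsByTagName('words'):
            # Join all text and CDATA children; comment nodes are skipped.
            text = ''.join(
                child.data
                for child in words_node.childNodes
                if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
            )
            words.extend(text.split())
        result[file_node.getAttribute('name')] = words
    return result


result = words_by_file(parseString(SAMPLE))
print(result)
```

Using firstChild alone would return only alpha and beta for the first file here, because the comment ends the first text node.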

If you're not married to 'xml.dom.minidom', you might want to check out lxml (http://lxml.de/).

The code would be:

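import lxml.etree
# parse() accepts a file name directly; 'file' is the path to the XML document
doc = lxml.etree.parse(file)
# './/words' searches the whole tree; the result is one string, so split it
words = doc.findtext('.//words').split()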

Whoops, I see now that the poster specifically requested an answer using 'xml.dom.minidom'. Sorry, we use lxml. You can disregard this.

It seems that in your XML document, multiple word_X words are grouped inside a single XML element. Since they are not separate XML elements, you cannot query them individually. Instead, you can use a regular expression to split the text of that single element.
For example, assume you have wordListAsSingleString (if you can query that), which contains:

       word_1
       word_2

re.split('\\s+', wordListAsSingleString.strip()) will give you the list of words. The strip() matters: without it, the leading and trailing whitespace produces an empty string at each end of the list.
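
As a small sketch of how the regular expression behaves, assuming wordListAsSingleString holds the raw text of the element including its surrounding whitespace (the value below is a stand-in, not data from the question):

```python
import re

# Hypothetical stand-in for the text content of the <words> element.
wordListAsSingleString = """
       word_1
       word_2
"""

# Splitting the raw text leaves empty strings at both ends.
naive = re.split(r'\s+', wordListAsSingleString)

# Stripping first removes them.
cleaned = re.split(r'\s+', wordListAsSingleString.strip())

# str.split() with no arguments gives the same result without a regex.
via_split = wordListAsSingleString.split()

print(naive)
print(cleaned)
print(via_split)
```

For plain whitespace-separated words, str.split() is probably the simpler choice; the regular expression only pays off if the separator pattern gets more complicated.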

If you want the words as a string, add .data at the end:

words = dom.getElementsByTagName( 'unigrams' )[0].childNodes[0].data
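
Note that .data yields one string, not a list. A minimal sketch of the remaining step, assuming a document shaped like the one the question's code expects (the job_ad and unigrams names come from that code; the sample content and title are invented):

```python
from xml.dom.minidom import parseString

# Hypothetical document using the element names from the question's code.
SAMPLE = """<job_ad title="Example title">
  <unigrams>
    word_1
    word_2
  </unigrams>
</job_ad>"""

dom = parseString(SAMPLE)
title = dom.getElementsByTagName('job_ad')[0].getAttribute('title')

# childNodes[0] is a Text node; its .data attribute is the raw string.
text_node = dom.getElementsByTagName('unigrams')[0].childNodes[0]

# Splitting on whitespace turns that string into a list of words.
words = text_node.data.split()

for word in words:
    print(word)
```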
