
Using minidom to parse XML

Hi, I'm having trouble understanding Python's minidom module.

I have XML that looks like this:

<Show>
<name>Dexter</name>
<totalseasons>7</totalseasons>
<Episodelist>
<Season no="1">
<episode>
<epnum>1</epnum>
<seasonnum>01</seasonnum>
<prodnum>101</prodnum>
<airdate>2006-10-01</airdate>
<link>http://www.tvrage.com/Dexter/episodes/408409</link>
<title>Dexter</title>
</episode>
<episode>
<epnum>2</epnum>
<seasonnum>02</seasonnum>
<prodnum>102</prodnum>
<airdate>2006-10-08</airdate>
<link>http://www.tvrage.com/Dexter/episodes/408410</link>
<title>Crocodile</title>
</episode>
<episode>
<epnum>3</epnum>
<seasonnum>03</seasonnum>
<prodnum>103</prodnum>
<airdate>2006-10-15</airdate>
<link>http://www.tvrage.com/Dexter/episodes/408411</link>
<title>Popping Cherry</title>
</episode>

Easier to read: http://services.tvrage.com/feeds/episode_list.php?sid=7926

And this is my Python code trying to read from it:

xml = minidom.parse(urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=7926"))
for episode in xml.getElementsByTagName('episode'):
    for node in episode.attributes['title']:
        print node.data

I can't get the actual episode data out, and I want all the data from each episode. I've tried different variants but I can't get it to work. Mostly I get a <DOM Element: asdasd> back. I only care about the data inside each episode.

Thanks for the help.

title is not an attribute, it's a tag. An attribute is something like src in <img src="foo.jpg" />.

>>> from xml.dom.minidom import parseString
>>> # s holds the XML document as a string
>>> parsed = parseString(s)
>>> titles = [n.firstChild.data for n in parsed.getElementsByTagName('title')]
>>> titles
[u'Dexter', u'Crocodile', u'Popping Cherry']

You can extend the above to fetch other details. lxml is better suited for this, though. As you can see from the snippet above, minidom is not that friendly.
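As a rough sketch of what "extending the above" could look like, the following collects every child tag of each episode into a dict. It is written in Python 3 syntax and runs on a shortened, well-formed sample rather than the live feed. SAMPLE and the helper name episode_to_dict are made up for illustration, and the code assumes each child of episode holds plain text only.

```python
from xml.dom import minidom

# Shortened, well-formed sample in the shape of the question's feed (an assumption).
SAMPLE = """<Show>
<name>Dexter</name>
<Episodelist>
<Season no="1">
<episode><epnum>1</epnum><airdate>2006-10-01</airdate><title>Dexter</title></episode>
<episode><epnum>2</epnum><airdate>2006-10-08</airdate><title>Crocodile</title></episode>
</Season>
</Episodelist>
</Show>"""


def episode_to_dict(episode):
    """Map each child element's tag name to its text (hypothetical helper)."""
    fields = {}
    for child in episode.childNodes:
        # Skip whitespace text nodes between tags; keep only real elements.
        if child.nodeType != child.ELEMENT_NODE:
            continue
        # Join the element's direct text nodes; an empty element gives ''.
        fields[child.tagName] = ''.join(
            n.data for n in child.childNodes if n.nodeType == n.TEXT_NODE
        )
    return fields


parsed = minidom.parseString(SAMPLE)
episodes = [episode_to_dict(e) for e in parsed.getElementsByTagName('episode')]
for fields in episodes:
    print(fields)
```

Filtering on ELEMENT_NODE matters because minidom keeps the whitespace between tags as text nodes, which is a common reason for getting unexpected nodes back.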

Each episode element has child elements, including a title element. Your code, however, is looking for attributes instead.
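To make the distinction concrete, here is a small sketch (Python 3 syntax) that reads one real attribute, the no on Season, and one child element, title, from a trimmed sample. SAMPLE is an illustrative stand-in for the feed, not the actual response.

```python
from xml.dom import minidom

# Trimmed, well-formed stand-in for the feed (an assumption).
SAMPLE = """<Season no="1">
<episode><epnum>1</epnum><title>Dexter</title></episode>
</Season>"""

doc = minidom.parseString(SAMPLE)
season = doc.getElementsByTagName('Season')[0]

# "no" sits inside the opening tag, so it is an attribute.
season_no = season.getAttribute('no')

# <title> is a nested tag, so it is a child element; its text is a child text node.
episode = season.getElementsByTagName('episode')[0]
title = episode.getElementsByTagName('title')[0].firstChild.data

# The episode element itself carries no "title" attribute.
has_title_attr = episode.hasAttribute('title')

print(season_no, title, has_title_attr)  # prints: 1 Dexter False
```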

To get text out of a minidom element, you need a helper function:

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

And then you can more easily print all the titles:

for episode in xml.getElementsByTagName('episode'):
    for title in episode.getElementsByTagName('title'):
        # getText expects a node list, so pass the element's children
        print getText(title.childNodes)

Thanks to Martijn Pieters, who tipped me off to the ElementTree API, I solved this problem.

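import xml.etree.ElementTree as ET
from urllib2 import urlopen  # Python 2; use urllib.request in Python 3

xml = ET.parse(urlopen("http://services.tvrage.com/feeds/episode_list.php?sid=7296"))
print 'xml fetched..'
for episode in xml.iter('episode'):
    print episode.find('title').text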
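For anyone who wants every field rather than just the title, the same ElementTree approach might be extended along these lines. This is only a sketch in Python 3 syntax: it parses an inline sample with ET.fromstring instead of fetching the feed, and SAMPLE and the extra 'season' key are illustrative names, not part of the feed.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample in the shape of the feed (an assumption).
SAMPLE = """<Show>
<name>Dexter</name>
<Episodelist>
<Season no="1">
<episode><epnum>1</epnum><airdate>2006-10-01</airdate><title>Dexter</title></episode>
<episode><epnum>2</epnum><airdate>2006-10-08</airdate><title>Crocodile</title></episode>
</Season>
</Episodelist>
</Show>"""

root = ET.fromstring(SAMPLE)
episodes = []
for season in root.iter('Season'):
    for episode in season.iter('episode'):
        # Iterating an Element yields its direct child elements.
        fields = {child.tag: child.text for child in episode}
        # Attributes are read with .get(); 'season' is a key added here for illustration.
        fields['season'] = season.get('no')
        episodes.append(fields)

for fields in episodes:
    print(fields)
```

Unlike minidom, ElementTree exposes an element's text directly as .text, so no helper function is needed.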

Thanks
