
Trouble parsing XML archive using Element Tree

I'm new to both Python and programming, so please bear with me. I have a number of XML files (RSS archives) and I want to extract news article URLs from them. I'm using Python 2.7.3 on Windows, and here's an example of the XML I'm looking at:

<feed xmlns:media="http://search.yahoo.com/mrss/" xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns:idx="urn:atom-extension:indexing" xmlns="http://www.w3.org/2005/Atom" idx:index="no" gr:dir="ltr">
<!-- 
Content-type: Preventing XSRF in IE.

 -->
<generator uri="http://www.google.com/reader">Google Reader</generator>
<id>
tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml
</id>
<title>The Sydney Morning Herald National Headlines</title>
<subtitle type="html">
The top National headlines from The Sydney Morning Herald. For all the news, visit http://www.smh.com.au.
</subtitle>
<gr:continuation>CJPL-LnHybcC</gr:continuation>
<link rel="self" href="http://www.google.com/reader/atom/feed/http://feeds.smh.com.au/rssheadlines/national.xml?n=1000&c=%5BC%5D"/>
<link rel="alternate" href="http://www.smh.com.au/national" type="text/html"/>
<updated>2013-06-16T07:55:56Z</updated>
<entry gr:is-read-state-locked="true" gr:crawl-timestamp-msec="1371369356359">
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
<category term="user/03956512242887934409/state/com.google/read" scheme="http://www.google.com/reader/" label="read"/>
<title type="html">Daley opts for Dugan for Origin two</title>
<published>2013-06-16T07:12:11Z</published>
<updated>2013-06-16T07:12:11Z</updated>
<link rel="alternate" href="http://rss.feedsportal.com/c/34697/f/644122/s/2d5973e2/l/0Lnews0Bsmh0N0Bau0Cbreaking0Enews0Esport0Cdaley0Eopts0Efor0Edugan0Efor0Eorigin0Etwo0E20A130A6160E2oc5k0Bhtml/story01.htm" type="text/html"/>

Specifically, I want to extract the "original-id" link:

<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>

I originally tried using BeautifulSoup for this but ran into problems, and from the research I did it looks like Element Tree is the way to go. First off, with ET I tried:

import xml.etree.ElementTree as ET
tree = ET.parse('thefile.xml')
root = tree.getroot()

#first_original_id = root[8][0]

parents_of_interest = root[8::]

for elem in parents_of_interest:
    print elem.items()[0][1]

So far as I can work out, parents_of_interest does grab the data I want (as a list of dictionaries), but the for loop only prints a bunch of "true" values, and after reading the documentation and SO it seems like this is the wrong approach.

I think this has the answer I'm looking for, but even though it's a good explanation I can't seem to apply it to my own situation. From that answer I tried:

print tree.find('//{http://www.w3.org/2005/Atom}entry}id').text

But I got this error:

__main__:1: FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely
 on the current behaviour, change it to './/{http://www.w3.org/2005/Atom}entry}id'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'

Any help on this would be appreciated. Sorry if this is a verbose question, but I thought I'd detail everything just in case.

Your XPath expression matches the first id, not the one you're looking for, and original-id is an attribute of the element, so you should write something like this:

idelem = tree.find('./{http://www.w3.org/2005/Atom}entry/{http://www.w3.org/2005/Atom}id')
if idelem is not None:
    print idelem.get('{http://www.google.com/schemas/reader/atom/}original-id')

That will find only the first matching id. If you want them all, use findall and iterate over the results.
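As a rough sketch of the findall variant, something like the following should collect every original-id. The inline SAMPLE string is a cut-down stand-in for the real file, its second entry is made up for illustration, and the helper name original_ids is hypothetical. The print call is written so that it runs on both Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace URIs in the {uri}name form that ElementTree expects
ATOM = '{http://www.w3.org/2005/Atom}'
GR = '{http://www.google.com/schemas/reader/atom/}'

# Cut-down stand-in for the real feed; the second entry is invented for illustration
SAMPLE = '''<feed xmlns:gr="http://www.google.com/schemas/reader/atom/" xmlns="http://www.w3.org/2005/Atom">
<id>tag:google.com,2005:reader/feed/http://feeds.smh.com.au/rssheadlines/national.xml</id>
<entry>
<id gr:original-id="http://news.smh.com.au/breaking-news-sport/daley-opts-for-dugan-for-origin-two-20130616-2oc5k.html">tag:google.com,2005:reader/item/dabe358abc6c18c5</id>
</entry>
<entry>
<id gr:original-id="http://example.com/second-article.html">tag:example.com,2013:second</id>
</entry>
</feed>'''


def original_ids(root):
    """Return the gr:original-id attribute of every entry/id under root."""
    urls = []
    # Only id elements that are direct children of an entry are matched,
    # so the feed-level id is skipped.
    for idelem in root.findall('./' + ATOM + 'entry/' + ATOM + 'id'):
        url = idelem.get(GR + 'original-id')
        if url is not None:  # skip ids that lack the attribute
            urls.append(url)
    return urls


root = ET.fromstring(SAMPLE)
for url in original_ids(root):
    print(url)
```

For a file on disk, passing ET.parse('thefile.xml').getroot() to the same helper should work the same way.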
