Parsing XML with Python ElementTree when the tags come out wrong

Question

I am trying to use Python to parse an XML file and pull the title, author, URL, and summary out of an XML feed. The XML we are collecting data from looks like this:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
  xmlns:grddl="http://www.w3.org/2003/g/data-view#"
  grddl:transformation="2turtle_xslt-1.0.xsl">

<title>Our Site RSS</title>
<link href="http://www.oursite.com" />
<updated>2013-08-14T20:05:08-04:00</updated>
<id>urn:uuid:c60d7202-9a58-46a6-9fca-f804s879f5ebc</id>
<rights>
    Original content available for non-commercial use under a Creative
    Commons license (Attribution-NonCommercial-NoDerivs 3.0 Unported),
    except where noted.
</rights>

<entry>
    <title>Headline #1</title>
    <author>
        <name>John Smith</name>
    </author>
    <link rel="alternate"
          href="http://www.oursite.com/our-slug/" />
    <id>1234</id>
    <updated>2013-08-13T23:45:43-04:00</updated>

    <summary type="html">
        Here is a summary of our story
    </summary>
</entry>
<entry>
    <title>Headline #2</title>
    <author>
        <name>John Smith</name>
    </author>
    <link rel="alternate"
          href="http://www.oursite.com/our-slug-2/" />
    <id>1235</id>
    <updated>2013-08-13T23:45:43-04:00</updated>

    <summary type="html">
        Here is a summary of our second story
    </summary>
</entry>
</feed>

My code is:

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

for child in root:
    print(child.tag)

When Python prints child.tag, the tag is not "entry" but "{http://www.w3.org/2005/Atom}entry". I tried using:

for entry in root.findall('entry'):

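But that does not work, because the entry tag includes the w3 URL that is declared on the root tag. The grandchildren of root also show their tags in the same form, for example "{http://www.w3.org/2005/Atom}author".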

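I cannot change the XML at its source, but how can I either modify it (changing the root element) and re-save it, or change my code so that root.findall('entry') works?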

Answer 1

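This is standard ElementTree behavior. If the tags you are searching for are declared in a namespace, you must specify that namespace when you search for them. However, you can do this: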

import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

def prepend_ns(s):
    # Qualify a bare tag name with the Atom namespace URI
    return '{http://www.w3.org/2005/Atom}' + s

for entry in root.findall(prepend_ns('entry')):
    print('Entry:')
    print('    Title: '   + entry.find(prepend_ns('title')).text)
    print('    Author: '  + entry.find(prepend_ns('author')).find(prepend_ns('name')).text)
    print('    URL: '     + entry.find(prepend_ns('link')).attrib['href'])
    print('    Summary: ' + entry.find(prepend_ns('summary')).text)
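
A possible alternative to building the qualified names by hand: ElementTree's find, findall, and findtext also accept a prefix-to-URI mapping through their namespaces argument. The following is a minimal sketch, not the answerer's code. It assumes a shortened inline sample in place of data.xml, and the prefix atom and the names SAMPLE, ATOM_NS, and read_entries are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical shortened sample standing in for data.xml
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Our Site RSS</title>
  <entry>
    <title>Headline #1</title>
    <author><name>John Smith</name></author>
    <link rel="alternate" href="http://www.oursite.com/our-slug/" />
    <summary type="html">
        Here is a summary of our story
    </summary>
  </entry>
</feed>
"""

# The prefix "atom" is arbitrary; only the URI has to match the document.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def read_entries(root):
    """Collect title, author, URL, and summary from each Atom entry."""
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entries.append({
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "author": entry.findtext("atom:author/atom:name", namespaces=ATOM_NS),
            # The URL lives in the href attribute, not in the element text
            "url": entry.find("atom:link", ATOM_NS).get("href"),
            "summary": entry.findtext(
                "atom:summary", default="", namespaces=ATOM_NS
            ).strip(),
        })
    return entries


root = ET.fromstring(SAMPLE)
entries = read_entries(root)
for item in entries:
    print(item)
```

With a real file, root would come from ET.parse('data.xml').getroot() as in the question. The XML itself stays untouched, which fits the constraint that the feed cannot be changed.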

Answer 2

Try BeautifulSoup4. It is very powerful and can parse not only XML but also HTML and more. Here is some code to get you started; I hope it helps.

from bs4 import BeautifulSoup

def main():
    xml_text = """...."""
    # The "xml" parser requires lxml to be installed
    soup = BeautifulSoup(xml_text, "xml")
    for entry in soup.find_all("entry"):
        title = entry.find("title").text.strip()
        author = entry.find("author").text.strip()
        # The URL is stored in the href attribute, not in the element text
        link = entry.find("link")["href"]
        summary = entry.find("summary").text.strip()
        print(title, author, link, summary)

if __name__ == '__main__':
    main()

Answer 1: 5 (accepted), 2013-08-15 01:19:02
Answer 2: 1, 2013-08-15 01:28:27
