Why my XML parsing code isn't working (Python)

Below is part of an XML file I'm trying to retrieve information from. My code prints the word "None" 10 times (the XML file has only 10 records). I'm not sure what the problem is.

I have copied the code at the end of this post.

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <records>
        <record>
            <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
            <ref-type name="Book">1</ref-type>
            <contributors>
                <authors>
                    <author>AIA Research Corporation</author>
                </authors>
            </contributors>
            <titles>
                <title>Regional guidelines for building passive energy conserving homes</title>
            </titles>
            <periodical/>
            <keywords/>
            <dates>
                <year>1978</year>
            </dates>
            <publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.</publisher>
            <urls/>
            <label>Energy;Green Buildings;High Performance Buildings</label>
        </record>
        <record>
            <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
            <ref-type name="Book">1</ref-type>
            <contributors>
                <authors>
                    <author>Akinci, Burcu</author>
                    <author>Ph, D</author>
                </authors>
            </contributors>
            <titles>
                <title>Computing in Civil Engineering</title>
            </titles>
            <periodical/>
            <pages>692-699</pages>
            <keywords/>
            <dates>
                <year>2007</year>
            </dates>
            <publisher>American Society of Civil Engineers</publisher>
            <isbn>9780784409374</isbn>
            <electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num>
            <urls>
                <web-urls>
                    <url>http://books.google.com/books?id=QigBgc-qgdoC</url>
                </web-urls>
            </urls>
            <label>Computing</label>
        </record>

Here is the code:

import xml.etree.ElementTree as ET

tree =ET.parse('My_Collection.xml')
root = tree.getroot()
for child in root:
    for children in child:
        print (children.text)

    print("\n")

Update: I fixed my code, but now I get a strange error message, and some of the records are missing the book title. Below are the updated code and the results.

import xml.etree.ElementTree as ET

tree =ET.parse('My_Collection.xml')
root = tree.getroot()

for child in root:
    for children in child:
        for books in children:
            print (books.text)
        print ('\n')

Here is the result:

My Collection.enl
1
None
None
None
None
None
Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.
None
Energy;Green Buildings;High Performance Buildings

My Collection.enl
1
None
None
None
692-699
None
None
American Society of Civil Engineers
9780784409374
ISBN 978-0-7844-1302-9
None
Computing


My Collection.enl
0
None
None
None
291-314
4
4
None
None
None
Computing;Design;Green Buildings


My Collection.enl
0
None
None
None
1847-1870
3
9
None
None
10.3390/rs3091847
None
Infrared;Laser scanning


My Collection.enl
0
None
None
None
Nr. 15
15
None
None
ISSN~~1435-618X
ISSN 1435-618X
None
Outdoor Thermal Comfort;Urban Desgin
Traceback (most recent call last):
  File "Mend_lib_Xml_Excel.py", line 9, in <module>
    print (books.text)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufffd' in position 679: character maps to <undefined>

C:\Users\Rania\Google Drive\Rania's Documents\EDX and Coursera\Python_Michigan\Course1>

A common issue when retrieving data from an XML file is that you're not on the node you think you are.

So confirm your assumptions. Print the node name (the tag, rather than the text) to confirm which nodes you're on.
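As a minimal sketch of that check (written for Python 3; the trimmed SAMPLE string and the describe helper are illustrative assumptions, not part of the original code), the following walks the tree and reports each node's tag next to its text, so it is obvious which level actually holds the data:

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the XML in the question (one record only).
SAMPLE = """<xml>
    <records>
        <record>
            <ref-type name="Book">1</ref-type>
            <titles>
                <title>Computing in Civil Engineering</title>
            </titles>
            <dates>
                <year>2007</year>
            </dates>
        </record>
    </records>
</xml>"""


def describe(element, depth=0):
    """Return one line per node: indented tag name plus its stripped text."""
    text = (element.text or "").strip()
    lines = ["{}{}: {!r}".format("  " * depth, element.tag, text)]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
outline = describe(root)
print("\n".join(outline))
```

In the printed outline, container nodes such as titles and dates show an empty string, while the text appears one level further down on title and year.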

If you're having issues with a particular record, simplify the problem: reduce your XML file to just that record and test again (printing the node names once more). Something in that record may be causing your code to fail: it may be malformed, or it may have a different structure or different data.

One issue you are having above is that

print (children.text)

prints None (or only whitespace) when the node is a parent with no text of its own. An example of this is the <titles> tag. It has no text, just a child node, and the child node holds the text. As such, you need to navigate to the child node to access the text in <title>.
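One way to navigate to those child nodes, sketched here under assumptions (Python 3, an inline SAMPLE string standing in for My_Collection.xml, and a second record with an empty titles element added purely for illustration), is to address each field by its path relative to the record instead of nesting loops:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the file in the question; the second record is a
# made-up example of a record that has no title.
SAMPLE = """<xml>
    <records>
        <record>
            <contributors>
                <authors>
                    <author>Akinci, Burcu</author>
                    <author>Ph, D</author>
                </authors>
            </contributors>
            <titles>
                <title>Computing in Civil Engineering</title>
            </titles>
            <dates>
                <year>2007</year>
            </dates>
        </record>
        <record>
            <titles/>
        </record>
    </records>
</xml>"""

root = ET.fromstring(SAMPLE)

books = []
for record in root.iter("record"):
    books.append({
        # findtext returns the default when the path matches nothing,
        # so a missing title becomes "" instead of None.
        "title": record.findtext("titles/title", default=""),
        "year": record.findtext("dates/year", default=""),
        # findall returns an empty list when there are no authors.
        "authors": [a.text for a in record.findall("contributors/authors/author")],
    })

for book in books:
    print(book)
```

With the real file, ET.parse('My_Collection.xml').getroot() would replace ET.fromstring(SAMPLE). Looking fields up by path also keeps records with missing elements from shifting the output, which may explain why some titles appeared to be absent.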
