Why isn't my XML parsing code working? (Python)

Question

Below is part of an XML file I'm trying to retrieve information from. My code prints the word "None" 10 times (the XML file contains only 10 records). I'm not sure what the problem is.

I have copied the code at the end of this post.

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <records>
        <record>
            <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
            <ref-type name="Book">1</ref-type>
            <contributors>
                <authors>
                    <author>AIA Research Corporation</author>
                </authors>
            </contributors>
            <titles>
                <title>Regional guidelines for building passive energy conserving homes</title>
            </titles>
            <periodical/>
            <keywords/>
            <dates>
                <year>1978</year>
            </dates>
            <publisher>Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.</publisher>
            <urls/>
            <label>Energy;Green Buildings;High Performance Buildings</label>
        </record>
        <record>
            <database name="My Collection.enl" path="My Collection.enl">My Collection.enl</database>
            <ref-type name="Book">1</ref-type>
            <contributors>
                <authors>
                    <author>Akinci, Burcu</author>
                    <author>Ph, D</author>
                </authors>
            </contributors>
            <titles>
                <title>Computing in Civil Engineering</title>
            </titles>
            <periodical/>
            <pages>692-699</pages>
            <keywords/>
            <dates>
                <year>2007</year>
            </dates>
            <publisher>American Society of Civil Engineers</publisher>
            <isbn>9780784409374</isbn>
            <electronic-resource-num>ISBN 978-0-7844-1302-9</electronic-resource-num>
            <urls>
                <web-urls>
                    <url>http://books.google.com/books?id=QigBgc-qgdoC</url>
                </web-urls>
            </urls>
            <label>Computing</label>
        </record>

Here is the code:

import xml.etree.ElementTree as ET

tree = ET.parse('My_Collection.xml')
root = tree.getroot()
for child in root:
    for children in child:
        print (children.text)

    print("\n")

Update: I changed my code, but now I get a strange error message, and some of the records are missing the book title. The updated code and its output are below.

import xml.etree.ElementTree as ET

tree = ET.parse('My_Collection.xml')
root = tree.getroot()

for child in root:
    for children in child:
        for books in children:
            print (books.text)
        print ('\n')

Here is the result:

My Collection.enl
1
None
None
None
None
None
Dept. of Housing and Urban Development, Office of Policy Development and Research : for sale by the Supt. of Docs., U.S. Govt. Print. Off.
None
Energy;Green Buildings;High Performance Buildings

My Collection.enl
1
None
None
None
692-699
None
None
American Society of Civil Engineers
9780784409374
ISBN 978-0-7844-1302-9
None
Computing


My Collection.enl
0
None
None
None
291-314
4
4
None
None
None
Computing;Design;Green Buildings


My Collection.enl
0
None
None
None
1847-1870
3
9
None
None
10.3390/rs3091847
None
Infrared;Laser scanning


My Collection.enl
0
None
None
None
Nr. 15
15
None
None
ISSN~~1435-618X
ISSN 1435-618X
None
Outdoor Thermal Comfort;Urban Desgin
Traceback (most recent call last):
  File "Mend_lib_Xml_Excel.py", line 9, in <module>
    print (books.text)
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufffd' in position 679: character maps to <undefined>


Answer 1

A common problem when retrieving data from an XML file is that you are not on the node you think you are on.

So confirm your assumptions. Print each node's name (its tag) rather than its text to see which nodes you are actually visiting.
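As a rough illustration of that check, the sketch below walks a tree and prints each element's tag, indented by depth, next to whatever text the element holds itself. The `SAMPLE` string and the `describe` helper are made up for this example; the sample is a cut-down record shaped like the XML in the question, not the asker's real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down sample, shaped like the XML in the question.
SAMPLE = """<xml><records><record>
<ref-type name="Book">1</ref-type>
<titles><title>Computing in Civil Engineering</title></titles>
<dates><year>2007</year></dates>
</record></records></xml>"""


def describe(element, depth=0):
    """Return one line per element: indented tag name plus its own text."""
    # element.text is None for empty elements, so fall back to "".
    text = (element.text or "").strip()
    lines = ["{}{}: {!r}".format("  " * depth, element.tag, text)]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
outline = describe(root)
print("\n".join(outline))
```

In the printed outline, containers such as titles and dates show an empty string, while the text sits one level deeper on title and year.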

If you're having trouble with a particular record, simplify the problem: reduce your XML file to just that record and test again, printing the node names once more. Something in that record may be causing your code to fail: it may be malformed, or it may have a different structure or different data.
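One way to try this without editing the file by hand is to pull a single record out of the parsed tree and re-parse it on its own. This is only a sketch: `SAMPLE` stands in for the real file, `isolate_record` is a hypothetical helper, and the `./records/record` path assumes the layout shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the full file; the real tree would come from ET.parse(...).
SAMPLE = """<xml><records>
<record><titles><title>First book</title></titles><label>Energy</label></record>
<record><titles/><label>Computing</label></record>
</records></xml>"""


def isolate_record(root, index):
    """Return the index-th <record>, re-parsed as a standalone element."""
    records = root.findall("./records/record")
    # Serialize just this record, then parse it again in isolation.
    return ET.fromstring(ET.tostring(records[index]))


root = ET.fromstring(SAMPLE)
suspect = isolate_record(root, 1)
tags = [element.tag for element in suspect.iter()]
print(tags)
```

Here the second record turns out to have an empty titles element with no title child, which is the kind of structural difference that would explain a missing book title.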

One problem with the code above is that...

print (children.text)

will print None (or only whitespace) when the node has no text of its own, for example when it is a parent that only contains child nodes. The <titles> tag is an example: it has no text, just a child node, <title>, and that child node holds the text. You therefore need to navigate to the child node to access the title text.
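A minimal sketch of navigating to the child nodes, assuming the record layout shown in the question: `findtext` and `findall` take a path relative to each record, so the nested title, year, and authors can be read directly. `SAMPLE` and the `summarize` helper are illustrative names, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample built from the second record in the question.
SAMPLE = """<xml><records><record>
<contributors><authors><author>Akinci, Burcu</author><author>Ph, D</author></authors></contributors>
<titles><title>Computing in Civil Engineering</title></titles>
<periodical/>
<dates><year>2007</year></dates>
</record></records></xml>"""


def summarize(record):
    """Collect a few nested fields from one <record> element."""
    return {
        # findtext returns the default when the path matches nothing.
        "title": record.findtext("titles/title", default=""),
        "year": record.findtext("dates/year", default=""),
        "authors": [a.text for a in record.findall("contributors/authors/author")],
    }


root = ET.fromstring(SAMPLE)
summaries = [summarize(record) for record in root.iter("record")]
print(summaries)
```

Using a default of an empty string means a record without a title yields "" instead of None, which makes missing fields easier to spot and handle.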

Answer 1: 1 ACCEPTED 2015-12-03 02:54:50