
How to convert parsed XML to a pandas DataFrame or CSV in Python?

My sample XML is:

<RecordContainer RecordNumber = "1">
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
</RecordContainer>

My code to parse it, where sample_data is a string holding the XML above:

import xml.etree.ElementTree as ET
tree = ET.fromstring("<root>"+ sample_data + "</root>")

After parsing, I want to convert it to a pandas DataFrame or a CSV file. This is my code for the DataFrame conversion:

def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
        return result

d = f(tree, {})
df = pd.DataFrame(d, index=['values'])

But the above code gives me an empty DataFrame.

How do I convert the parsed XML above to a pandas DataFrame or a CSV file?

The code below converts the XML into a list of flat dicts, one per RecordContainer, and builds a DataFrame from that list:

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<r><RecordContainer RecordNumber = "1">
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
 </catalog>
</RecordContainer>
<RecordContainer RecordNumber = "2">
 <catalog>  
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>
</RecordContainer></r>'''

root = ET.fromstring(xml)
records = []
containers = root.findall('.//RecordContainer')
for container in containers:
    entry = dict(container.attrib)  # copy, so the parsed tree is not modified
    book = container.find('.//catalog/book')
    entry.update(book.attrib)
    for child in list(book):
        entry[child.tag] = child.text
    records.append(entry)

for rec in records:
    print(rec)
df = pd.DataFrame(records)
print(df)

Output:

{'RecordNumber': '1', 'id': 'bk101', 'author': 'Gambardella, Matthew', 'title': "XML Developer's Guide", 'genre': 'Computer', 'price': '44.95', 'publish_date': '2000-10-01', 'description': 'An in-depth look at creating applications \n      with XML.'}
{'RecordNumber': '2', 'id': 'bk102', 'author': 'Ralls, Kim', 'title': 'Midnight Rain', 'genre': 'Fantasy', 'price': '5.95', 'publish_date': '2000-12-16', 'description': 'A former architect battles corporate zombies, \n      an evil sorceress, and her own childhood to become queen \n      of the world.'}

  RecordNumber                author  ... publish_date                  title
0            1  Gambardella, Matthew  ...   2000-10-01  XML Developer's Guide
1            2            Ralls, Kim  ...   2000-12-16          Midnight Rain

[2 rows x 8 columns]
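
The question also asks about CSV, which the answer above stops short of. A minimal sketch of that last step follows, assuming a shortened copy of the sample data (fewer fields per book) and an in-memory buffer in place of a real file. Collapsing the whitespace in the text and converting price to a number are optional additions, not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened sample data in the same shape as the XML above (assumption:
# only three child elements per book, to keep the example small).
xml = '''<r>
<RecordContainer RecordNumber="1"><catalog><book id="bk101">
  <author>Gambardella, Matthew</author><title>XML Developer's Guide</title><price>44.95</price>
</book></catalog></RecordContainer>
<RecordContainer RecordNumber="2"><catalog><book id="bk102">
  <author>Ralls, Kim</author><title>Midnight Rain</title><price>5.95</price>
</book></catalog></RecordContainer>
</r>'''

records = []
for container in ET.fromstring(xml).findall('.//RecordContainer'):
    entry = dict(container.attrib)  # copy, so the parsed tree is not modified
    book = container.find('./catalog/book')
    entry.update(book.attrib)
    for child in book:
        # Collapse line breaks and indentation inside multi-line text.
        entry[child.tag] = ' '.join((child.text or '').split())
    records.append(entry)

df = pd.DataFrame(records)
df['price'] = pd.to_numeric(df['price'])  # every value parsed from XML is a str

# Write the CSV to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a path to to_csv, for example df.to_csv('books.csv', index=False), where books.csv is a hypothetical file name. index=False keeps the 0, 1, ... row labels out of the file.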

