简体   繁体   中英

Converting complex XML file to Pandas dataframe/CSV - Python

I'm currently in the middle of converting a complex XML file to csv or pandas df. I have zero experience with xml data format and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?

There are lots of elements in the data that I do not need so I won't include those here.

For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.

<RefData>
  <Attributes>
    <Id>1011</Id>
    <FullName>xxxx</FullName>
    <ShortName>xx</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>000</PolicyID>
  <TradeDetails>
    <UniqueTradeId>000</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/2/2019</Date>
    </TradeDetails>
</RefData>
<RefData>
  <Attributes>
    <Id>1012</Id>
    <FullName>xxx2</FullName>
    <ShortName>x2</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>002</PolicyID>
  <TradeDetails>
    <UniqueTradeId>0022</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/3/2019</Date>
    </TradeDetails>
</RefData>

I would be needing everything in the tag.

Ideally I want the headers and output to look like this:

在此处输入图片说明

I would sincerely appreciate any help I can get on this. Thanks a mil.

One correction concerning your input XML file: It has to contain a single main element (of any name) and within it your RefData elements.

So the input file actually contains:

<Main>
  <RefData>
    ...
  </RefData>
  <RefData>
    ...
  </RefData>
</Main>

To process the input XML file, you can use lxml package, so to import it start from:

from lxml import etree as et

Then I noticed that you actually don't need the whole parsed XML tree, so the usually applied scheme is to:

  • read the content of each element as soon as it has been parsed,
  • save the content (text) of any child elements in any intermediate data structure (I chose a list of dictionaries),
  • drop the source XML element (not needed any more),
  • after the reading loop, create the result DataFrame from the above intermediate data structure.

So my code looks like below:

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
    rows.append({'id':   elem.findtext('Attributes/Id'),
        'fullname':      elem.findtext('Attributes/FullName'),
        'shortname':     elem.findtext('Attributes/ShortName'),
        'country':       elem.findtext('Attributes/Country'),
        'currency':      elem.findtext('Attributes/Currency'),
        'Policy ID':     elem.findtext('PolicyID'),
        'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
        'Booking':       elem.findtext('TradeDetails/Booking'),
        'Date':          elem.findtext('TradeDetails/Date')
    })
    elem.clear()
    elem.getparent().remove(elem)
df = pd.DataFrame(rows)

To fully comprehend details, search the Web for description of lxml and each method used.

For your sample data the result is:

     id fullname shortname country currency Policy ID UniqueTradeId Booking      Date
0  1011     xxxx        xx      UK      GBP       000           000      UK 12/2/2019 
1  1012     xxx2        x2      UK      GBP       002          0022      UK 12/3/2019

Probably the last step to perform is to save the above DataFrame in a CSV file, but I suppose you know how to do it.

Another way to do it, using lxml and xpath:

   from lxml import etree
   dat = """[your FIXED xml]"""
   doc = etree.fromstring(dat)
   columns = []
   rows = []
   to_delete = ["TradeDetails",'Attributes']
   body = doc.xpath('.//RefData')
   for el in body[0].xpath('.//*'):
      columns.append(el.tag)

   for b in body:    
        items = b.xpath('.//*')
        row = []
        for item in items:
           if item.tag not in to_delete:
               row.append(item.text)
        rows.append(row)
   for col in to_delete:
      if col in columns:
         columns.remove(col)

    pd.DataFrame(rows,columns=columns)

Output is the dataframe indicated in your question.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM