Converting complex XML file to Pandas dataframe/CSV - Python

Question

I'm currently in the middle of converting a complex XML file to csv or pandas df. I have zero experience with xml data format and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?

There are lots of elements in the data that I do not need so I won't include those here.

For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.

<RefData>
  <Attributes>
    <Id>1011</Id>
    <FullName>xxxx</FullName>
    <ShortName>xx</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>000</PolicyID>
  <TradeDetails>
    <UniqueTradeId>000</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/2/2019</Date>
    </TradeDetails>
</RefData>
<RefData>
  <Attributes>
    <Id>1012</Id>
    <FullName>xxx2</FullName>
    <ShortName>x2</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>002</PolicyID>
  <TradeDetails>
    <UniqueTradeId>0022</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/3/2019</Date>
    </TradeDetails>
</RefData>

I would be needing everything in the tag.

Ideally I want the headers and output to look like this:

I would sincerely appreciate any help I can get on this. Thanks a mil.

Answer 1

One correction concerning your input XML file: It has to contain a single main element (of any name) and within it your RefData elements.

So the input file actually contains:

<Main>
  <RefData>
    ...
  </RefData>
  <RefData>
    ...
  </RefData>
</Main>

To process the input XML file, you can use lxml package, so to import it start from:

from lxml import etree as et

Then I noticed that you actually don't need the whole parsed XML tree, so the usually applied scheme is to:

read the content of each element as soon as it has been parsed,
save the content (text) of any child elements in any intermediate data structure (I chose a list of dictionaries),
drop the source XML element (not needed any more),
after the reading loop, create the result DataFrame from the above intermediate data structure.

So my code looks like below:

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
    rows.append({'id':   elem.findtext('Attributes/Id'),
        'fullname':      elem.findtext('Attributes/FullName'),
        'shortname':     elem.findtext('Attributes/ShortName'),
        'country':       elem.findtext('Attributes/Country'),
        'currency':      elem.findtext('Attributes/Currency'),
        'Policy ID':     elem.findtext('PolicyID'),
        'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
        'Booking':       elem.findtext('TradeDetails/Booking'),
        'Date':          elem.findtext('TradeDetails/Date')
    })
    elem.clear()
    elem.getparent().remove(elem)
df = pd.DataFrame(rows)

To fully comprehend details, search the Web for description of lxml and each method used.

For your sample data the result is:

     id fullname shortname country currency Policy ID UniqueTradeId Booking      Date
0  1011     xxxx        xx      UK      GBP       000           000      UK 12/2/2019 
1  1012     xxx2        x2      UK      GBP       002          0022      UK 12/3/2019

Probably the last step to perform is to save the above DataFrame in a CSV file, but I suppose you know how to do it.

Answer 2

Another way to do it, using lxml and xpath:

   from lxml import etree
   dat = """[your FIXED xml]"""
   doc = etree.fromstring(dat)
   columns = []
   rows = []
   to_delete = ["TradeDetails",'Attributes']
   body = doc.xpath('.//RefData')
   for el in body[0].xpath('.//*'):
      columns.append(el.tag)

   for b in body:    
        items = b.xpath('.//*')
        row = []
        for item in items:
           if item.tag not in to_delete:
               row.append(item.text)
        rows.append(row)
   for col in to_delete:
      if col in columns:
         columns.remove(col)

    pd.DataFrame(rows,columns=columns)

Output is the dataframe indicated in your question.

Converting complex XML file to Pandas dataframe/CSV - Python

Question

2 answers

solution1
3 2020-03-26 12:21:16

solution2
1 ACCPTED 2020-03-26 12:45:05

Converting complex XML file to Pandas dataframe/CSV - Python

Question

2 answers

solution1 3 2020-03-26 12:21:16

solution2 1 ACCPTED 2020-03-26 12:45:05

solution1
3 2020-03-26 12:21:16

solution2
1 ACCPTED 2020-03-26 12:45:05