
How to open all xml sub-elements belonging to a certain element in a pandas dataframe, with each sub-element in a row

This is what my XML file looks like:

<?xml version="1.0" encoding="UTF-8" ?>
<deIdi2b2>
<TEXT><![CDATA[

A bunch of random texts 

]]></TEXT>
<TAGS>
<DATE id="P0" start="16" end="26" text="2067-05-03" TYPE="DATE" comment="" />
<AGE id="P1" start="50" end="52" text="55" TYPE="AGE" comment="" />
<NAME id="P2" start="290" end="296" text="Oakley" TYPE="DOCTOR" comment="" />
<DATE id="P3" start="297" end="303" text="4/5/67" TYPE="DATE" comment="" />
<LOCATION id="P4" start="343" end="353" text="Clarkfield" TYPE="HOSPITAL" comment="" />
<DATE id="P5" start="363" end="367" text="7/67" TYPE="DATE" comment="" />
<AGE id="P6" start="637" end="639" text="37" TYPE="AGE" comment="" />
<AGE id="P7" start="694" end="696" text="66" TYPE="AGE" comment="" />
<DATE id="P8" start="755" end="759" text="2062" TYPE="DATE" comment="" />
<DATE id="P9" start="899" end="903" text="4/63" TYPE="DATE" comment="" />
<DATE id="P10" start="940" end="944" text="2065" TYPE="DATE" comment="" />
<DATE id="P11" start="1028" end="1032" text="2/67" TYPE="DATE" comment="" />
<NAME id="P12" start="1037" end="1043" text="Oakley" TYPE="DOCTOR" comment="" />
<DATE id="P13" start="1071" end="1075" text="2065" TYPE="DATE" comment="" />
<NAME id="P14" start="1974" end="1980" text="Oakley" TYPE="DOCTOR" comment="" />
<DATE id="P15" start="2284" end="2288" text="3/67" TYPE="DATE" comment="" />
</TAGS>
</deIdi2b2>

I would like each element inside 'TAGS' to have its own row in a pandas dataframe.

In this case there would be 16 rows, and the column names would be 'id', 'start', 'end', 'text', 'TYPE', and 'comment'. (I don't necessarily need the sub-element name in the dataframe, since TYPE already carries that information.)

What I have tried so far

df = pd.read_xml('file.xml.txt')

Results in

df.head()
                      TEXT  DATE  AGE  NAME  LOCATION
0  A bunch of random texts   NaN  NaN   NaN       NaN
1                     None   NaN  NaN   NaN       NaN

From the pandas documentation

https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_xml.html

it looks like I have to pass something in the xpath argument, but I am having trouble figuring out what exactly.

I tried

df = pd.read_xml('file.xml', xpath=['TAGS'])

which resulted in an error,

and

df = pd.read_xml('file.xml', xpath=".//TAGS")

for which df.head() gave


   DATE  AGE  NAME  LOCATION
0   NaN  NaN   NaN       NaN
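A likely reading of the last result: the expression ".//TAGS" selects the TAGS container itself, so there is exactly one matching node and therefore one row, with the child element names as columns and their (empty) text as NaN. Selecting the children instead, with ".//TAGS/*", gives one node per tag. The sketch below illustrates the difference with the standard library's xml.etree.ElementTree on a shortened sample; the SAMPLE string and the variable names are assumptions made for illustration, not part of the original file.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same shape as the file in the question.
SAMPLE = """<deIdi2b2>
<TEXT><![CDATA[ A bunch of random texts ]]></TEXT>
<TAGS>
<DATE id="P0" start="16" end="26" text="2067-05-03" TYPE="DATE" comment="" />
<AGE id="P1" start="50" end="52" text="55" TYPE="AGE" comment="" />
<NAME id="P2" start="290" end="296" text="Oakley" TYPE="DOCTOR" comment="" />
</TAGS>
</deIdi2b2>"""

root = ET.fromstring(SAMPLE)

# ".//TAGS" matches the container element itself: a single node.
containers = root.findall(".//TAGS")

# ".//TAGS/*" matches every child of the container: one node per tag.
tags = root.findall(".//TAGS/*")

# The attributes of each child form one row.
rows = [dict(el.attrib) for el in tags]

print(len(containers), len(tags))
print(rows[0])
```

Under that reading, xpath=".//TAGS/*" would be the expression to try with pd.read_xml, since it reads the attributes of each selected node into columns by default.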

Try it with lxml. You can adjust the expression passed to root.xpath to match your actual nodes.

import pandas as pd
from lxml import objectify

# Parse the file and get the root element (<deIdi2b2>)
xml = objectify.parse('Input.xml')
root = xml.getroot()

columns = ['ID', 'Start', 'End', 'Text', 'Type', 'Comment']
attributes = ['id', 'start', 'end', 'text', 'TYPE', 'comment']

# Collect one row per child element of <TAGS>
output = []
for tags in root.xpath('TAGS'):
    for tag in tags.iterchildren():
        output.append([tag.attrib[name] for name in attributes])

pd.DataFrame(output, columns=columns).to_csv('./Output.csv', index=False)

Start by importing pandas and lxml.etree:

import pandas as pd
from lxml import etree as et

Then parse your source XML file:

tree = et.parse('Input.xml')
root = tree.getroot()

A single statement is then enough to create the output DataFrame. A list comprehension builds a list of dictionaries, one per child of TAGS, and that list is the source for the DataFrame:

result = pd.DataFrame([ dict(it.attrib) for it in root.find('.//TAGS') ])

There is no need to pass an explicit list of column names, because the attribute names become the columns.

The result is:

     id start   end        text      TYPE comment
0    P0    16    26  2067-05-03      DATE        
1    P1    50    52          55       AGE        
2    P2   290   296      Oakley    DOCTOR        
3    P3   297   303      4/5/67      DATE        
4    P4   343   353  Clarkfield  HOSPITAL        
5    P5   363   367        7/67      DATE        
6    P6   637   639          37       AGE        
7    P7   694   696          66       AGE        
8    P8   755   759        2062      DATE        
9    P9   899   903        4/63      DATE        
10  P10   940   944        2065      DATE        
11  P11  1028  1032        2/67      DATE        
12  P12  1037  1043      Oakley    DOCTOR        
13  P13  1071  1075        2065      DATE        
14  P14  1974  1980      Oakley    DOCTOR        
15  P15  2284  2288        3/67      DATE 
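Note that XML attribute values are text, so every column in this frame, including start and end, holds strings. If numeric offsets are needed, one option is to convert them while building the rows. The following is a minimal sketch using the standard library's xml.etree.ElementTree; the SAMPLE string and the tag_rows helper are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same shape as the file in the question.
SAMPLE = """<deIdi2b2>
<TAGS>
<DATE id="P0" start="16" end="26" text="2067-05-03" TYPE="DATE" comment="" />
<AGE id="P1" start="50" end="52" text="55" TYPE="AGE" comment="" />
</TAGS>
</deIdi2b2>"""


def tag_rows(xml_text):
    """Return one dict per child of <TAGS>, with start and end as integers."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.find(".//TAGS"):
        row = dict(el.attrib)
        # Attribute values are strings; convert the two offsets explicitly.
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


rows = tag_rows(SAMPLE)
print(rows)
```

Passing rows to pd.DataFrame would then give integer start and end columns. For a frame that already exists, calling astype on those two columns is an alternative.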
