This is what my XML file looks like:
<?xml version="1.0" encoding="UTF-8" ?>
<deIdi2b2>
<TEXT><![CDATA[
A bunch of random texts
]]></TEXT>
<TAGS>
<DATE id="P0" start="16" end="26" text="2067-05-03" TYPE="DATE" comment="" />
<AGE id="P1" start="50" end="52" text="55" TYPE="AGE" comment="" />
<NAME id="P2" start="290" end="296" text="Oakley" TYPE="DOCTOR" comment="" />
<DATE id="P3" start="297" end="303" text="4/5/67" TYPE="DATE" comment="" />
<LOCATION id="P4" start="343" end="353" text="Clarkfield" TYPE="HOSPITAL" comment="" />
<DATE id="P5" start="363" end="367" text="7/67" TYPE="DATE" comment="" />
<AGE id="P6" start="637" end="639" text="37" TYPE="AGE" comment="" />
<AGE id="P7" start="694" end="696" text="66" TYPE="AGE" comment="" />
<DATE id="P8" start="755" end="759" text="2062" TYPE="DATE" comment="" />
<DATE id="P9" start="899" end="903" text="4/63" TYPE="DATE" comment="" />
<DATE id="P10" start="940" end="944" text="2065" TYPE="DATE" comment="" />
<DATE id="P11" start="1028" end="1032" text="2/67" TYPE="DATE" comment="" />
<NAME id="P12" start="1037" end="1043" text="Oakley" TYPE="DOCTOR" comment="" />
<DATE id="P13" start="1071" end="1075" text="2065" TYPE="DATE" comment="" />
<NAME id="P14" start="1974" end="1980" text="Oakley" TYPE="DOCTOR" comment="" />
<DATE id="P15" start="2284" end="2288" text="3/67" TYPE="DATE" comment="" />
</TAGS>
</deIdi2b2>
I would like each element in 'TAGS' to have its own row in a pandas DataFrame.
So in this case there would be 16 rows, and the column names would be 'id', 'start', 'end', 'text', 'TYPE', and 'comment'. (I don't necessarily need the sub-element name in the DataFrame, as it largely duplicates TYPE.)
## What I have tried so far
df = pd.read_xml('file.xml.txt')
This results in the following df.head():
TEXT DATE AGE NAME LOCATION
0 A bunch of random texts NaN NaN NaN NaN
1 None NaN NaN NaN NaN
From the pandas documentation
https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_xml.html
it looks like I have to specify something in the xpath argument, but I am having trouble figuring out what exactly.
I tried
df = pd.read_xml('file.xml', xpath=['TAGS'])
which resulted in an error, and
df = pd.read_xml('file.xml', xpath=".//TAGS")
for which df.head() gave
DATE AGE NAME LOCATION
0 NaN NaN NaN NaN
Try lxml. You can adjust the root.xpath expression depending on your actual nodes.
import pandas as pd
from lxml import objectify
xml = objectify.parse('Input.xml')
root = xml.getroot()
Output = []
columns = ['ID', 'Start', 'End', 'Text', 'Type', 'Comment']
# Each child of <TAGS> becomes one row built from its attributes
for TAGS in root.xpath('TAGS'):
    for tag in TAGS.iterchildren():
        Output.append([tag.attrib["id"], tag.attrib["start"], tag.attrib["end"],
                       tag.attrib["text"], tag.attrib["TYPE"], tag.attrib["comment"]])
# Write the rows to CSV without the index column
pd.DataFrame(Output, columns=columns).to_csv('./Output.csv', index=False)
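Note that the loop above stores every attribute as a string, including start and end. As a minimal sketch of the same idea, assuming only the standard library's xml.etree.ElementTree and a hypothetical inline sample trimmed from the question's file (the names SAMPLE and tag_rows are made up for illustration), the offsets can be cast to int while the rows are collected:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, trimmed from the file shown in the question.
SAMPLE = """<deIdi2b2>
<TEXT><![CDATA[
A bunch of random texts
]]></TEXT>
<TAGS>
<DATE id="P0" start="16" end="26" text="2067-05-03" TYPE="DATE" comment="" />
<AGE id="P1" start="50" end="52" text="55" TYPE="AGE" comment="" />
<NAME id="P2" start="290" end="296" text="Oakley" TYPE="DOCTOR" comment="" />
</TAGS>
</deIdi2b2>"""


def tag_rows(root):
    """Return one dict per child of <TAGS>, with start/end cast to int."""
    rows = []
    for tag in root.find("TAGS"):
        row = dict(tag.attrib)          # copy the attributes of this annotation
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        rows.append(row)
    return rows


rows = tag_rows(ET.fromstring(SAMPLE))
print(rows[0])
```

Passing rows to pd.DataFrame(rows) then gives integer start and end columns without a separate conversion step.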
Start by importing lxml.etree (and pandas, which is needed for the DataFrame below):
import pandas as pd
from lxml import etree as et
Then parse your source XML file:
tree = et.parse('Input.xml')
root = tree.getroot()
To create the output DataFrame, a single statement is enough: a list comprehension builds a list of dictionaries, one per child of TAGS, which serves as the source for the DataFrame:
result = pd.DataFrame([dict(it.attrib) for it in root.find('.//TAGS')])
There is no need to pass an explicit list of column names; the attribute names become the columns.
The result is:
     id  start   end        text      TYPE  comment
0    P0     16    26  2067-05-03      DATE
1    P1     50    52          55       AGE
2    P2    290   296      Oakley    DOCTOR
3    P3    297   303      4/5/67      DATE
4    P4    343   353  Clarkfield  HOSPITAL
5    P5    363   367        7/67      DATE
6    P6    637   639          37       AGE
7    P7    694   696          66       AGE
8    P8    755   759        2062      DATE
9    P9    899   903        4/63      DATE
10  P10    940   944        2065      DATE
11  P11   1028  1032        2/67      DATE
12  P12   1037  1043      Oakley    DOCTOR
13  P13   1071  1075        2065      DATE
14  P14   1974  1980      Oakley    DOCTOR
15  P15   2284  2288        3/67      DATE
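The pd.read_xml route from the question can probably be made to work as well. read_xml builds one row per node matched by xpath and takes that node's attributes and child elements as columns, so xpath=".//TAGS" matches the single TAGS node and yields one row of empty child columns. Pointing xpath at the children of TAGS instead should give one row per annotation. The following is a sketch under assumptions: the sample is a hypothetical inline excerpt of the question's file, and parser="etree" is used only to avoid requiring lxml.

```python
import io

import pandas as pd

# Hypothetical inline sample, trimmed from the file shown in the question.
XML_BYTES = b"""<?xml version="1.0" encoding="UTF-8" ?>
<deIdi2b2>
<TEXT><![CDATA[
A bunch of random texts
]]></TEXT>
<TAGS>
<DATE id="P0" start="16" end="26" text="2067-05-03" TYPE="DATE" comment="" />
<AGE id="P1" start="50" end="52" text="55" TYPE="AGE" comment="" />
<NAME id="P2" start="290" end="296" text="Oakley" TYPE="DOCTOR" comment="" />
</TAGS>
</deIdi2b2>"""

# ".//TAGS/*" selects every child of <TAGS>, whatever its element name,
# so each annotation becomes one row and its attributes become columns.
# parser="etree" is an assumption made here to avoid the lxml dependency.
df = pd.read_xml(io.BytesIO(XML_BYTES), xpath=".//TAGS/*", parser="etree")
print(df[["id", "start", "end", "text", "TYPE"]])
```

With a file on disk, the equivalent call would be pd.read_xml('file.xml', xpath='.//TAGS/*').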