简体   繁体   中英

Convert entries of xml file into pandas dataframe

I have the following xml file. I want to extract all lines starting with ItemDescription and create a dataframe out of them such that one column contains the color, another the ID, another the Letter and so on. How can I do this?

I tried it with the package xml.etree.ElementTree but I could not produce dataframe because I could not access the elements in the desired lines. I do not want to use pandas_read_xml because it works only with pip, I guess. pd.read_xml also does not work even though I updated pandas. Is there a reliable way to do it with xml.etree.ElementTree or with another package that is not too fancy?

<?xml version="1.0" ?>
<OrderList>
   <ItemDescriptions>
      <ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
      <ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
      <ItemDescription Color="indiaread" ID="2" Letter="a" Type="Letter" Weight="0.62"/>
      <ItemDescription Color="brown" ID="3" Letter="a" Type="Letter" Weight="2.92"/>
      <ItemDescription Color="firedbrick" ID="4" Letter="a" Type="Letter" Weight="2.34"/>
      <ItemDescription Color="maroon" ID="5" Letter="a" Type="Letter" Weight="0.53"/>
      <ItemDescription Color="darkred" ID="6" Letter="a" Type="Letter" Weight="2.72"/>
   </ItemDescriptions>
   <ItemBundles/>
   <Orders>
      <Order TimeStamp="">
         <Positions>
            <Position Count="1" ItemDescriptionID="9"/>
            <Position Count="1" ItemDescriptionID="18"/>
         </Positions>
      </Order>
      <Order TimeStamp="">
         <Positions>
            <Position Count="2" ItemDescriptionID="9"/>
            <Position Count="1" ItemDescriptionID="12"/>
            <Position Count="2" ItemDescriptionID="14"/>
            <Position Count="1" ItemDescriptionID="18"/>
            <Position Count="1" ItemDescriptionID="16"/>
         </Positions>
      </Order>
   </Orders>
</OrderList>

Use read_xml and xpath :

>>> pd.read_xml('data.xml', xpath='./ItemDescriptions/ItemDescription')

        Color  ID Letter    Type  Weight
0   rosybrown   0      a  Letter    1.67
1  lightcoral   1      a  Letter    0.91
2   indiaread   2      a  Letter    0.62
3       brown   3      a  Letter    2.92
4  firedbrick   4      a  Letter    2.34
5      maroon   5      a  Letter    0.53
6     darkred   6      a  Letter    2.72

Alternative with lxml :

from lxml import etree

tree = etree.parse('data.xml')

df = pd.DataFrame([dict(elmt.items()) 
                       for elmt in tree.xpath('.//ItemDescription')])

Use ElementTree (no external lib is required)

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version="1.0" ?>
<OrderList>
   <ItemDescriptions>
      <ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
      <ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
      <ItemDescription Color="indiaread" ID="2" Letter="a" Type="Letter" Weight="0.62"/>
      <ItemDescription Color="brown" ID="3" Letter="a" Type="Letter" Weight="2.92"/>
      <ItemDescription Color="firedbrick" ID="4" Letter="a" Type="Letter" Weight="2.34"/>
      <ItemDescription Color="maroon" ID="5" Letter="a" Type="Letter" Weight="0.53"/>
      <ItemDescription Color="darkred" ID="6" Letter="a" Type="Letter" Weight="2.72"/>
   </ItemDescriptions>
   <ItemBundles/>
   <Orders>
      <Order TimeStamp="">
         <Positions>
            <Position Count="1" ItemDescriptionID="9"/>
            <Position Count="1" ItemDescriptionID="18"/>
         </Positions>
      </Order>
      <Order TimeStamp="">
         <Positions>
            <Position Count="2" ItemDescriptionID="9"/>
            <Position Count="1" ItemDescriptionID="12"/>
            <Position Count="2" ItemDescriptionID="14"/>
            <Position Count="1" ItemDescriptionID="18"/>
            <Position Count="1" ItemDescriptionID="16"/>
         </Positions>
      </Order>
   </Orders>
</OrderList>'''

root = ET.fromstring(xml)
data = [desc.attrib for desc in root.findall('.//ItemDescription')]
df = pd.DataFrame(data)
print(df)

output

        Color ID Letter    Type Weight
0   rosybrown  0      a  Letter   1.67
1  lightcoral  1      a  Letter   0.91
2   indiaread  2      a  Letter   0.62
3       brown  3      a  Letter   2.92
4  firedbrick  4      a  Letter   2.34
5      maroon  5      a  Letter   0.53
6     darkred  6      a  Letter   2.72

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM