Convert entries of xml file into pandas dataframe
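I have the following xml file. I would like to extract all lines that start with ItemDescription and create a dataframe from them, so that one column contains the color, another the ID, another the letter, and so on. How can I do this?
I tried the xml.etree.ElementTree package, but I could not build the dataframe because I could not access the elements in the desired lines. I don't want to use pandas_read_xml because, as far as I can tell, it is only available through pip. Even after I updated pandas, pd.read_xml does not work either. Is there a solid way to do this with xml.etree.ElementTree or another not-too-fancy package?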
<?xml version="1.0" ?>
<OrderList>
<ItemDescriptions>
<ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
<ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
<ItemDescription Color="indiaread" ID="2" Letter="a" Type="Letter" Weight="0.62"/>
<ItemDescription Color="brown" ID="3" Letter="a" Type="Letter" Weight="2.92"/>
<ItemDescription Color="firedbrick" ID="4" Letter="a" Type="Letter" Weight="2.34"/>
<ItemDescription Color="maroon" ID="5" Letter="a" Type="Letter" Weight="0.53"/>
<ItemDescription Color="darkred" ID="6" Letter="a" Type="Letter" Weight="2.72"/>
</ItemDescriptions>
<ItemBundles/>
<Orders>
<Order TimeStamp="">
<Positions>
<Position Count="1" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="18"/>
</Positions>
</Order>
<Order TimeStamp="">
<Positions>
<Position Count="2" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="12"/>
<Position Count="2" ItemDescriptionID="14"/>
<Position Count="1" ItemDescriptionID="18"/>
<Position Count="1" ItemDescriptionID="16"/>
</Positions>
</Order>
</Orders>
</OrderList>
Use read_xml with xpath (requires pandas 1.3.0 or later):
>>> pd.read_xml('data.xml', xpath='./ItemDescriptions/ItemDescription')
        Color  ID Letter    Type  Weight
0   rosybrown   0      a  Letter    1.67
1  lightcoral   1      a  Letter    0.91
2   indiaread   2      a  Letter    0.62
3       brown   3      a  Letter    2.92
4  firedbrick   4      a  Letter    2.34
5      maroon   5      a  Letter    0.53
6     darkred   6      a  Letter    2.72
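If pd.read_xml fails because lxml is not installed, one possible workaround is to ask pandas to use the standard-library parser through parser='etree'. The following is a minimal sketch under that assumption; the inline XML string is a shortened copy of the file from the question, wrapped in StringIO so that it can be run without data.xml.

```python
import io
import pandas as pd

# Shortened copy of the question's XML (assumption: same structure as data.xml)
xml = '''<OrderList>
<ItemDescriptions>
<ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
<ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
</ItemDescriptions>
<ItemBundles/>
</OrderList>'''

# parser='etree' uses xml.etree.ElementTree instead of lxml;
# it only supports a limited XPath subset, which is enough for this path
df = pd.read_xml(io.StringIO(xml),
                 xpath='./ItemDescriptions/ItemDescription',
                 parser='etree')
print(df)
```

With a real file, the first argument would simply be the path, for example 'data.xml'.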
Alternatively, with lxml:
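import pandas as pd
from lxml import etree

# Parse the file, then build one dict of attributes per ItemDescription element
tree = etree.parse('data.xml')
df = pd.DataFrame([dict(elmt.items())
                   for elmt in tree.xpath('.//ItemDescription')])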
Using ElementTree (no external library required)
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" ?>
<OrderList>
<ItemDescriptions>
<ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
<ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
<ItemDescription Color="indiaread" ID="2" Letter="a" Type="Letter" Weight="0.62"/>
<ItemDescription Color="brown" ID="3" Letter="a" Type="Letter" Weight="2.92"/>
<ItemDescription Color="firedbrick" ID="4" Letter="a" Type="Letter" Weight="2.34"/>
<ItemDescription Color="maroon" ID="5" Letter="a" Type="Letter" Weight="0.53"/>
<ItemDescription Color="darkred" ID="6" Letter="a" Type="Letter" Weight="2.72"/>
</ItemDescriptions>
<ItemBundles/>
<Orders>
<Order TimeStamp="">
<Positions>
<Position Count="1" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="18"/>
</Positions>
</Order>
<Order TimeStamp="">
<Positions>
<Position Count="2" ItemDescriptionID="9"/>
<Position Count="1" ItemDescriptionID="12"/>
<Position Count="2" ItemDescriptionID="14"/>
<Position Count="1" ItemDescriptionID="18"/>
<Position Count="1" ItemDescriptionID="16"/>
</Positions>
</Order>
</Orders>
</OrderList>'''
root = ET.fromstring(xml)
data = [desc.attrib for desc in root.findall('.//ItemDescription')]
df = pd.DataFrame(data)
print(df)
Output
        Color  ID Letter    Type  Weight
0   rosybrown   0      a  Letter    1.67
1  lightcoral   1      a  Letter    0.91
2   indiaread   2      a  Letter    0.62
3       brown   3      a  Letter    2.92
4  firedbrick   4      a  Letter    2.34
5      maroon   5      a  Letter    0.53
6     darkred   6      a  Letter    2.72
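One thing to keep in mind with the ElementTree approach: XML attribute values are always strings, so ID and Weight end up as text columns even though they print like numbers. A minimal sketch of casting them afterwards, assuming ID should be an integer and Weight a float (the XML here is a shortened copy of the sample above):

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened copy of the sample XML
xml = '''<OrderList>
<ItemDescriptions>
<ItemDescription Color="rosybrown" ID="0" Letter="a" Type="Letter" Weight="1.67"/>
<ItemDescription Color="lightcoral" ID="1" Letter="a" Type="Letter" Weight="0.91"/>
</ItemDescriptions>
</OrderList>'''

root = ET.fromstring(xml)
df = pd.DataFrame([desc.attrib for desc in root.findall('.//ItemDescription')])

# Every column holds strings at this point; cast the numeric ones explicitly
df = df.astype({'ID': int, 'Weight': float})
print(df.dtypes)
```

To read from a file instead of a string, ET.parse('data.xml').getroot() can take the place of ET.fromstring(xml).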