Read XML file and convert it to a table (DataFrame)
This is the first time I am dealing with an XML file, so I am very lost. I would appreciate any help. All I want is to read the file and convert it to a regular table (DataFrame).
I have a file with this structure:
<?xml version="1.0" encoding='UTF-8'?>
<LucroCliente xmlns:my='http://www.ms.com/pace' xmlns='http://www.ms.com/pace' Cab_Usuario='UsuI' Cab_DadosEmpresa='' Cab_RazaoEmpresa='CoLtda.' Cab_Aplicativo='Comercial' Cab_Data='25/07/2022 14:40:38' Cab_Titulo='Relatório Cab_Titulo' Selecao='Selecao' Periodo='Período: 01/01/2020 - 31/12/2020'>
<Filial Filial=''>
<Linha TotalLinha='TOA 21: 2.313.292,43'>
<Produto Coluna1='21-851611' Coluna2='CAMIO VO' Coluna3='' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3667984' Coluna2='SCA4X2' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3667994' Coluna2='SCA963' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3676543' Coluna2='SCA713' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3676601' Coluna2='SCA97' Coluna3='-1' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3814014' Coluna2='CAMIX2' Coluna3='' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3814087' Coluna2='SCA56' Coluna3='' Coluna4='' Coluna5=''>
<AnaliseDiaria Coluna6='' Coluna7='' Coluna8='' Coluna9='' Coluna10='' Coluna11='' Coluna12='' Coluna13='' Coluna14='' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='19/06/20' Coluna7='01' Coluna8='EP 202022777' Coluna9='1 UN' Coluna10='195.000,00' Coluna11='195.000,00' Coluna12='1' Coluna13='195.000,00' Coluna14='195.000,00' Coluna15='NF9' Coluna16='10203910A'/>
<AnaliseDiaria Coluna6='13/07/20' Coluna7='01' Coluna8='RCP G 41765' Coluna9='0 UN' Coluna10='' Coluna11='90,00' Coluna12='1' Coluna13='195.090,00' Coluna14='195.090,00' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41767' Coluna9='0 UN' Coluna10='' Coluna11='180,00' Coluna12='1' Coluna13='195.270,00' Coluna14='195.270,00' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41768' Coluna9='0 UN' Coluna10='' Coluna11='212,60' Coluna12='1' Coluna13='195.482,60' Coluna14='195.482,60' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41770' Coluna9='0 UN' Coluna10='' Coluna11='145,20' Coluna12='1' Coluna13='195.627,80' Coluna14='195.627,80' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='RCP G 41771' Coluna9='0 UN' Coluna10='' Coluna11='8.902,02' Coluna12='1' Coluna13='204.529,82' Coluna14='204.529,82' Coluna15='' Coluna16=''/>
<AnaliseDiaria Coluna6='27/07/20' Coluna7='01' Coluna8='VP 323755' Coluna9='-1 UN' Coluna10='204.529,82' Coluna11='-204.529,82' Coluna12='0' Coluna13='' Coluna14='' Coluna15='' Coluna16='158PES'/>
</Produto>
</Linha>
</Filial>
</LucroCliente>
I tried multiple solutions I found here, but none of them worked. For example, the first solution:
xml_data = open('file.xml', 'r').read()
root = et.XML(xml_data)  # Parse XML
data = []
cols = []
for i, child in enumerate(root):
    data.append([subchild.text for subchild in child])
    cols.append(child.tag)
df = pd.DataFrame(data).T
df.columns = cols
Second solution:
xml_data = objectify.parse('file.xml')
root = xml_data.getroot()
data = []
cols = []
for i in range(len(root.getchildren())):
    child = root.getchildren()[i]
    data.append([subchild.text for subchild in child.getchildren()])
    cols.append(child.tag)
df = pd.DataFrame(data).T
df.columns = cols
My end table should look like this:
| Coluna1 | Coluna2 | Coluna3 | Coluna4 | Coluna5 | Coluna6 | Coluna7 | Coluna8 | Coluna9 | Coluna10 | Coluna11 | Coluna12 | Coluna13 | Coluna14 | Coluna15 | Coluna16 |
| --------- | -------- | ------- | ------- | ------- | ------ | ------- | ------- | ------- | ------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 21-851611 | CAMIO VO | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | |
| | | | | | | | | | | | | | | | |
import csv

import numpy as np
from lxml import etree
from tqdm import tqdm

path = 'Filename.xml'

all_columns = {
    # column names for the dataframe
}

# Stream the file and yield each element named 'row' once it is fully parsed
context = etree.iterparse(path, events=('end',), tag='row')


def find_missing_keys(input_keys, target_keys):
    # Columns that are expected but absent from the current element
    return list(set(target_keys) - set(input_keys))


with open('Filename.csv', 'w', encoding="utf-8") as csvFile:
    writer = csv.DictWriter(csvFile, fieldnames=list(all_columns))
    writer.writeheader()
    for i, ret in tqdm(enumerate(context)):
        event, element = ret
        row = dict(element.attrib)
        missing_keys = find_missing_keys(list(row.keys()), list(all_columns))
        for each_missing_key in missing_keys:
            row[each_missing_key] = np.nan
        writer.writerow(row)
        # Release the element and its already-processed siblings to save memory
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
Hope this helps!
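For the XML in the question, the tag='row' filter above would match nothing, because the elements there are called Produto and AnaliseDiaria and live in the http://www.ms.com/pace namespace. The following is a minimal sketch of the same streaming idea using only the standard library, assuming the columns are Coluna1 to Coluna16 as in the question. The names xml_to_csv, COLUMNS and SAMPLE_XML are hypothetical, and the sample is a trimmed-down stand-in for the real file.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{http://www.ms.com/pace}"  # namespace URI from the question's XML
COLUMNS = [f"Coluna{n}" for n in range(1, 17)]  # Coluna1 .. Coluna16

# Hypothetical trimmed-down sample with the same element names as the question.
SAMPLE_XML = b"""<LucroCliente xmlns='http://www.ms.com/pace'>
<Produto Coluna1='21-851611' Coluna2='CAMIO VO'>
<AnaliseDiaria Coluna6='' Coluna16=''/>
</Produto>
<Produto Coluna1='21-3814087' Coluna2='SCA56'>
<AnaliseDiaria Coluna6='19/06/20' Coluna16='10203910A'/>
</Produto>
</LucroCliente>"""


def xml_to_csv(source, target):
    """Stream `source` and write one CSV row per AnaliseDiaria element."""
    # restval fills columns that an element does not provide
    writer = csv.DictWriter(target, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    produto_attrs = {}
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start" and element.tag == NS + "Produto":
            # Attributes are already available on the start event
            produto_attrs = dict(element.attrib)
        elif event == "end" and element.tag == NS + "AnaliseDiaria":
            writer.writerow({**produto_attrs, **element.attrib})
            element.clear()
        elif event == "end" and element.tag == NS + "Produto":
            element.clear()  # free memory for products already written


buffer = io.StringIO()
xml_to_csv(io.BytesIO(SAMPLE_XML), buffer)
print(buffer.getvalue())
```

With a real file, source would be the path to the XML file and target a file opened with open('Filename.csv', 'w', newline='', encoding='utf-8'). Using restval here takes the place of the explicit missing-key loop.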
Fortunately, in the case of the XML in your question, you can use the pandas read_xml() method, although you'll have to work around the namespace issue:
import pandas as pd
pd.read_xml('file.xml', xpath='//*[local-name()="Linha"]//*[local-name()="Produto"]')
Output:
Coluna1 Coluna2 Coluna3 Coluna4 Coluna5 {http://www.ms.com/pace}AnaliseDiaria
0 21-851611 CAMIO VO NaN NaN NaN NaN
1 21-3667984 SCA4X2 -1.0 NaN NaN NaN
2 21-3667994 SCA963 -1.0 NaN NaN NaN
etc. If you are not interested in one column or another, you can simply drop() it.
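As a rough illustration of the drop() step, here is a sketch that reads a hypothetical two-product sample (SAMPLE_XML is an assumption standing in for the real file) with the same XPath and then removes the column created by the AnaliseDiaria child element. Because the name of that column may or may not include the namespace depending on the pandas version, the sketch matches it by suffix instead of hard-coding it.

```python
import io

import pandas as pd

# Hypothetical two-product sample with the same shape as the question's XML.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<LucroCliente xmlns='http://www.ms.com/pace'>
  <Filial Filial=''>
    <Linha TotalLinha='TOA 21: 2.313.292,43'>
      <Produto Coluna1='21-851611' Coluna2='CAMIO VO' Coluna3=''>
        <AnaliseDiaria Coluna6='' Coluna7=''/>
      </Produto>
      <Produto Coluna1='21-3667984' Coluna2='SCA4X2' Coluna3='-1'>
        <AnaliseDiaria Coluna6='' Coluna7=''/>
      </Produto>
    </Linha>
  </Filial>
</LucroCliente>"""

produtos = pd.read_xml(
    io.BytesIO(SAMPLE_XML),
    xpath='//*[local-name()="Linha"]//*[local-name()="Produto"]',
)

# The child element shows up as an extra, empty column; find it by suffix
# so the code works whether or not the namespace is part of the name.
extra = [col for col in produtos.columns if col.endswith("AnaliseDiaria")]
produtos = produtos.drop(columns=extra)
print(produtos)
```

Passing attrs_only=True to read_xml() should also avoid the extra column, since only the attributes of each Produto node are then parsed.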
Given the two levels of nodes that cover the Coluna attributes, consider XSLT, the special-purpose language designed to transform or style original XML files. Python's lxml can run XSLT 1.0 scripts and, being the default parser of pandas.read_xml, can transform your raw XML into a flatter version to parse into a DataFrame.
XSLT (save as a .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:pace='http://www.ms.com/pace'>
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- REDESIGN XML TO ONLY RETURN AnaliseDiaria NODES -->
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="descendant::pace:AnaliseDiaria"/>
    </xsl:copy>
  </xsl:template>

  <!-- REDESIGN AnaliseDiaria NODES -->
  <xsl:template match="pace:AnaliseDiaria">
    <xsl:copy>
      <!-- BRING DOWN Produto ATTRIBUTES WITH CURRENT ATTRIBUTES -->
      <xsl:copy-of select="ancestor::pace:Produto/@*|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
Python
import pandas as pd

analise_diaria_df = pd.read_xml("input.xml", stylesheet="style.xsl")
analise_diaria_df
# Coluna1 Coluna2 Coluna3 ... Coluna14 Coluna15 Coluna16
# 0 21-851611 CAMIO VO NaN ... NaN NaN NaN
# 1 21-3667984 SCA4X2 -1.0 ... NaN NaN NaN
# 2 21-3667994 SCA963 -1.0 ... NaN NaN NaN
# 3 21-3676543 SCA713 -1.0 ... NaN NaN NaN
# 4 21-3676601 SCA97 -1.0 ... NaN NaN NaN
# 5 21-3814014 CAMIX2 NaN ... NaN NaN NaN
# 6 21-3814087 SCA56 NaN ... NaN NaN NaN
# 7 21-3814087 SCA56 NaN ... 195.000,00 NF9 10203910A
# 8 21-3814087 SCA56 NaN ... 195.090,00 NaN NaN
# 9 21-3814087 SCA56 NaN ... 195.270,00 NaN NaN
# 10 21-3814087 SCA56 NaN ... 195.482,60 NaN NaN
# 11 21-3814087 SCA56 NaN ... 195.627,80 NaN NaN
# 12 21-3814087 SCA56 NaN ... 204.529,82 NaN NaN
# 13 21-3814087 SCA56 NaN ... NaN NaN 158PES
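If XSLT is not an option, the same flattening (one row per AnaliseDiaria, with the attributes of its parent Produto copied down) can be sketched with the standard library's ElementTree. This is only an illustration under stated assumptions: flatten_rows and SAMPLE_XML are hypothetical names, and the sample is a shortened stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns declaration in the question's XML.
NS = {"pace": "http://www.ms.com/pace"}

# Hypothetical trimmed-down sample with the same structure as the question's file.
SAMPLE_XML = """<LucroCliente xmlns='http://www.ms.com/pace'>
  <Filial Filial=''>
    <Linha TotalLinha='TOA 21: 2.313.292,43'>
      <Produto Coluna1='21-851611' Coluna2='CAMIO VO' Coluna3=''>
        <AnaliseDiaria Coluna6='' Coluna7=''/>
      </Produto>
      <Produto Coluna1='21-3814087' Coluna2='SCA56' Coluna3=''>
        <AnaliseDiaria Coluna6='' Coluna7=''/>
        <AnaliseDiaria Coluna6='19/06/20' Coluna7='01'/>
      </Produto>
    </Linha>
  </Filial>
</LucroCliente>"""


def flatten_rows(root):
    """Return one dict per AnaliseDiaria, merged with its parent Produto attributes."""
    rows = []
    for produto in root.iterfind(".//pace:Produto", NS):
        for analise in produto.iterfind("pace:AnaliseDiaria", NS):
            # Unprefixed attributes carry no namespace, so keys are plain names
            rows.append({**produto.attrib, **analise.attrib})
    return rows


rows = flatten_rows(ET.fromstring(SAMPLE_XML))
print(len(rows))
print(rows[-1])
```

For a file on disk, the root would come from ET.parse("input.xml").getroot(), and pd.DataFrame(rows) would then build the table. Unlike read_xml, this sketch leaves empty attributes as empty strings instead of NaN.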