[英]Attempting to Parse an XLS (XML) File Using Python
我有一个从 Netsuite ERP 下载的“XLS”文件。 文件根目录显示“.XLS”,但它实际上是一个 XML 文件。 我有一个 pandas 脚本,它将组合几个 XLS 或 XLSX 文件,但 pandas 似乎无法处理这种奇怪的 XLS/XML 文件类型,所以我有另一个脚本试图解析 Z3501BB093D363810B67CFED 数据并保存到 XLS 然而,下面的脚本似乎不起作用,因为它导致“无”。 谁能用我的示例代码、新代码或解决这个奇怪的 XLS/XML 解析问题的新方法为我指明正确的方向?
先感谢您!
XML 示例代码:
<?xml version="1.0" encoding="utf-16"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
<Author>NetSuite Reports</Author>
<LastAuthor>NetSuite Reports</LastAuthor>
<Company>NetSuite</Company>
</DocumentProperties>
<Styles>
<Style ss:ID="company">
<Alignment ss:Horizontal="Center" />
<Font ss:Size="12" ss:Bold="1" />
</Style>
<Style ss:ID="subcompany">
<Alignment ss:Horizontal="Center" />
<Font ss:Size="14" ss:Bold="1" />
</Style>
<Style ss:ID="error">
<Alignment ss:Horizontal="Center" />
<Interior ss:Color="#f0d0d0" ss:Pattern="Solid" />
<Font ss:Bold="1" />
</Style>
<Style ss:ID="header_l">
<Alignment ss:Horizontal="Left" />
<Font ss:Size="7" ss:Bold="1" />
<Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
</Style>
<Style ss:ID="header_r">
<Alignment ss:Horizontal="Right" />
<Font ss:Size="7" ss:Bold="1" />
<Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
</Style>
<Style ss:ID="header_c">
<Alignment ss:Horizontal="Center" />
<Font ss:Size="7" ss:Bold="1" />
<Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
</Style>
<Style ss:ID="scheckbox">
<Alignment ss:Vertical="Center" ss:Horizontal="Center" />
</Style>
<Style ss:ID="Default" ss:Name="Normal">
<Alignment ss:Vertical="Bottom" />
<Borders />
<Font ss:FontName="Arial" ss:Size="8" />
<Interior />
<NumberFormat />
<Protection />
</Style>
<Style ss:ID="s53">
<Alignment ss:Vertical="Center" ss:Horizontal="Left" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders>
<Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
</Borders>
</Style>
<Style ss:ID="s52">
<Alignment ss:Horizontal="Left" ss:Indent="1" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
<Borders />
</Style>
<Style ss:ID="s51">
<Alignment ss:Vertical="Center" ss:Horizontal="Right" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
<NumberFormat ss:Format=""€"#,##0.00" />
<Borders />
</Style>
<Style ss:ID="s50">
<Alignment ss:Vertical="Center" ss:Horizontal="Left" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders />
</Style>
<Style ss:ID="s58">
<Alignment ss:Horizontal="Left" ss:Indent="2" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders>
<Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
</Borders>
</Style>
<Style ss:ID="s54">
<Alignment ss:Vertical="Center" ss:Horizontal="Right" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<NumberFormat ss:Format=""€"#,##0.00" />
<Borders>
<Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
</Borders>
</Style>
<Style ss:ID="s59">
<Alignment ss:Horizontal="Left" ss:Indent="1" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders>
<Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
</Borders>
</Style>
<Style ss:ID="s56">
<Alignment ss:Horizontal="Left" ss:Indent="2" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders />
</Style>
<Style ss:ID="s57">
<Alignment ss:Horizontal="Left" ss:Indent="3" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
<Borders />
</Style>
<Style ss:ID="s55">
<Alignment ss:Horizontal="Left" ss:Indent="1" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders />
</Style>
<Style ss:ID="s60">
<Alignment ss:Vertical="Center" ss:Horizontal="Left" />
<Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
<Borders>
<Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
</Borders>
</Style>
</Styles>
<Worksheet ss:Name="TrialBalance">
<Table>
<Row>
<Cell ss:StyleID="company" ss:MergeAcross="1">
<Data ss:Type="String">Parent Company</Data>
</Cell>
</Row>
<Row>
<Cell ss:StyleID="company" ss:MergeAcross="1">
<Data ss:Type="String">Company Holdings Inc. : Company A B.V.</Data>
</Cell>
</Row>
<Row>
<Cell ss:StyleID="subcompany" ss:MergeAcross="1">
<Data ss:Type="String">Trial Balance</Data>
</Cell>
</Row>
<Row>
<Cell ss:StyleID="subcompany" ss:MergeAcross="1">
<Data ss:Type="String">End of Feb 2020</Data>
</Cell>
</Row>
<Row>
<Cell ss:StyleID="subcompany" ss:MergeAcross="1">
<Data ss:Type="String" />
</Cell>
</Row>
<Row>
<Cell ss:StyleID="subcompany" ss:MergeAcross="1">
<Data ss:Type="String" />
</Cell>
</Row>
<Row>
<Cell ss:StyleID="header_l">
<Data ss:Type="String">Account</Data>
</Cell>
<Cell ss:StyleID="header_r" ss:MergeDown="0" ss:Index="2">
<Data ss:Type="String">Total</Data>
</Cell>
</Row>
<Row>
<Cell ss:StyleID="s50">
<Data ss:Type="String">10000 - CASH & CASH EQUIVALENTS</Data>
</Cell>
<Cell ss:StyleID="s51" />
</Row>
<Row>
<Cell ss:StyleID="s52">
<Data ss:Type="String">10101 - Bank - 9999 - Company A - EUR</Data>
</Cell>
<Cell ss:StyleID="s51">
<Data ss:Type="Number">1234567.01</Data>
</Cell>
</Row>
<Row>
<Cell ss:StyleID="s53">
<Data ss:Type="String">Total - 10000 - CASH & CASH EQUIVALENTS</Data>
</Cell>
<Cell ss:Formula="SUM(R[-1]C)" ss:StyleID="s54">
<Data ss:Type="Number">1234567.01</Data>
</Cell>
</Row>
</Table>
</Worksheet>
</Workbook>
Python 代码解析 XML 到 XLS:
import pandas as pd
import xml.etree.cElementTree as ET
tree = ET.parse(r"C:\Users\NAME\Documents\rootfolder\examplefile.xls")
root = tree.getroot()
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = tree
dfcols = ['account', 'total']
df_xml = pd.DataFrame(columns=dfcols)
for node in parsed_xml.getroot():
account = node.attrib.get('Type="String"')
total = node.find('Type="Number"')
df_xml = df_xml.append(
pd.Series([account, getvalueofnode(total)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
Python 解析 XML 文件结果:
account total
0 None None
避免通过附加像 Series 甚至 DataFrames 这样的对象来构建数据框。 相反,构建要绑定到DataFrame
的字典列表。 此外,由于您的 XML 具有默认命名空间,因此您必须分配前缀来解析命名空间下的任何元素
import pandas as pd
import xml.etree.cElementTree as ET
ns = {"doc": "urn:schemas-microsoft-com:office:spreadsheet"}
tree = ET.parse(r"C:\Path\To\Input.xml")
root = tree.getroot()
def getvalueofnode(node):
""" return node text or None """
return node.text if node is not None else None
def main():
""" main """
parsed_xml = tree
data = []
for i, node in enumerate(root.findall('.//doc:Row', ns)):
if i > 6:
data.append({'account': getvalueofnode(node.find('doc:Cell[1]/doc:Data', ns)),
'total': getvalueofnode(node.find('doc:Cell[2]/doc:Data', ns))})
return(pd.DataFrame(data))
output_df = main()
print(output_df)
# account total
# 0 10000 - CASH & CASH EQUIVALENTS None
# 1 10101 - Bank - 9999 - Company A - EUR 1234567.01
# 2 Total - 10000 - CASH & CASH EQUIVALENTS 1234567.01
Alternatively, save your Excel styled XML as xlsx
with Workbook.SaveAs method using win32com
(only for Windows users) and read in with pandas.read_excel
skipping appropriate rows.
import win32com.client
import pandas as pd
# SAVE EXCEL FILE
try:
xlApp = win32com.client.Dispatch("Excel.Application")
xlWbk = xlApp.Workbooks.Open(r"C:\Path\To\Input.xml")
xlWbk.SaveAs(r"C:\Path\To\Output.xlsx", 51)
xlWbk.Close(True)
xlApp.Quit()
except Exception as e:
print(e)
finally:
xlWbk = None; xlApp = None
del xlWbk; del xlApp
# READ EXCEL FILE
output_df = pd.read_excel(r"C:\Path\To\Output.xlsx", skiprows = 6)
print(output_df)
# Account Total
# 0 10000 - CASH & CASH EQUIVALENTS NaN
# 1 10101 - Bank - 9999 - Company A - EUR 1234567.01
# 2 Total - 10000 - CASH & CASH EQUIVALENTS 1234567.01
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.