
Attempting to Parse an XLS (XML) File Using Python

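I have an "XLS" file that is downloaded from Netsuite ERP. The file extension shows ".XLS", but it is actually an XML file. I have a pandas script that combines several XLS or XLSX files, but pandas can't seem to handle this odd XLS/XML file type, so I have another script that attempts to parse the XML data and save it to XLS. However, the script below doesn't seem to work, as it results in "None". Can anyone point me in the right direction with my sample code, new code, or a new way to resolve this odd XLS/XML parsing problem?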

Thank you in advance!

XML sample code:

<?xml version="1.0" encoding="utf-16"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
  <DocumentProperties xmlns="urn:schemas-microsoft-com:office:office">
    <Author>NetSuite Reports</Author>
    <LastAuthor>NetSuite Reports</LastAuthor>
    <Company>NetSuite</Company>
  </DocumentProperties>
  <Styles>
    <Style ss:ID="company">
      <Alignment ss:Horizontal="Center" />
      <Font ss:Size="12" ss:Bold="1" />
    </Style>
    <Style ss:ID="subcompany">
      <Alignment ss:Horizontal="Center" />
      <Font ss:Size="14" ss:Bold="1" />
    </Style>
    <Style ss:ID="error">
      <Alignment ss:Horizontal="Center" />
      <Interior ss:Color="#f0d0d0" ss:Pattern="Solid" />
      <Font ss:Bold="1" />
    </Style>
    <Style ss:ID="header_l">
      <Alignment ss:Horizontal="Left" />
      <Font ss:Size="7" ss:Bold="1" />
      <Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="header_r">
      <Alignment ss:Horizontal="Right" />
      <Font ss:Size="7" ss:Bold="1" />
      <Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="header_c">
      <Alignment ss:Horizontal="Center" />
      <Font ss:Size="7" ss:Bold="1" />
      <Interior ss:Color="#d0d0d0" ss:Pattern="Solid" />
    </Style>
    <Style ss:ID="scheckbox">
      <Alignment ss:Vertical="Center" ss:Horizontal="Center" />
    </Style>
    <Style ss:ID="Default" ss:Name="Normal">
      <Alignment ss:Vertical="Bottom" />
      <Borders />
      <Font ss:FontName="Arial" ss:Size="8" />
      <Interior />
      <NumberFormat />
      <Protection />
    </Style>
    <Style ss:ID="s53">
      <Alignment ss:Vertical="Center" ss:Horizontal="Left" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s52">
      <Alignment ss:Horizontal="Left" ss:Indent="1" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s51">
      <Alignment ss:Vertical="Center" ss:Horizontal="Right" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
      <NumberFormat ss:Format="&quot;€&quot;#,##0.00" />
      <Borders />
    </Style>
    <Style ss:ID="s50">
      <Alignment ss:Vertical="Center" ss:Horizontal="Left" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s58">
      <Alignment ss:Horizontal="Left" ss:Indent="2" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s54">
      <Alignment ss:Vertical="Center" ss:Horizontal="Right" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <NumberFormat ss:Format="&quot;€&quot;#,##0.00" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s59">
      <Alignment ss:Horizontal="Left" ss:Indent="1" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
    <Style ss:ID="s56">
      <Alignment ss:Horizontal="Left" ss:Indent="2" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s57">
      <Alignment ss:Horizontal="Left" ss:Indent="3" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="0" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s55">
      <Alignment ss:Horizontal="Left" ss:Indent="1" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders />
    </Style>
    <Style ss:ID="s60">
      <Alignment ss:Vertical="Center" ss:Horizontal="Left" />
      <Font ss:FontName="Arial" ss:Size="8" ss:Color="#000000" ss:Bold="1" ss:Italic="0" />
      <Borders>
        <Border ss:Position="Top" ss:LineStyle="Dash" ss:Weight="1" ss:Color="#cccccc" />
      </Borders>
    </Style>
  </Styles>
  <Worksheet ss:Name="TrialBalance">
    <Table>
      <Row>
        <Cell ss:StyleID="company" ss:MergeAcross="1">
          <Data ss:Type="String">Parent Company</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="company" ss:MergeAcross="1">
          <Data ss:Type="String">Company Holdings Inc. : Company A  B.V.</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String">Trial Balance</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String">End of Feb 2020</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String" />
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="subcompany" ss:MergeAcross="1">
          <Data ss:Type="String" />
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="header_l">
          <Data ss:Type="String">Account</Data>
        </Cell>
        <Cell ss:StyleID="header_r" ss:MergeDown="0" ss:Index="2">
          <Data ss:Type="String">Total</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="s50">
          <Data ss:Type="String">10000 - CASH &amp; CASH EQUIVALENTS</Data>
        </Cell>
        <Cell ss:StyleID="s51" />
      </Row>
      <Row>
        <Cell ss:StyleID="s52">
          <Data ss:Type="String">10101 - Bank - 9999 - Company A - EUR</Data>
        </Cell>
        <Cell ss:StyleID="s51">
          <Data ss:Type="Number">1234567.01</Data>
        </Cell>
      </Row>
      <Row>
        <Cell ss:StyleID="s53">
          <Data ss:Type="String">Total - 10000 - CASH &amp; CASH EQUIVALENTS</Data>
        </Cell>
        <Cell ss:Formula="SUM(R[-1]C)" ss:StyleID="s54">
          <Data ss:Type="Number">1234567.01</Data>
        </Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>

Python code to parse XML to XLS:

import pandas as pd
import xml.etree.cElementTree as ET

tree = ET.parse(r"C:\Users\NAME\Documents\rootfolder\examplefile.xls")
root = tree.getroot()

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None


def main():
    """ main """
    parsed_xml = tree
    dfcols = ['account', 'total']
    df_xml = pd.DataFrame(columns=dfcols)


for node in parsed_xml.getroot():
    account = node.attrib.get('Type="String"')
    total = node.find('Type="Number"')

    df_xml = df_xml.append(
        pd.Series([account, getvalueofnode(total)], index=dfcols),
        ignore_index=True)

print(df_xml)


main()

Result of the Python XML parsing script:

  account total
0    None  None

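Avoid building a data frame by appending objects such as Series or even DataFrames. Instead, build a list of dictionaries and pass it to the DataFrame constructor. Also, because your XML has a default namespace, you must assign a prefix in order to parse any elements under that namespace.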

import pandas as pd
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

ns = {"doc": "urn:schemas-microsoft-com:office:spreadsheet"}

tree = ET.parse(r"C:\Path\To\Input.xml")
root = tree.getroot()

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None


def main():
    """ build a DataFrame from the rows that follow the header row """
    data = []
    for i, node in enumerate(root.findall('.//doc:Row', ns)):
        # rows 0-5 are report titles and row 6 is the column header; data starts at row 7
        if i > 6:
            data.append({'account': getvalueofnode(node.find('doc:Cell[1]/doc:Data', ns)),
                         'total': getvalueofnode(node.find('doc:Cell[2]/doc:Data', ns))})

    return pd.DataFrame(data)

output_df = main()

print(output_df)
#                                    account       total
# 0          10000 - CASH & CASH EQUIVALENTS        None
# 1    10101 - Bank - 9999 - Company A - EUR  1234567.01
# 2  Total - 10000 - CASH & CASH EQUIVALENTS  1234567.01
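
To illustrate the namespace point, here is a minimal sketch of why an unprefixed lookup comes back as None on this kind of file. The SAMPLE string is a trimmed-down stand-in for the NetSuite export, and the names SAMPLE, NS_URI, unqualified, prefixed, qualified and data_type are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the export (an assumption for illustration only)
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="TrialBalance">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">10101 - Bank</Data></Cell>
        <Cell><Data ss:Type="Number">1234567.01</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

NS_URI = "urn:schemas-microsoft-com:office:spreadsheet"
ns = {"doc": NS_URI}  # "doc" is an arbitrary prefix chosen on the Python side

root = ET.fromstring(SAMPLE)

# ElementTree stores every tag with its namespace URI in braces
print(root.tag)

# A bare tag name does not match a namespaced element, so this is None
unqualified = root.find(".//Data")

# Either map a prefix through the ns dict ...
prefixed = root.find(".//doc:Data", ns)

# ... or spell out the {uri}tag form directly
qualified = root.find(".//{%s}Data" % NS_URI)

# Attributes written as ss:Type are stored under the {uri}Type key as well
data_type = prefixed.get("{%s}Type" % NS_URI)

print(unqualified, prefixed.text, data_type)
```

The prefix used in Python does not have to match any prefix in the file; only the namespace URI has to match.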

Alternatively, save your Excel-styled XML as an .xlsx workbook with the Workbook.SaveAs method using win32com (Windows only, and Excel must be installed), then read it in with pandas.read_excel, skipping the appropriate rows.

import win32com.client
import pandas as pd

# SAVE EXCEL FILE
try:
    xlApp = win32com.client.Dispatch("Excel.Application")
    xlWbk = xlApp.Workbooks.Open(r"C:\Path\To\Input.xml")
    xlWbk.SaveAs(r"C:\Path\To\Output.xlsx", 51)   # 51 = xlOpenXMLWorkbook (.xlsx)

    xlWbk.Close(True)
    xlApp.Quit()

except Exception as e:
    print(e)

finally:
    xlWbk = None; xlApp = None
    del xlWbk; del xlApp

# READ EXCEL FILE
output_df = pd.read_excel(r"C:\Path\To\Output.xlsx", skiprows = 6)

print(output_df)    
#                                    Account       Total
# 0          10000 - CASH & CASH EQUIVALENTS         NaN
# 1    10101 - Bank - 9999 - Company A - EUR  1234567.01
# 2  Total - 10000 - CASH & CASH EQUIVALENTS  1234567.01
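
If Excel is not available, a variation on the first approach can avoid hardcoding how many rows to skip. The sketch below looks for the header row by its first cell ("Account"), honors the ss:Index attribute that SpreadsheetML uses to mark a cell's column position, and converts ss:Type="Number" cells to float. The helper names (cell_value, row_values, records_after_header) and the SAMPLE string are hypothetical and not part of the original answer.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"doc": NS_URI}
SS = "{%s}" % NS_URI  # how ElementTree spells the ss: attribute prefix

# Trimmed-down stand-in for the export (an assumption for illustration only)
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="TrialBalance"><Table>
    <Row><Cell ss:MergeAcross="1"><Data ss:Type="String">Trial Balance</Data></Cell></Row>
    <Row>
      <Cell><Data ss:Type="String">Account</Data></Cell>
      <Cell ss:Index="2"><Data ss:Type="String">Total</Data></Cell>
    </Row>
    <Row>
      <Cell><Data ss:Type="String">10000 - CASH &amp; CASH EQUIVALENTS</Data></Cell>
      <Cell ss:StyleID="s51" />
    </Row>
    <Row>
      <Cell><Data ss:Type="String">10101 - Bank - 9999 - Company A - EUR</Data></Cell>
      <Cell><Data ss:Type="Number">1234567.01</Data></Cell>
    </Row>
  </Table></Worksheet>
</Workbook>"""


def cell_value(cell):
    """Return the cell's value, converting ss:Type="Number" to float."""
    data = cell.find("doc:Data", NS)
    if data is None or data.text is None:
        return None
    if data.get(SS + "Type") == "Number":
        return float(data.text)
    return data.text


def row_values(row):
    """Return one Row as a list, padding columns skipped via ss:Index."""
    values = []
    for cell in row.findall("doc:Cell", NS):
        index = cell.get(SS + "Index")  # 1-based column position, optional
        if index is not None:
            values.extend([None] * (int(index) - 1 - len(values)))
        values.append(cell_value(cell))
    return values


def records_after_header(root, first_header="Account"):
    """Return the rows below the header row as a list of dictionaries."""
    rows = [row_values(r) for r in root.findall(".//doc:Row", NS)]
    for i, values in enumerate(rows):
        if values and values[0] == first_header:
            header = [str(h).lower() for h in values]
            return [dict(zip(header, r + [None] * (len(header) - len(r))))
                    for r in rows[i + 1:]]
    return []  # no header row found


records = records_after_header(ET.fromstring(SAMPLE))
print(records)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame(records), in line with the advice above about building the frame from a list rather than appending to it.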

