Read Excel XML .xls file with pandas

I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available (2013), coalpublic2013.xls, works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the .xls files for the preceding years (2004-2012) do not load. I have looked at these files in Excel; they open fine and are not corrupted.

The error that I get from pandas is:

XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

I have also tried various other things:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

to no avail. My pandas version: 0.17.0

I've also submitted this as a bug to the pandas GitHub issues list.

You can convert this Excel XML file programmatically. Requirement: only Python and pandas.

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
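    def characters(self, content):
        # Accumulate text; SAX may deliver one cell's text in several chunks
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)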

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])

The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, which the Excel reader used by pandas (xlrd) does not seem to support. I would say your best bet is to open it in Excel and save a copy either as a proper Excel file or as a CSV.
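To check which kind of file you actually have before handing it to pandas, you can look at its first bytes. This is only a rough sketch: the function name and the returned labels are made up for illustration, and it only distinguishes the three common containers (legacy binary .xls, zipped .xlsx, and plain XML such as the '<?xml ve' seen in the traceback).

import struct  # not strictly needed; signatures are compared as raw bytes

# Leading bytes of common spreadsheet containers.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (zip archive)
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(path):
    """Guess the on-disk format of a spreadsheet from its first bytes.

    Hypothetical helper for illustration; returns one of
    "xls-binary", "xlsx-zip", "xml" or "unknown".
    """
    with open(path, "rb") as fh:
        head = fh.read(64)
    if head.startswith(OLE2_MAGIC):
        return "xls-binary"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    # Skip an optional UTF-8 byte-order mark and leading whitespace.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"

Given the error message in the question, coalpublic2012.xls would be expected to come back as "xml" and coalpublic2013.xls as "xls-binary", but I have not verified this against the actual files.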

You can convert this Excel XML file programmatically. Requirement: Windows with Office installed.

1. Create the ExcelToCsv.vbs script in Notepad:

if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    WScript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit

2. Convert the Excel XML file to CSV:

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

3. Open the CSV file with pandas:

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe

Here is my update of @jrovegno's approach (which is copied from "Python Cookbook 2nd Edition"), because that code was adding whitespace to my header row and was not generic enough:

import pandas as pd
from xml.sax import ContentHandler, parse

class ExcelXMLHandler(ContentHandler):
    def __init__(self):
        self.tables = []
        self.chars = []

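    def characters(self, content):
        # Text may arrive in several chunks; collect them all
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Data":
            self.chars = []

    def endElement(self, name):
        if name == "Table":
            self.tables.append(self.rows)
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Data":
            self.cells.append("".join(self.chars))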

def xml_to_dfs(path):
    """Read Excel XML file at path and return list of DataFrames"""
    exh = ExcelXMLHandler()
    parse(path, exh)
    return [pd.DataFrame(table[1:], columns=table[0]) for table in exh.tables]

Basically, my XML appears to be structured like this:

<Workbook>
    <Worksheet>
        <Table>
            <Row>
                <Cell>
                    <Data>  # appears redundant with <Cell>
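To see how collecting text at the <Data> level maps onto that nesting, here is a small self-contained sketch that runs on an inline sample instead of the real file. The sample workbook, the class name DataCollector and the function rows_from_excel_xml are all made up for illustration; the in_data flag is my own addition so that whitespace between elements is never picked up.

from xml.sax import ContentHandler, parseString

# Made-up two-row sample in the Excel XML layout (not taken from the EIA file).
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Year</Data></Cell>
        <Cell><Data ss:Type="String">Mine Name</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="Number">2012</Data></Cell>
        <Cell><Data ss:Type="String">Example Mine</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
"""


class DataCollector(ContentHandler):
    """Collect the text of every <Data> element, grouped by <Row> and <Table>."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self.rows = []
        self.cells = []
        self.chars = []
        self.in_data = False

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Data":
            self.chars = []
            self.in_data = True

    def characters(self, content):
        # Only keep text that sits inside <Data>, not inter-element whitespace.
        if self.in_data:
            self.chars.append(content)

    def endElement(self, name):
        if name == "Data":
            self.cells.append("".join(self.chars))
            self.in_data = False
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)


def rows_from_excel_xml(xml_bytes):
    """Return a list of tables, each a list of rows of cell strings."""
    handler = DataCollector()
    parseString(xml_bytes, handler)
    return handler.tables

All values come back as strings, and a <Cell> with no <Data> child contributes nothing to its row, so rows in real files may end up with different lengths. Each table can then be passed to pd.DataFrame(table[1:], columns=table[0]) as in xml_to_dfs.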

@JBWhitmore I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. However, it returns the data exactly as it is stored in the file, so you may have to do extra work to process the data after reading it.

