Read Excel XML .xls file with pandas
I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.
I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.
In contrast, the file for the most recent year available, 2013 (coalpublic2013.xls), works without a problem:
import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")
but the .xls files for the preceding years (2004-2012) do not load. I have looked at these files with Excel; they open and are not corrupted.
The error that I get from pandas is:
---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'
And I have tried various other things:
df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
to no avail. My pandas version: 0.17.0
I've also submitted this as a bug to the pandas GitHub issues list.
You can convert this Excel XML file programmatically. Requirement: only Python and pandas.
import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = []
        self.cells = []
        self.rows = []
        self.tables = []

    def characters(self, content):
        # Collect the text found inside the current cell
        self.chars.append(content)

    def startElement(self, name, atts):
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []

    def endElement(self, name):
        if name == "Cell":
            self.cells.append(''.join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
# Row 3 of the first table holds the column names; the data starts at row 4
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document (the traceback shows the file starting with '<?xml ve'), which xlrd, the reader pandas uses here, does not seem to support. I would say your best bet is to open it in Excel and save a copy as either a proper Excel file or a CSV.
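A quick way to tell such files apart before handing them to pandas is to look at their first bytes. The following is a minimal sketch, not part of pandas or xlrd: the helper name `sniff_spreadsheet_format` is hypothetical, and it only assumes the well-known signatures of the OLE2 container (binary .xls), zip archives (.xlsx), and the `<?xml` declaration seen in the traceback.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of the formats discussed here.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # binary .xls (OLE2 container)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx (a zip archive)


def sniff_spreadsheet_format(path):
    """Guess a spreadsheet's real format from its first bytes.

    Hypothetical helper: returns 'xls', 'xlsx', 'xml', or 'unknown'.
    """
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Skip a UTF-8 byte-order mark and leading whitespace before the XML check.
    if head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<?xml"):
        return "xml"
    return "unknown"


# Demo on a throwaway file that has an .xls name but XML content.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mislabelled.xls")
    with open(path, "wb") as f:
        f.write(b'<?xml version="1.0"?><Workbook/>')
    print(sniff_spreadsheet_format(path))  # xml
```

A result of 'xml' for a file named .xls would match the `Expected BOF record; found '<?xml ve'` error in the question.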
You can convert this Excel XML file programmatically. Requirement: Windows, Office installed.
1. Create the ExcelToCsv.vbs script in Notepad:
if WScript.Arguments.Count < 3 Then
WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
Wscript.Quit
End If
csv_format = 6
Set objFSO = CreateObject("Scripting.FileSystemObject")
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
2. Convert the Excel XML file to CSV:
$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1
3. Read the CSV file with pandas:
>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)
Reference: Faster way to read Excel files to pandas dataframe
Here is my update of @jrovegno's approach (which is copied from "Python Cookbook 2nd Edition"), because that code was adding whitespace to my header row and was not generic enough:
import pandas as pd
from xml.sax import ContentHandler, parse

class ExcelXMLHandler(ContentHandler):
    def __init__(self):
        self.tables = []
        self.chars = []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Data":
            self.chars = []

    def endElement(self, name):
        if name == "Table":
            self.tables.append(self.rows)
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Data":
            self.cells.append("".join(self.chars))

def xml_to_dfs(path):
    """Read Excel XML file at path and return list of DataFrames"""
    exh = ExcelXMLHandler()
    parse(path, exh)
    return [pd.DataFrame(table[1:], columns=table[0]) for table in exh.tables]
Basically, my XML appears to be structured like this:
<Worksheet>
    <Table>
        <Row>
            <Cell>
                <Data>  # appears redundant with <Cell>
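The <Cell>/<Data> split matters in one case that neither handler covers. XML Spreadsheet files may skip empty cells and mark the next one with an `ss:Index` attribute, and a <Cell> may carry no <Data> at all. Whether the EIA files do either has not been checked here. The sketch below uses only the standard library and a made-up inline sample; the class name `IndexAwareHandler` and the sample values are illustrative assumptions.

```python
from xml.sax import ContentHandler, parseString

# Made-up sample: the second row skips column 2 by using ss:Index="3".
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Year</Data></Cell>
    <Cell><Data ss:Type="String">State</Data></Cell>
    <Cell><Data ss:Type="String">Production</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">2012</Data></Cell>
    <Cell ss:Index="3"><Data ss:Type="Number">100</Data></Cell>
   </Row>
  </Table>
 </Worksheet>
</Workbook>
"""


class IndexAwareHandler(ContentHandler):
    """Collect tables as lists of rows, padding cells skipped via ss:Index."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tables = []
        self.rows = []
        self.cells = []
        self.chars = []
        self.in_data = False

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Cell":
            # ss:Index is 1-based; pad with empty strings up to that column.
            index = attrs.get("ss:Index")
            if index is not None:
                while len(self.cells) < int(index) - 1:
                    self.cells.append("")
            self.chars = []
        elif name == "Data":
            self.in_data = True

    def characters(self, content):
        # Only keep text inside <Data>, not whitespace between tags.
        if self.in_data:
            self.chars.append(content)

    def endElement(self, name):
        if name == "Data":
            self.in_data = False
        elif name == "Cell":
            # A <Cell> without <Data> still yields an (empty) value.
            self.cells.append("".join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)


handler = IndexAwareHandler()
parseString(SAMPLE, handler)
rows = handler.tables[0]
print(rows)
```

The resulting rows can be passed to pandas in the same way as above, for example `pd.DataFrame(rows[1:], columns=rows[0])`. Merged cells are not handled in this sketch.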
@JBWhitmore I have run the following code:
import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)
This reads the file successfully without giving any error. However, it returns the data in exactly the format it has in the file, which here means XML markup rather than a parsed table. So you may have to put in extra effort to process the data after reading it.
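One possible form of that extra processing, as a rough sketch rather than a tested recipe for the EIA files: parse the markup with the standard library's `xml.etree.ElementTree`, using the XML Spreadsheet namespace. The function name `first_table_rows` and the sample values are made up for illustration, and skipped cells (`ss:Index`) are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by Excel's XML Spreadsheet format.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Made-up sample; the last cell has no <Data> child.
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row>
    <Cell><Data ss:Type="String">Mine</Data></Cell>
    <Cell><Data ss:Type="String">Tons</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">Example Mine</Data></Cell>
    <Cell><Data ss:Type="Number">10</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="String">Other Mine</Data></Cell>
    <Cell/>
   </Row>
  </Table>
 </Worksheet>
</Workbook>
"""


def first_table_rows(xml_bytes):
    """Return the first worksheet's table as a list of rows of strings."""
    root = ET.fromstring(xml_bytes)
    table = root.find("ss:Worksheet/ss:Table", NS)
    rows = []
    if table is None:
        return rows
    for row in table.findall("ss:Row", NS):
        values = []
        for cell in row.findall("ss:Cell", NS):
            data = cell.find("ss:Data", NS)
            # Cells without <Data>, or with empty <Data>, become "".
            values.append((data.text or "") if data is not None else "")
        rows.append(values)
    return rows


rows = first_table_rows(SAMPLE)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

With pandas available, `pd.DataFrame(body, columns=header)` would turn this into a DataFrame. For the real file, the header row index would need adjusting, as in the answers above.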