Read Excel XML .xls file with pandas
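I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example I provide below.
I am trying to read .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file, which is freely available via the dropdown. Pandas cannot read it.
In contrast, the file for the most recent year available, 2013 (coalpublic2013.xls), works without a problem: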
import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")
But the .xls files for the preceding years (2004-2012) do not load. I have opened these files in Excel; they open fine and are not corrupted.
The error I get from pandas is:
---------------------------------------------------------------------------
XLRDError Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
161
162 if not isinstance(io, ExcelFile):
--> 163 io = ExcelFile(io, engine=engine)
164
165 return io._parse_excel(
/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
204 self.book = xlrd.open_workbook(file_contents=data)
205 else:
--> 206 self.book = xlrd.open_workbook(io)
207 elif engine == 'xlrd' and isinstance(io, xlrd.Book):
208 self.book = io
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
433 formatting_info=formatting_info,
434 on_demand=on_demand,
--> 435 ragged_rows=ragged_rows,
436 )
437 return bk
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
89 t1 = time.clock()
90 bk.load_time_stage_1 = t1 - t0
---> 91 biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
92 if not biff_version:
93 raise XLRDError("Can't determine file's BIFF version")
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
1228 bof_error('Expected BOF record; met end of file')
1229 if opcode not in bofcodes:
-> 1230 bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
1231 length = self.get2bytes()
1232 if length == MY_EOF:
/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
1222 if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
1223 def bof_error(msg):
-> 1224 raise XLRDError('Unsupported format, or corrupt file: ' + msg)
1225 savpos = self._position
1226 opcode = self.get2bytes()
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'
I have also tried various other things:
df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")
to no avail. My pandas version: 0.17.0
I have also submitted this as a bug to the pandas GitHub issues list.
You can convert this Excel XML file programmatically. Requirement: only Python and pandas.
import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = []
        self.cells = []
        self.rows = []
        self.tables = []

    def characters(self, content):
        # Accumulate text found inside the current element
        self.chars.append(content)

    def startElement(self, name, atts):
        # Reset the matching buffer whenever a new container opens
        if name == "Cell":
            self.chars = []
        elif name == "Row":
            self.cells = []
        elif name == "Table":
            self.rows = []

    def endElement(self, name):
        # Push the finished item into its parent container
        if name == "Cell":
            self.cells.append(''.join(self.chars))
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
# Row 3 of the first table holds the column names; the data starts at row 4
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
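The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, which the .xls reader pandas relies on here (xlrd) does not seem to support. I would say your best bet is to open it in Excel and save a copy as either a proper Excel file or a CSV.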
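The difference between the two files can be checked without Excel by looking at the first bytes of each file, which is what the `found '<?xml ve'` part of the error message hints at. The following is a minimal sketch, not something from the answers here: the function name `sniff_spreadsheet_format` and its return labels are hypothetical, and it only distinguishes the binary OLE2 container used by legacy .xls files, the ZIP container used by .xlsx, and plain XML.

```python
# Well-known file signatures ("magic bytes")
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls (OLE2 container)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a ZIP archive
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(path):
    """Guess the real format of a file from its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(OLE2_MAGIC):
        return "xls-binary"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    # Tolerate an optional UTF-8 byte order mark and leading whitespace
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"
```

Under these assumptions, a file reported as "xml" despite its .xls extension would need one of the XML-based approaches in this thread rather than `pd.read_excel`.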
You can convert this Excel XML file programmatically. Requirements: Windows with Office installed.
1. Create the ExcelToCsv.vbs script in Notepad:
if WScript.Arguments.Count < 3 Then
WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
Wscript.Quit
End If
csv_format = 6
Set objFSO = CreateObject("Scripting.FileSystemObject")
src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate
oBook.SaveAs dest_file, csv_format
oBook.Close False
oExcel.Quit
2. Convert the Excel XML file to CSV:
$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1
3. Read the CSV file with pandas:
>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)
Here is my update to @jrovegno's approach (which was copied from the "Python Cookbook 2nd Edition"), because that code added whitespace to my header row and was not generic enough:
import pandas as pd
from xml.sax import ContentHandler, parse

class ExcelXMLHandler(ContentHandler):
    def __init__(self):
        self.tables = []
        self.chars = []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Data":
            # Collect text per <Data>, not per <Cell>, to avoid stray whitespace
            self.chars = []

    def endElement(self, name):
        if name == "Table":
            self.tables.append(self.rows)
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Data":
            self.cells.append("".join(self.chars))

def xml_to_dfs(path):
    """Read Excel XML file at path and return list of DataFrames"""
    exh = ExcelXMLHandler()
    parse(path, exh)
    # The first row of each table is used as the header
    return [pd.DataFrame(table[1:], columns=table[0]) for table in exh.tables]
Basically, my XML is structured like this:
<Worksheet>
    <Table>
        <Row>
            <Cell>
                <Data>  # appears redundant with <Cell>
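The same nesting can also be walked with the standard library's `xml.etree.ElementTree` instead of SAX. The sketch below is an alternative under stated assumptions, not the method from either answer: the `SAMPLE` document and the function name `rows_from_spreadsheetml` are made up for illustration, and it assumes the file uses the SpreadsheetML 2003 namespace `urn:schemas-microsoft-com:office:spreadsheet`. It emits one entry per `<Cell>` while reading text only from `<Data>`, so an empty cell becomes an empty string instead of being skipped, and whitespace between `<Cell>` and `<Data>` does not leak into the values. It does not handle the `ss:Index` attribute that Excel may use to skip columns.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of Excel 2003 XML (SpreadsheetML) documents
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Made-up sample with the Worksheet/Table/Row/Cell/Data nesting
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Year</Data></Cell>
        <Cell><Data ss:Type="String">Mine</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="Number">2012</Data></Cell>
        <Cell/>
      </Row>
    </Table>
  </Worksheet>
</Workbook>
"""


def rows_from_spreadsheetml(xml_text):
    """Return a list of tables; each table is a list of rows of cell strings."""
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iterfind(".//ss:Table", NS):
        rows = []
        for row in table.iterfind("ss:Row", NS):
            cells = []
            for cell in row.iterfind("ss:Cell", NS):
                data = cell.find("ss:Data", NS)
                # A <Cell> without <Data> is kept as an empty string
                has_text = data is not None and data.text
                cells.append(data.text if has_text else "")
            rows.append(cells)
        tables.append(rows)
    return tables


print(rows_from_spreadsheetml(SAMPLE))
# → [[['Year', 'Mine'], ['2012', '']]]
```

Each returned table could then be passed to `pd.DataFrame(table[1:], columns=table[0])`, as in `xml_to_dfs` above, provided the first row really is the header.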
@JBWhitmore I ran the following code:
import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)
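This reads the file successfully without raising any error. However, it returns the data in exactly the format described above, that is, as raw XML text rather than parsed cells. You may therefore need to do some extra work to process the data after reading it.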