
Read Excel XML .xls file with pandas

I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production, specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available, 2013 (coalpublic2013.xls), works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the .xls files for the preceding years (2004-2012) do not load. I have looked at these files with Excel; they open and are not corrupted.

The error that I get from pandas is:

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'
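
The last line of the traceback is the useful clue: xlrd expected a binary BOF record but found the text `<?xml ve`, which is the start of an XML declaration. As a rough sketch, the check below looks at a file's first bytes to see what it really contains, whatever its extension says. The function name `sniff_spreadsheet_format` and the labels it returns are hypothetical; the byte signatures are the standard OLE2 and ZIP magic numbers.

```python
def sniff_spreadsheet_format(path):
    """Guess a spreadsheet's real container format from its first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    # Drop a UTF-8 byte order mark, if present, then any leading whitespace.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    head = head.lstrip()
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"     # legacy binary .xls, the format xlrd expects
    if head.startswith(b"PK\x03\x04"):
        return "zip"      # .xlsx and other zip-based formats
    if head.startswith(b"<?xml"):
        return "xml"      # e.g. Excel 2003 XML saved with an .xls extension
    return "unknown"
```

If the error message is anything to go by, calling this on coalpublic2012.xls would be expected to return "xml", and on coalpublic2013.xls "ole2".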

And I have tried various other things:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

to no avail. My pandas version: 0.17.0

I've also submitted this as a bug to the pandas GitHub issues list.

You can convert this Excel XML file programmatically. Requirement: only Python and pandas.

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

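# Parse the file; the handler collects every <Table> as a list of rows
excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
# Row 3 of the first table is used as the header; rows 4 onward are the data
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])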

The problem is that while the 2013 data is an actual (binary) Excel file, the 2012 data is an XML document saved with an .xls extension, which xlrd, and therefore pandas.read_excel, does not support. I would say your best bet is to open it in Excel and save a copy as either a proper Excel file or a CSV.
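
If opening Excel by hand is not an option, the XML itself can still be parsed with the standard library. The following is a minimal sketch using `xml.etree.ElementTree`, assuming the workbook is Excel 2003 XML (SpreadsheetML) in the usual `urn:schemas-microsoft-com:office:spreadsheet` namespace. The helper name `read_spreadsheetml` and its `header_row` parameter are hypothetical, and the `ss:Index` attribute on rows (as opposed to cells) is not handled.

```python
import xml.etree.ElementTree as ET

import pandas as pd

SS = "urn:schemas-microsoft-com:office:spreadsheet"  # SpreadsheetML namespace


def read_spreadsheetml(source, header_row=0):
    """Return one DataFrame per <Table>; source is a path or a file object."""
    frames = []
    for table in ET.parse(source).getroot().iter(f"{{{SS}}}Table"):
        rows = []
        for row in table.findall(f"{{{SS}}}Row"):
            cells = []
            for cell in row.findall(f"{{{SS}}}Cell"):
                # ss:Index is 1-based; it signals that empty cells were omitted.
                index = cell.get(f"{{{SS}}}Index")
                if index is not None:
                    cells.extend([None] * (int(index) - 1 - len(cells)))
                data = cell.find(f"{{{SS}}}Data")
                cells.append(None if data is None else "".join(data.itertext()))
            rows.append(cells)
        if len(rows) <= header_row:
            continue  # table too short to contain the requested header row
        header, body = rows[header_row], rows[header_row + 1:]
        width = len(header)
        # Pad or trim ragged rows so every row matches the header width.
        body = [(r + [None] * width)[:width] for r in body]
        frames.append(pd.DataFrame(body, columns=header))
    return frames
```

All values come back as strings (or None for empty cells), so numeric columns would still need converting, for example with `pd.to_numeric`. For the coal file, `read_spreadsheetml("coalpublic2012.xls", header_row=3)` would mirror the row offsets used in the SAX-based answer above.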

You can convert this Excel XML file programmatically. Requirement: Windows with Office installed.

1. Create an ExcelToCsv.vbs script in Notepad:

if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit

2. Convert the Excel XML file to CSV:

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

3. Open the CSV file with pandas:

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe

Here is my update of @jrovegno's approach (which is copied from "Python Cookbook 2nd Edition"), because that code was adding whitespace to my header row and was not generic enough:

import pandas as pd
from xml.sax import ContentHandler, parse


class ExcelXMLHandler(ContentHandler):
    def __init__(self):
        self.tables = []
        self.chars = []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Data":
            self.chars = []

    def endElement(self, name):
        if name == "Table":
            self.tables.append(self.rows)
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Data":
            self.cells.append("".join(self.chars))


def xml_to_dfs(path):
    """Read Excel XML file at path and return list of DataFrames"""
    exh = ExcelXMLHandler()
    parse(path, exh)
    return [pd.DataFrame(table[1:], columns=table[0]) for table in exh.tables]

Basically, my XML appears to be structured like this:

<Worksheet>
    <Table>
        <Row>
            <Cell>
                <Data>  # appears redundant with <Cell>
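
On the `<Data>` versus `<Cell>` point: the two are not fully redundant. In this XML format an empty cell can appear as a `<Cell>` with no `<Data>` child, so a handler that appends only on `</Data>` drops such cells and shifts the rest of the row to the left. One possible variant, sketched below under that assumption, records one value per `<Cell>` but collects text only while inside `<Data>`, which also keeps stray whitespace out of the values. The class name `CellAwareHandler` and the inline sample are hypothetical, and cells skipped via the `ss:Index` attribute are still not handled.

```python
from xml.sax import ContentHandler, parseString


class CellAwareHandler(ContentHandler):
    """Collect one value per <Cell>, taking text only from inside <Data>."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self.in_data = False

    def startElement(self, name, attrs):
        if name == "Table":
            self.rows = []
        elif name == "Row":
            self.cells = []
        elif name == "Cell":
            self.chars = []
        elif name == "Data":
            self.in_data = True

    def characters(self, content):
        # Ignore whitespace between elements; keep only text within <Data>.
        if self.in_data:
            self.chars.append(content)

    def endElement(self, name):
        if name == "Data":
            self.in_data = False
        elif name == "Cell":
            self.cells.append("".join(self.chars))  # "" for an empty cell
        elif name == "Row":
            self.rows.append(self.cells)
        elif name == "Table":
            self.tables.append(self.rows)


sample = b"""<Workbook><Worksheet><Table>
<Row><Cell><Data>a</Data></Cell><Cell/><Cell><Data>c</Data></Cell></Row>
</Table></Worksheet></Workbook>"""
handler = CellAwareHandler()
parseString(sample, handler)
print(handler.tables)  # [[['a', '', 'c']]]
```

With the `Data`-based handler above, the same sample row would come out as `['a', 'c']`, losing the position of the empty middle cell.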

@JBWhitmore I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without raising any error. However, it returns the data exactly as it appears in the file, so you may need to do extra work to process the data after reading it.

Related question: read xls file in pandas / python: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

I am trying to open an xls file (with only one tab) into a pandas dataframe.

It is a file that I can normally read in Excel or Excel for the web; in fact, here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu

I noticed that the first two rows have merged cells, and so do some of the columns.

I have tried several methods (from Stack Overflow), all of which failed.

# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)

Error: Excel file format cannot be determined, you must specify an engine manually.

# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)

Error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)

Error: File is not a zip file (converting, rather than renaming, may be an option).

# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)

This method actually produces a result, but the resulting DataFrame bears no relation to the worksheet. Presumably the XML needs to be interpreted (which seems complicated)?

   Style       Name  Table
0    NaN       None    NaN
1    NaN  All funds    NaN

# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)

This method reads the file into a one-column (Series-like) DataFrame. So how can this information be used to create a standard 2D DataFrame with the same shape as the xls file?

0         <Workbook xmlns="urn:schemas-microsoft-com:off...
1         <Styles>
2         <Style ss:ID="Default">
3         <Alignment Horizontal="Left"/>
4         </Style>
...       ...
226532    </Cell>
226533    </Row>
226534    </Table>
226535    </Worksheet>
226536    </Workbook>

# method 6 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)

This returns an empty list [], whereas one might expect at least a list of DataFrames.

So the question is: what is the easiest way to read this file into a dataframe (or a similar usable format)?


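Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.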

Related questions:
Reading an .xls file in python (using pandas read_excel)
Why pandas read_excel not reading correctly xls file?
Read MS Excel XML file to pandas dataframe?
Pandas read_excel (format xls)
Parsing xml-xls file in pandas
CompDocError when importing .xls file format to python using pandas read_excel()
Python: Pandas read_excel cannot open .xls file, xlrd not supported
Read xls file containing xml data in python
Read xml file with pandas
read xls file in pandas / python: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
 