Encoding error when opening an Excel file with xlrd

I am trying to open an Excel file (.xls) using xlrd. This is a summary of the code I am using:

import xlrd
workbook = xlrd.open_workbook('thefile.xls')

This works for most files, but fails for files I get from a specific organization. The error I get when I try to open Excel files from this organization follows.

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data

This looks as if xlrd is trying to open an Excel file encoded in something other than UTF-16. How can I avoid this error? Is the file being written in a flawed way, or is there just a specific character that is causing the problem? If I open and re-save the Excel file, xlrd opens the file without a problem.

I have tried opening the workbook with different encoding overrides, but this doesn't work either.

The file I am trying to open is available here:

https://dl.dropboxusercontent.com/u/6779408/Stackoverflow/AEPUsageHistoryDetail_RequestID_00183816.xls

Issue reported here: https://github.com/python-excel/xlrd/issues/128

What are they using to generate that file?

They are using some Java Excel API (see below), probably on an IBM mainframe or similar.

From the stack trace, the WRITEACCESS information can't be decoded into Unicode because of the @ character (byte 0x40).

For more information on the WRITEACCESS information of the XLS file format, see 5.112 WRITEACCESS or Page 277.

This field contains the username of the user that has saved the file.

import xlrd
xlrd.dump('thefile.xls')  # prints the record dump to stdout; returns None

Running xlrd.dump on the original file gives

   36: 005c WRITEACCESS len = 0070 (112)
   40:      d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40  ????@?????@???@@
   56:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   72:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   88:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  104:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  120:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  136:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@

After re-saving it with Excel, or in my case LibreOffice Calc, the write access information is overwritten with something like

 36: 005c WRITEACCESS len = 0070 (112)
 40:      04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20  ?~~Calc         
 56:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 72:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 88:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
104:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
120:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
136:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

Based on the spaces being encoded as 40, I believe the encoding is EBCDIC, and when we decode d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 as EBCDIC we get Java Excel API.
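The decoding can be checked directly from Python. This is a minimal sketch, assuming Python 3 and assuming the writer used EBCDIC code page 037 (Python's cp037 codec); the bytes are the first 16 of the WRITEACCESS record in the dump above.

```python
# First 16 bytes of the WRITEACCESS record, copied from the xlrd.dump output.
raw = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40")

# cp037 is Python's codec for EBCDIC code page 037 (assumed variant).
text = raw.decode("cp037")

# In EBCDIC, 0x40 is the space character, so the padding decodes to spaces.
print(repr(text))
```

The trailing 0x40 bytes come out as spaces, which is consistent with the field being a space-padded user name.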

So yes, the file is being written in a flawed way. In BIFF8 and higher, the field should be a Unicode string; in BIFF3 to BIFF5, it should be a byte string in the encoding given by the CODEPAGE information, which is

 152: 0042 CODEPAGE len = 0002 (2)
 156:      12 52                                            ?R

1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.

The fact that xlrd tried to use Unicode means that it determined the version of the file to be BIFF8.
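To see why the BIFF8 path fails on this record, the bytes can be read the way a BIFF8 Unicode string is laid out: a 2-byte little-endian character count, one option byte (bit 0x01 for 16-bit characters, 0x04 for phonetic data, 0x08 for rich text), then the characters. The sketch below is a Python 3 reconstruction under that assumption, not a copy of xlrd's unpack_unicode; the record is rebuilt from the dump of the original file.

```python
import struct

# WRITEACCESS payload from the dump: 16 leading bytes, then 0x40 padding to 112.
head = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40")
data = head + b"\x40" * 96

(nchars,) = struct.unpack("<H", data[0:2])  # EBCDIC "Ja" misread as a length
options = data[2]                           # EBCDIC "v" misread as option flags
pos = 3
if options & 0x08:  # rich-text flag: a 2-byte run count would follow
    pos += 2
if options & 0x04:  # phonetic flag: a 4-byte size would follow
    pos += 4

# The claimed length far exceeds the record, so the slice is just the remainder.
raw = data[pos:pos + 2 * nchars]

error = None
if options & 0x01:  # 16-bit characters, decoded as UTF-16-LE
    try:
        raw.decode("utf_16_le")
    except UnicodeDecodeError as exc:
        error = exc
        print(exc)
```

Under these assumptions the remaining slice has an odd number of bytes, so the final 0x40 has no partner and the decoder stops at offset 104 with "truncated data", the same offset as in the traceback. The @ character is therefore only the last dangling byte rather than an undecodable character in itself.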

In this case, you have two options:

  1. Fix the file before opening it with xlrd. You could check by running dump with its output sent to a file rather than standard output and then, if the WRITEACCESS record turns out to be malformed, overwrite the write access information with xlutils.save or another library.

  2. Patch xlrd to handle your special case: in handle_writeaccess, add a try block and set strg to an empty string when unpack_unicode fails.

The following snippet以下片段

# Patched method in xlrd/book.py
def handle_writeaccess(self, data):
    DEBUG = 0
    if self.biff_version < 80:
        if not self.encoding:
            self.raw_user_name = True
            self.user_name = data
            return
        strg = unpack_string(data, 0, self.encoding, lenlen=1)
    else:
        try:
            strg = unpack_unicode(data, 0, lenlen=2)
        except UnicodeDecodeError:
            # Malformed WRITEACCESS record: fall back to an empty user name.
            strg = ""
    if DEBUG: fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n", len(data), self.raw_user_name, strg)
    strg = strg.rstrip()
    self.user_name = strg

with

workbook = xlrd.open_workbook('thefile.xls', encoding_override="cp1252")

seems to open the file successfully.
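If editing the installed package is not practical, the same fallback might be applied at runtime by wrapping the method instead. This is a sketch only: tolerate_bad_writeaccess is a hypothetical helper name, and it assumes the class is xlrd.book.Book with a handle_writeaccess(self, data) method, as the traceback and the snippet above suggest.

```python
import functools


def tolerate_bad_writeaccess(original):
    """Wrap a handle_writeaccess-style method so decode failures are ignored."""
    @functools.wraps(original)
    def wrapper(self, data):
        try:
            return original(self, data)
        except UnicodeDecodeError:
            # Malformed WRITEACCESS record: continue with an empty user name.
            self.user_name = ""
    return wrapper


try:
    from xlrd import book  # patched only when xlrd is installed
except ImportError:
    book = None

if book is not None:
    # Assumption: Book.handle_writeaccess is looked up on the class at call time.
    book.Book.handle_writeaccess = tolerate_bad_writeaccess(
        book.Book.handle_writeaccess
    )
```

The wrapper has to be installed before xlrd.open_workbook is called, and the encoding_override argument is still needed as before. A runtime patch like this survives reinstalling xlrd, but it depends on internal names that may change between versions.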

Without the encoding override, it complains: ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010
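The number 21010 can be traced back to the CODEPAGE record in the dump, whose payload is the two bytes 12 52. The sketch below assumes Python 3 and assumes the record holds a little-endian 16-bit integer, as BIFF integers generally do.

```python
import struct

# Payload of the CODEPAGE record as shown in the dump.
codepage_bytes = bytes.fromhex("12 52")

# Read as a little-endian unsigned 16-bit integer.
(as_written,) = struct.unpack("<H", codepage_bytes)
print(as_written)  # 21010

# For comparison, code page 1252 stored as a little-endian 16-bit integer.
expected = struct.pack("<H", 1252)
print(expected.hex())  # e404
```

So the writer appears to have stored the digits 1252 as raw hex bytes rather than the integer 1252, which would explain why xlrd reports an unknown code page 21010 unless encoding_override is given.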

This worked for me.

import xlrd

my_xls = xlrd.open_workbook('//myshareddrive/something/test.xls',encoding_override="gb2312")
