Encoding error when opening an Excel file with xlrd

Question

I am trying to open an Excel file (.xls) using xlrd. This is a summary of the code I am using:

import xlrd
workbook = xlrd.open_workbook('thefile.xls')

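This works for most files, but not for files I get from one specific organization. The error I get when I try to open Excel files from that organization is as follows.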

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data

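It looks as if xlrd is trying to open an Excel file that is encoded in something other than UTF-16. How can I avoid this error? Is the file written in a flawed way, or is it just a specific character that causes the problem? If I open and re-save the Excel file, xlrd opens it without a problem.

I have tried opening the workbook with different encoding overrides, but that does not work either.

The file I am trying to open is available here: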

https://dl.dropboxusercontent.com/u/6779408/Stackoverflow/AEPUsageHistoryDetail_RequestID_00183816.xls

Issue reported here: https://github.com/python-excel/xlrd/issues/128

Answer 1

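What are they using to generate that file?

They are using some Java Excel API (see below), possibly on an IBM mainframe or similar.

From the stack trace, the write access information cannot be decoded as Unicode because of the @ characters.

For more information on the write access information in the XLS file format, see 5.112 WRITEACCESS or page 277.

This field contains the username of the user who saved the file.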

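import xlrd
# xlrd.dump prints a hex dump of the file's BIFF records to stdout and returns None
xlrd.dump('thefile.xls')

Running xlrd.dump on the original file gives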

   36: 005c WRITEACCESS len = 0070 (112)
   40:      d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40  ????@?????@???@@
   56:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   72:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   88:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  104:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  120:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  136:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@

After re-saving with Excel, or in my case LibreOffice Calc, the write access information is overwritten with something like

 36: 005c WRITEACCESS len = 0070 (112)
 40:      04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20  ?~~Calc         
 56:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 72:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 88:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
104:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
120:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
136:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

Because the spaces are encoded as 40, I believe the encoding is EBCDIC, and when we decode d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 as EBCDIC we get "Java Excel API".
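The decoding step can be reproduced with Python's standard codecs. This is a minimal sketch that assumes the EBCDIC variant is code page 037 (Python's `cp037` codec); the bytes are copied from the first row of the WRITEACCESS dump above.

```python
# First 16 payload bytes of the WRITEACCESS record, copied from the dump.
raw = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40")

# Assumption: the EBCDIC variant is code page 037, which Python calls "cp037".
decoded = raw.decode("cp037")
print(repr(decoded))  # 'Java Excel API  '

# The same byte 0x40 is a space in EBCDIC but "@" in ASCII,
# which is why the hex dump shows rows of "@" characters.
space_in_ebcdic = b"\x40".decode("cp037")
at_in_ascii = b"\x40".decode("ascii")
```

The trailing 0x40 bytes are padding, so stripping whitespace from the decoded string leaves just the name of the writing library.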

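So yes, the file is written in a flawed way. In BIFF8 and later this field should be a unicode string, and in BIFF3 to BIFF5 it should be a byte string in the encoding given in the CODEPAGE information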

 152: 0042 CODEPAGE len = 0002 (2)
 156:      12 52                                            ?R

1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.
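One detail worth checking is how the two CODEPAGE bytes are actually read. BIFF stores integers in little-endian order, so the bytes 12 52 are not read as the number 1252. The sketch below only uses the standard `struct` module; that the writer meant 1252 is a guess, not something the file confirms.

```python
import struct

# CODEPAGE payload bytes, copied from the dump.
payload = bytes.fromhex("12 52")

# Read as an unsigned 16-bit little-endian integer, as BIFF fields are.
(codepage,) = struct.unpack("<H", payload)
print(codepage)       # 21010
print(hex(codepage))  # 0x5212

# For comparison, a little-endian encoding of 1252 would be the bytes e4 04.
expected_for_1252 = struct.pack("<H", 1252)
```

The value 21010 is the same unknown codepage number that appears in the xlrd error message when no encoding override is given.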

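The fact that xlrd tries to use unicode means that it determined the file version to be BIFF8.

In this case, you have two options

1. Fix the file before opening it with xlrd. You could use dump to check the file for non-standard output and, if that is the case, overwrite the write access information with xlutils.save or another library.
2. Patch xlrd to handle your special case: in handle_writeaccess, add a try block and set strg to an empty string when unpack_unicode fails.

The following snippet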

    def handle_writeaccess(self, data):
        DEBUG = 0
        if self.biff_version < 80:
            if not self.encoding:
                # No encoding known yet: keep the raw bytes as the user name.
                self.raw_user_name = True
                self.user_name = data
                return
            strg = unpack_string(data, 0, self.encoding, lenlen=1)
        else:
            try:
                strg = unpack_unicode(data, 0, lenlen=2)
            except UnicodeDecodeError:
                # Malformed WRITEACCESS record: fall back to an empty user name.
                strg = ""
        if DEBUG: fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n", len(data), self.raw_user_name, strg)
        strg = strg.rstrip()
        self.user_name = strg

and

workbook=xlrd.open_workbook('thefile.xls',encoding_override="cp1252")

seems to open the file successfully.

Without the encoding override it complains: ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010
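To see why the patched branch is needed, the fallback can be tried outside xlrd. The following is a simplified, standalone sketch of the BIFF8 string layout (2-byte length, 1-byte flags, optional rich-text and phonetic fields). `decode_write_access` is a hypothetical helper written for illustration, not xlrd's own code, and the two payloads are rebuilt from the dumps above.

```python
import struct

def decode_write_access(data: bytes) -> str:
    """Decode a BIFF8 WRITEACCESS payload; return "" if it is malformed."""
    try:
        # 2-byte little-endian character count, then a 1-byte flags field.
        (nchars,) = struct.unpack_from("<H", data, 0)
        options = data[2]
        pos = 3
        if options & 0x08:  # rich-text run count present (2 bytes)
            pos += 2
        if options & 0x04:  # phonetic block size present (4 bytes)
            pos += 4
        if options & 0x01:  # uncompressed: 2 bytes per character
            text = data[pos:pos + 2 * nchars].decode("utf_16_le")
        else:               # compressed: 1 byte per character
            text = data[pos:pos + nchars].decode("latin_1")
    except (UnicodeDecodeError, struct.error, IndexError):
        return ""
    return text.rstrip()

# 112-byte payload as written by the Java library (EBCDIC, padded with 0x40).
broken = bytes.fromhex("d181a58140c5a7838593" "40c1d7c9") + b"\x40" * 98

# 112-byte payload after re-saving with LibreOffice Calc.
resaved = bytes.fromhex("04000043616c63") + b" " * 105

print(repr(decode_write_access(broken)))   # ''
print(repr(decode_write_access(resaved)))  # 'Calc'
```

With the EBCDIC payload, the first bytes are misread as a huge character count with the UTF-16 flag set, so decoding runs off the end of the record and fails with "truncated data", as in the original traceback. Returning an empty user name loses only metadata, not sheet contents.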

Answer 2

This worked for me.

import xlrd

my_xls = xlrd.open_workbook('//myshareddrive/something/test.xls',encoding_override="gb2312")

Answer 1
Score: 12 (accepted), 2015-02-07 11:18:37

Answer 2
Score: 0, 2019-11-15 21:32:17