
Reading unicode from xls in Python

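I am trying to read an .xls file with Python. The file contains several non-ASCII characters (namely äöü). I have tried both openpyxl and xlrd (I had high hopes for xlrd, since it reads everything as unicode anyway), but neither worked.

I found several answers that deal with encoding/decoding when printing information from an xls, but I can't even get that far. This script fails as soon as it tries to read the file: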

import xlrd
workbook = xlrd.open_workbook('export_data.xls')

which results in:

Traceback (most recent call last):
  File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module>
    workbook = xlrd.open_workbook('export_data.xls')
  File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
    bk.get_sheets()
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets
    self.get_sheet(sheetno)
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet
    sh.read(self)
  File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
  File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
    return unicode(data[pos:pos+nchars], encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 55: ordinal not in range(128)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'

I have also tried:

workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8")

which results in:

Traceback (most recent call last):
  File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module>
    workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8")
  File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
    bk.get_sheets()
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets
    self.get_sheet(sheetno)
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet
    sh.read(self)
  File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
  File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
    return unicode(data[pos:pos+nchars], encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 55: invalid start byte
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
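
Both tracebacks above fail on the same byte, 0x92. A minimal sketch of why, using only the standard library: the byte is outside the ASCII range and cannot start a UTF-8 sequence, but it is valid in a single-byte Windows code page. The helper name try_decode and the choice of cp1252 are illustrative assumptions, not something taken from the question.

```python
# The byte reported in both tracebacks.
raw = b"\x92"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0x92 is above 0x7F, so it is not an ASCII character.
ascii_result = try_decode(raw, "ascii")

# 0x92 has the bit pattern 10xxxxxx, which UTF-8 reserves for
# continuation bytes; it can never begin a character.
utf8_result = try_decode(raw, "utf-8")

# Assumption for illustration: in the Windows Western European code page
# the same byte is a right single quotation mark (U+2019).
cp1252_result = try_decode(raw, "cp1252")

print(ascii_result, utf8_result, repr(cp1252_result))
```

This suggests the strings in the file are stored in some single-byte code page, so the override has to name that code page rather than utf-8.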

I have also tried putting various versions of this at the top of the script:

# -*- coding: utf-8 -*-

I am running this with Python 2.7 on a Windows Server 2008 machine.

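Thank you all for the feedback!

I ended up fixing it with the encoding_override parameter. I couldn't find any Microsoft documentation on which code page (cp) corresponds to German characters, so I tried them all. Eventually I got to cp1251, and it worked!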

workbook = xlrd.open_workbook(path, encoding_override="cp1251")
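
When trying code pages one by one, a candidate that merely decodes without raising is not necessarily the right one, so it may be worth checking the decoded text as well. The sketch below uses only the standard library and assumes, purely for illustration, that the German text was written in the Windows Western European code page (cp1252); that is a guess, not something known about the file in the question.

```python
# Sample German characters, written with escapes so the source stays ASCII.
german = u"\u00e4\u00f6\u00fc"  # a-umlaut, o-umlaut, u-umlaut

# Assumption: the bytes on disk were produced with cp1252.
raw = german.encode("cp1252")

candidates = ["cp1251", "cp1252", "latin_1"]

# Every one of these single-byte code pages decodes the bytes without error...
decoded = {name: raw.decode(name) for name in candidates}

# ...but only some of them give back the original characters.
matches = [name for name in candidates if decoded[name] == german]

for name in candidates:
    print(name, repr(decoded[name]))
print("round-trips correctly:", matches)
```

Under that assumption cp1251 decodes the three bytes to Cyrillic letters, while cp1252 and latin_1 return the umlauts. The two differ for byte 0x92, which cp1252 maps to a right single quotation mark and latin_1 maps to a control character.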

From my reading of the OOo documentation, xls uses the utf_16_le flavour of unicode rather than utf8 (i.e. each character is stored in two bytes, little-endian). Try:

workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf_16_le")

(see http://www.openoffice.org/sc/excelfileformat.pdf, page 17)
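
As a rough illustration of what "two bytes per character, little-endian" means, here is a standard-library sketch. It only shows the byte layout of utf_16_le and what happens when single-byte data is decoded with it; whether this override suits a particular file is not something the sketch can tell. The sample strings are assumptions chosen for the example.

```python
text = u"\u00e4\u00f6\u00fc"  # the German sample characters

# utf_16_le stores each of these characters as two bytes, low byte first.
utf16 = text.encode("utf_16_le")
print(repr(utf16))

# Data stored with one byte per character (assumed cp1252 here) has an odd
# length for this sample, so decoding it as utf_16_le fails outright.
single_byte = text.encode("cp1252")
try:
    mis_decoded = single_byte.decode("utf_16_le")
except UnicodeDecodeError:
    mis_decoded = None

# With an even number of bytes the decode succeeds but pairs up unrelated
# bytes, yielding one wrong character instead of two correct ones.
paired = b"ab".decode("utf_16_le")
print(repr(mis_decoded), repr(paired))
```

So if the strings in the file are actually stored one byte per character, this override would either raise or produce unrelated characters.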

A bit late, but I would suggest trying unicodecsv to handle the encoding.
