
Reading Unicode from xls in Python

I'm trying to read an .xls file with Python. The file contains several non-ASCII characters (namely ä, ö, ü). I've tried both openpyxl and xlrd (I had high hopes for xlrd, since it supposedly reads everything in as Unicode anyway), and neither works.

I've found multiple answers dealing with encoding/decoding when printing information from the xls, but I can't even get that far. This script errors out as soon as it tries to open the file:

import xlrd
workbook = xlrd.open_workbook('export_data.xls')

Resulting in:

Traceback (most recent call last):
  File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module>
    workbook = xlrd.open_workbook('export_data.xls')
  File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
    bk.get_sheets()
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets
    self.get_sheet(sheetno)
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet
    sh.read(self)
  File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
  File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
    return unicode(data[pos:pos+nchars], encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 55: ordinal not in range(128)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
*** No CODEPAGE record, no encoding_override: will use 'ascii'
*** No CODEPAGE record, no encoding_override: will use 'ascii'

I've also tried:

workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8")

resulting in:

Traceback (most recent call last):
  File "C:\Users\Administrator\workspace\tufinderxlstoxml\tufinderxlstoxml2.py", line 2, in <module>
    workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf-8")
  File "C:\Python27_32\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 119, in open_workbook_xls
    bk.get_sheets()
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 705, in get_sheets
    self.get_sheet(sheetno)
  File "C:\Python27_32\lib\site-packages\xlrd\book.py", line 696, in get_sheet
    sh.read(self)
  File "C:\Python27_32\lib\site-packages\xlrd\sheet.py", line 796, in read
    strg = unpack_string(data, 6, bk.encoding or bk.derive_encoding(), lenlen=2)
  File "C:\Python27_32\lib\site-packages\xlrd\biffh.py", line 269, in unpack_string
    return unicode(data[pos:pos+nchars], encoding)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 55: invalid start byte
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

I've also tried adding various versions of this at the top of the script:

# -*- coding: utf-8 -*-

I'm running this on Python 2.7 on a Windows Server 2008 machine.
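
Both tracebacks above stop on the same byte, 0x92, so the failure can be reproduced without the workbook. The sketch below uses a made-up byte string (`sample` and `try_decode` are hypothetical names, not part of xlrd) and assumes the file stores its text in a single-byte Windows code page; cp1252 is only one candidate. Note also that the `# -*- coding: utf-8 -*-` line only declares the encoding of the script's own source text and has no effect on how data files are decoded.

```python
# Hypothetical sample containing 0x92, the byte named in both tracebacks.
sample = b"it\x92s"


def try_decode(data, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ascii rejects any byte >= 0x80; utf-8 rejects 0x92 as a start byte;
# cp1252 (assumed here) maps 0x92 to a right single quotation mark.
results = {name: try_decode(sample, name) for name in ("ascii", "utf-8", "cp1252")}

for name in ("ascii", "utf-8", "cp1252"):
    print("%s -> %r" % (name, results[name]))
```

If a single-byte code page decodes the sample while ascii and utf-8 do not, that points toward passing a code page name as encoding_override rather than "utf-8".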

Thanks all for the feedback!

I did eventually get it fixed using the encoding_override argument. I wasn't able to find Microsoft documentation on which code page corresponds to German characters, so I tried them all. Eventually I got to cp1251 and it worked!

workbook = xlrd.open_workbook(path, encoding_override="cp1251")
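
One thing that may be worth checking: cp1251 is the Windows Cyrillic code page, while Western European text such as German is normally cp1252. Both map byte 0x92 to the same character, which could be why cp1251 made the error go away. The sketch below compares the two on a made-up byte string (`raw` is a hypothetical sample, and it assumes the umlauts are stored as single bytes); it says nothing about what the actual file contains.

```python
# Hypothetical bytes: the cp1252 encodings of ä, ö, ü, plus the 0x92 byte
# from the traceback.
raw = b"\xe4\xf6\xfc\x92"

as_cp1252 = raw.decode("cp1252")  # Western European code page
as_cp1251 = raw.decode("cp1251")  # Cyrillic code page

# Both calls succeed, but only the last character (from 0x92) agrees.
print(repr(as_cp1252))
print(repr(as_cp1251))
print(as_cp1252[-1] == as_cp1251[-1])
```

If the cells come back with Cyrillic letters where umlauts were expected, encoding_override="cp1252" would be the next thing to try.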

From my reading of the OOo docs, xls uses the utf_16_le flavour of Unicode, not utf8 (that is, it uses exactly two bytes per character, stored little-endian), so try:

workbook = xlrd.open_workbook('export_data.xls', encoding_override="utf_16_le")

(see page 17 of http://www.openoffice.org/sc/excelfileformat.pdf )
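
To make the "two bytes per character" point concrete, here is a small stdlib-only sketch (the variable names are illustrative). It also shows what happens if single-byte text is decoded as utf_16_le by mistake, which is a possibility here since the traceback goes through unpack_string with a code page rather than a UTF-16 path; treat that reading of the traceback as an assumption.

```python
text = u"\xe4\xf6\xfc"  # the characters ä, ö, ü

utf16 = text.encode("utf_16_le")      # two bytes per character
single_byte = text.encode("cp1252")   # one byte per character

print(len(utf16), len(single_byte))

# Round trip: utf_16_le bytes decode back to the original text.
round_trip = utf16.decode("utf_16_le")

# Decoding plain single-byte data as utf_16_le pairs the bytes up and
# yields unrelated characters instead of raising an error.
misread = b"abcd".decode("utf_16_le")
print(repr(misread))
```

So an encoding_override of utf_16_le would only help if the strings in the file really are stored two bytes per character.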

A bit late, but I'd suggest trying unicodecsv to handle the encoding.
