Downsides of using encode('utf-8') when reading strings from Excel in Python

Question

I am reading a large amount of data from an Excel spreadsheet, reading from it (and reformatting and rewriting) with the following general structure:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
# x and y are column indexes, defined elsewhere
for z in range(1, sheettwo.nrows):  # rows 1..nrows-1; row 0 is skipped
     toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
     out.write(toprint)
     out.write("\n")
out.close()

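Here x and y are arbitrary cells, with x being less arbitrary in that it contains UTF-8 characters.

So far I have only been using .encode('utf-8') on cells where I know there would otherwise be an error, or where I foresee an error without it.

My question is basically this: is there a disadvantage to using .encode('utf-8') on ALL of the cells, even where it is unnecessary? Efficiency is not an issue. The main concern is that it works even if there is a UTF-8 character in a place where there shouldn't be one. If no errors would occur when I just lump ".encode('utf-8')" onto every cell read, I will probably end up doing that.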

Answer 1

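The xlrd documentation clearly states: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, they contain Unicode code points anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert it to ASCII (which is what you do with the str() function). Use the code below: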

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded in UTF-8
out = open('output.file', 'w')
for z in range(1, sheettwo.nrows):  # rows 1..nrows-1; row 0 is skipped
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()
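
For anyone on Python 3, here is a rough sketch of the same idea: keep everything as text and let the file object do the UTF-8 encoding once, at the point of writing. This is not the code from the answer above. It assumes Python 3 (where str is already Unicode), and cell_to_text, format_row, write_rows and the sample values are made-up names standing in for real sheet data.

```python
import os
import tempfile


def cell_to_text(value):
    """Return a cell value as text; values that are already text pass through."""
    return value if isinstance(value, str) else str(value)


def format_row(first, second):
    """Build one output line purely from text, with no bytes involved."""
    return "".join([
        "important stuff -> ", cell_to_text(first),
        " more formatting! ", cell_to_text(second),
        " and done\n",
    ])


def write_rows(path, rows):
    """Write rows as UTF-8; the file object encodes on write."""
    # newline="" keeps "\n" as-is on every platform
    with open(path, "w", encoding="utf-8", newline="") as out:
        for first, second in rows:
            out.write(format_row(first, second))


# Hypothetical cell values: a number next to text, then two text cells.
sample_rows = [(42.0, "caf\u00e9"), ("plain", "\u0400")]
sample_path = os.path.join(tempfile.mkdtemp(), "output.file")
write_rows(sample_path, sample_rows)

# Read the raw bytes back to confirm what was written.
with open(sample_path, "rb") as check:
    written = check.read()
```

One thing the sketch makes visible: a float such as 42.0 has no encode method, so if numeric cells come back as floats (as they normally do from xlrd), tacking .encode('utf-8') onto every cell value would raise AttributeError on those cells. Converting to text first and encoding once at the end avoids that.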

Answer 2

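This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar increases the chance that people will read your code. Try wrapping your lines, for example: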

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

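(2) Presumably the use of unicode() is to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats: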

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
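
For comparison, a small sketch of the same precision point on Python 3. It assumes Python 3.2 or later, where str() of a float already gives the same text as repr(), so a helper like full_precision should not be needed there. The '%.12g' format is used only to imitate the 12-digit output shown above, and the variable names are illustrative.

```python
value = 123456.78901234567

# Imitate the 12-significant-digit text that Python 2's str()/unicode() gave.
twelve_digits = "%.12g" % value

# On Python 3, str() of a float produces text that converts back to the same float.
full_text = str(value)

# Compare what each text turns back into.
lost_precision = float(twelve_digits) != value
round_trips = float(full_text) == value
```

Here lost_precision and round_trips both come out True: the 12-digit text no longer converts back to the original float, while the full text does.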

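(3) xlrd builds Cell objects on the fly when they are requested, so going through cell() does extra work compared with asking for the value directly: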

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster

Answer 1
4, accepted, 2011-10-13 03:17:25

Answer 2
0, 2011-10-30 08:51:25
