Downsides of using encode('utf-8') on strings read from Excel in Python

Question

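I'm reading a large amount of data from an Excel spreadsheet. The script reads from the spreadsheet (and reformats and rewrites the data) using the following general structure: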

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
# x and y are column indices defined elsewhere; row 0 is skipped
for z in range(1, sheettwo.nrows):
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")
out.close()

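Here x and y are arbitrary cells, with x being the less arbitrary one and containing UTF-8 characters.

So far I've only been using .encode('utf-8') on cells where I know there will be errors otherwise, or where I foresee errors without it.

My question is basically this: is there a disadvantage to using .encode('utf-8') on ALL of the cells, even where it's unnecessary? Efficiency is not an issue. What matters is that the script keeps working even if a UTF-8 character turns up in a place where there shouldn't be one. If no errors would occur from simply lumping ".encode('utf-8')" onto every cell read, I'll probably end up doing that.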

Answer 1

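The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, the text cells contain Unicode code points anyway. You should therefore keep the contents of these cells as Unicode inside Python and not convert them to ASCII, which is what the str() function does in Python 2 (and why it fails on non-ASCII text). Encode once, at the point where you write the output. Use the code below: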

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
# x and y are the column indices from the question; row 0 is skipped
for z in range(1, sheettwo.nrows):
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()
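
As a side note on the original question, one likely downside of calling .encode('utf-8') on every cell is that number cells come back as floats, and a float has no .encode() method. The sketch below is written for Python 3 (the code above is Python 2) and does not use xlrd at all: the `cells` list is a made-up stand-in for cell values, and `cell_to_text` is a hypothetical helper name. It converts each value to text and lets the file object do the encoding once, on write.

```python
import io
import os
import tempfile


def cell_to_text(value):
    """Return a cell value as text (hypothetical helper, not part of xlrd)."""
    # Text is passed through unchanged; floats and other types go through str()
    return value if isinstance(value, str) else str(value)


# Made-up stand-ins for cell values: text, a number, an empty cell
cells = [u"caf\u00e9", 12.5, u""]

line = u" | ".join(cell_to_text(v) for v in cells) + u"\n"

path = os.path.join(tempfile.mkdtemp(), "output.file")
# The file object encodes on write, so no per-cell .encode('utf-8') is needed
with io.open(path, "w", encoding="utf-8") as out:
    out.write(line)

# Read the raw bytes back to confirm they are UTF-8
with io.open(path, "rb") as check:
    raw = check.read()

print(raw.decode("utf-8"))
```

The design choice is the same as in the answer above: keep everything as text inside the program and encode in exactly one place, at the output boundary.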

Answer 2

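This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar improves the chance that people will read your code. Try wrapping your lines, for example: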

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

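(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats: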

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
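
For readers on Python 3, where unicode() no longer exists, the same precision issue can be illustrated as follows. This is only a sketch: `py2_style_str` is a hypothetical name, and its "%.12g" format merely approximates the 12-digit behaviour described above. As far as I know, str() on a float in current Python 3 already uses the same shortest round-tripping form as repr(), so the truncation mainly matters for Python 2 code.

```python
def py2_style_str(x):
    """Approximate the 12-significant-digit float formatting described above."""
    return "%.12g" % x if isinstance(x, float) else str(x)


def full_precision(x):
    """Text form that round-trips floats; other values go through str()."""
    return repr(x) if isinstance(x, float) else str(x)


value = 123456.78901234567
short = py2_style_str(value)
full = full_precision(value)

print(short)                  # 12 significant digits
print(float(short) == value)  # False: digits were lost
print(float(full) == value)   # True: repr() round-trips
```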

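(3) xlrd builds Cell objects on the fly when demanded, so asking for the value directly avoids creating an intermediate object: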

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster

Answer 1: score 4, accepted, 2011-10-13 03:17:25

Answer 2: score 0, 2011-10-30 08:51:25
