Downsides to reading strings from Excel in python using encode('utf-8')

I am reading a large amount of data from an Excel spreadsheet, then reformatting and rewriting it, using the following general structure:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
     z = i + 1
     toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
     out.write(toprint)
     out.write("\n")

where x and y are arbitrary columns in this case, with x being less arbitrary and containing non-ASCII (UTF-8) characters.

So far I have only been using .encode('utf-8') on cells where I know there will be errors otherwise, or where I foresee an error without it.

My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells even if it is unnecessary? Efficiency is not an issue. The main issue is that it works even if there is a UTF-8 character in a place where there shouldn't be one. If no errors would occur when I lump ".encode('utf-8')" onto every cell read, I will probably end up doing that.

The XLRD Documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, the cells contain Unicode code points anyway. You should therefore keep the content of these cells as Unicode within Python and not convert it to ASCII (which is what the str() function does). Use the code below:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Build Unicode strings and encode to UTF-8 only when writing
out = open('output.file', 'w')
# Rows 1 .. nrows-1; a row index equal to nrows would raise IndexError
for z in range(1, sheettwo.nrows):
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()
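
The same idea (keep text as Unicode, encode only at the point of writing) can be sketched for Python 3, where str is already Unicode. This is a minimal sketch and not the answerer's code: format_row and the sample rows list are hypothetical stand-ins for values read from a sheet, assumed here to be either str or float. An in-memory buffer is used so the example runs without a spreadsheet; a real script would more likely use open('output.file', 'w', encoding='utf-8').

```python
import io

def format_row(first, second):
    # str() works on both text and numbers in Python 3; no .encode() here.
    return ("formatting of the data -> " + str(first)
            + " more formatting! " + str(second) + " and done\n")

# Hypothetical stand-in for cell values read from a sheet.
rows = [(1.5, "caf\u00e9"), (2.0, "\u0416")]

buffer = io.BytesIO()
# The wrapper encodes to UTF-8 at the output boundary.
out = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")
for first, second in rows:
    out.write(format_row(first, second))
out.flush()
data = buffer.getvalue()  # UTF-8 encoded bytes
```

This also points at one concrete downside of calling .encode('utf-8') on every cell: numeric cells come back as floats, and a float has no encode method, so the call would raise AttributeError on them.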

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a bother, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
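
For readers on Python 3, here is a rough equivalent as a sketch. It assumes Python 3, where str() on a float already keeps the digits needed to round-trip, so the 12-digit truncation shown above is reproduced with the "%.12g" format for comparison only. The variable names are illustrative.

```python
def full_precision(value):
    # repr() keeps enough digits to round-trip a float exactly.
    return repr(value) if isinstance(value, float) else str(value)

x = 123456.78901234567
twelve_digits = "%.12g" % x      # roughly what Python 2's str()/unicode() produced
round_trip = full_precision(x)   # parses back to exactly x
```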

(3) xlrd builds Cell objects on the fly when demanded, so asking for the value directly avoids creating one:

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
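
Along the same lines, when a whole row is needed, xlrd's Sheet.row_values(rowx) returns the plain values for that row in one call. The helper below is a minimal sketch: iter_row_values is a hypothetical name, and it only assumes that the sheet object exposes nrows and row_values(rowx) as xlrd sheets do.

```python
def iter_row_values(sheet, start_row=0):
    """Yield each row as a list of plain values, one list per row."""
    for rowx in range(start_row, sheet.nrows):
        yield sheet.row_values(rowx)
```

With start_row=1 this walks the same rows as the loops above, for example: for row in iter_row_values(sheettwo, 1): ... and then row[x] and row[y] give the two values being formatted.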
