
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3'

I have an Excel spreadsheet that I'm reading in that contains some £ signs.

When I try to read it in using the xlrd module, I get the following error:

x = table.cell_value(row, col)
x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.

How can I fix this, and read the £ signs in correctly?

--- UPDATE ---

Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.

If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):

for item in items:
    #item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)
File "clean_up_barnet.py", line 104, in <module>
 cleancsv.writerow(item)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)

I get the same error even if I uncomment the Latin-1 line.

A very easy way around all the "'ascii' codec can't encode character…" issues with csvwriter is to use unicodecsv instead, a drop-in replacement for csvwriter.

Install unicodecsv with pip and then you can use it in exactly the same way, e.g.:

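import unicodecsv

# unicodecsv writes encoded bytes, so the file is opened in binary mode
with open('users.csv', 'wb') as f:
    w = unicodecsv.writer(f, encoding='utf-8')
    # User is assumed to be a Django model defined elsewhere
    for user in User.objects.all().values_list('first_name', 'last_name', 'email', 'last_login'):
        w.writerow(user)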

For what it's worth: I'm the author of xlrd.

Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of the xlrd doc: "This module presents all text strings as Python unicode objects."
Option 2: print type(text), repr(text)

You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What did you expect?
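The garbling can be reproduced in a few lines. This is only a minimal sketch, written so that it runs on Python 3 (where text is str and encoded data is bytes); the variable names are made up for illustration and do not come from the question.

```python
# A pound sign as text (Unicode code point U+00A3).
pound = u"\xa3"

# Encoding to UTF-8 yields two bytes for this one character.
utf8_bytes = pound.encode("utf-8")

# A consumer that assumes Latin-1 decodes every byte as its own character,
# so the single pound sign turns into two characters (displayed as "Â£").
garbled = utf8_bytes.decode("latin-1")

# Encoding with the codec the consumer actually expects round-trips cleanly.
latin1_bytes = pound.encode("latin-1")
restored = latin1_bytes.decode("latin-1")

print(repr(utf8_bytes), repr(garbled), repr(restored))
```

The point is that the encoding used when writing has to match the one the reader assumes; neither side is "wrong" on its own.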

You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.

Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.

It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
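Which characters a given codec can represent is easy to check directly. The following is a small sketch in Python 3 syntax; can_encode is a hypothetical helper written for this illustration, not part of xlrd or the standard library.

```python
pound = u"\xa3"     # U+00A3 POUND SIGN
bullet = u"\u2022"  # U+2022 BULLET


def can_encode(text, codec):
    """Return True if every character in text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Print, for each codec, whether it can represent the pound sign and the bullet.
for codec in ("ascii", "latin-1", "cp1252", "utf-8"):
    print(codec, can_encode(pound, codec), can_encode(bullet, codec))
```

Latin-1 handles the pound sign but not the bullet, whereas cp1252 and UTF-8 handle both, which is why the error only appears once the bullet reaches the writer.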

Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must first be turned into a string of bytes (and that's where the default codec, ascii, comes up and fails). In your text you then say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.

So what IS it you're doing -- and what is it you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?

I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because I find it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).

If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input/output purposes).

x = x.decode("ISO-8859-1")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)

Look closely: You got a Unicode***Encode***Error calling the decode method.

The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.

In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as

x = x.encode('ascii').decode("ISO-8859-1")

which fails because x contains a non-ASCII character.
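The hidden first step can be made visible by performing it by hand. This is a sketch written for Python 3, where str has no decode method, so the implicit ASCII encode that Python 2 performs has to be spelled out explicitly; x here is just a stand-in for the cell value.

```python
x = u"\xa3"  # stand-in for the unicode value read from the spreadsheet cell

failure = None
try:
    # Step 1 of the implicit conversion: text -> ASCII bytes. This is the step that fails.
    as_bytes = x.encode("ascii")
    # Step 2 (the decode that was actually requested) is never reached.
    result = as_bytes.decode("ISO-8859-1")
except UnicodeEncodeError as exc:
    failure = exc
    # The exception records which codec failed and at which position.
    print(type(exc).__name__, exc.encoding, exc.start)
```

The error names the ascii codec rather than ISO-8859-1 because the failure happens before the requested decode ever runs.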

Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects.

for item in items:
    item = [x.encode('latin-1') for x in item]
    cleancsv.writerow(item)

This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:

  • Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
  • Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
  • Replace the bullets with a Latin-1 character like * or ·.
  • Use a character encoding that does contain all the characters you need.

These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
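The four options listed above can be compared side by side. This is a minimal sketch in Python 3 syntax; the sample text is invented for the illustration.

```python
# Sample text containing both a bullet (U+2022) and a pound sign (U+00A3).
text = u"\u2022 Price: \xa3100"

# Option 1: silently drop anything Latin-1 cannot represent.
dropped = text.encode("latin-1", "ignore")

# Option 2: substitute a question mark for each unrepresentable character.
replaced = text.encode("latin-1", "replace")

# Option 3: swap the bullet for a Latin-1 character before encoding.
substituted = text.replace(u"\u2022", u"*").encode("latin-1")

# Option 4: use an encoding that covers every character.
as_utf8 = text.encode("utf-8")

for label, data in [("ignore", dropped), ("replace", replaced),
                    ("substitute", substituted), ("utf-8", as_utf8)]:
    print(label, repr(data))
```

Only the last option keeps the data intact; the first three all lose information in some way, so they are best reserved for cases where the output really must be Latin-1.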

xlrd works with Unicode, so the string you get back is a Unicode string. The £ sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.

When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.


>>> book = xlrd.open_workbook( "test.xls" )
>>> sh = book.sheet_by_index( 0 )
>>> x = sh.cell_value( 0, 0 )
>>> x
u'\xa3'
>>> print x
£

# sample outputs (for e.g. writing to a file)
>>> x.encode( "latin-1" )
'\xa3'
>>> x.encode( "utf-8" )
'\xc2\xa3'

# garbage, because x is already Unicode
>>> x.decode( "ascii" )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0:
ordinal not in range(128)
>>>
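For comparison, here is how the "encode only at the point of output" rule might look when the destination is a CSV file. This sketch assumes Python 3, where the csv module accepts text directly and the encoding is given when the file is opened; the rows and the file name are invented for the example, and on Python 2 a different approach (such as encoding each field first) would be needed.

```python
import csv
import io
import os
import tempfile

# Rows are kept as Unicode text everywhere inside the program.
rows = [[u"\u2022 item", u"\xa3100"]]

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "clean.csv")

# The encoding is chosen once, at the point where text leaves the program.
with io.open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Reading back with the same encoding recovers the original text.
with io.open(path, "r", encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```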

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So, I suppose using strings as arguments in certain commands with xlrd has specific encoding problems.


