
How do I check whether I have encoded in utf-8 successfully?

Given a string

u = 'abc'

which syntax is the right one to encode into utf8?

u.encode('utf-8')

or

u.encode('utf8')

And how do I know that I have already encoded in utf-8?

First of all, you need to distinguish between Python 2 and Python 3, because Unicode handling is one of the biggest differences between the two versions.

Python 2

  • the unicode type contains text characters
  • str contains sequences of 8-bit bytes, sometimes representing text in some unspecified encoding
  • s.decode(encoding) takes a sequence of bytes and builds a text string out of it, given the encoding used by the bytes. It goes from str to unicode; for example "Citt\xe0".decode("iso8859-1") will give you the text "Città" (Italian for city), and so will "Citt\xc3\xa0".decode("utf-8"). The encoding may be omitted, and in that case the meaning is "use the default encoding".
  • u.encode(encoding) takes a text string and builds the byte sequence representing it in the given encoding, thus reversing the processing of decode. It goes from unicode to str. As above, the encoding can be omitted.

Part of the confusion when handling Unicode with Python 2 is that the language tries to be a bit too smart and does things automatically.

For example, you can also call encode on a str object, and the meaning is "decode these bytes using the default encoding, then encode the resulting text using the specified encoding (or the default encoding if none is specified)".

Similarly, you can also call decode on a unicode object, meaning "encode this text using the default encoding, then decode the resulting bytes using the specified encoding".

For example, if I write

u"Citt\u00e0".decode("utf-8")

Python gives the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

NOTE: the error is about encoding that failed, while I asked for decoding. The reason is that I asked to decode text (nonsense, because that is already "decoded"... it's text), so Python decided to first encode it using the "ascii" encoding, and that failed. IMO it would have been much better simply not to have decode on unicode objects or encode on str objects: the error message would have been clearer.
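The hidden step can be reproduced by hand. This is a minimal sketch written for Python 3 (an assumption, since the failing call above only exists in Python 2): encoding the same text to ASCII explicitly raises the same kind of error that Python 2 produced behind the scenes. The variable names are illustrative.

```python
# Python 3 sketch: perform by hand the implicit ASCII encode that
# Python 2 runs before "decoding" a unicode object.
text = "Citt\u00e0"

try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc  # keep the exception so it can be inspected below

# The exception records which codec failed and where.
print(type(error).__name__, error.encoding, error.start)
```

The position reported is the index of the "à" character in the text, which is what the ASCII codec could not represent.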

A further source of confusion is that in Python 2 str is used for raw bytes, but it is also used everywhere for text; for example, string literals are str objects.

Python 3

To solve some of these issues, Python 3 made a few key changes:

  • str is for text and contains Unicode characters; string literals are Unicode text
  • the unicode type no longer exists
  • the bytes type is used for sequences of 8-bit bytes that may represent text in some unspecified encoding

For example, in Python 3:

'Città'.encode('iso8859-1') → b'Citt\xe0'
'Città'.encode('utf-8')     → b'Citt\xc3\xa0'

Also, you cannot call decode on text strings and you cannot call encode on byte sequences.
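The points above can be checked with a short Python 3 sketch. It also covers the original question about spelling: as far as the codec registry is concerned, 'utf8' and 'utf-8' name the same codec. The variable names here are illustrative only.

```python
import codecs

text = "Citt\u00e0"  # "Città"

# Text -> bytes: the result depends on the encoding chosen.
latin1_bytes = text.encode("iso8859-1")
utf8_bytes = text.encode("utf-8")

# Bytes -> text: decoding with the matching encoding restores the text.
assert latin1_bytes.decode("iso8859-1") == text
assert utf8_bytes.decode("utf-8") == text

# In Python 3, str has no decode() and bytes has no encode().
str_has_decode = hasattr(text, "decode")
bytes_has_encode = hasattr(utf8_bytes, "encode")

# Both spellings from the question resolve to the same codec.
same_codec = codecs.lookup("utf8").name == codecs.lookup("utf-8").name

print(latin1_bytes, utf8_bytes, str_has_decode, bytes_has_encode, same_codec)
```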

Failures

Sometimes encoding text into bytes may fail, because the specified encoding cannot handle all of Unicode. For example, iso8859-1 cannot handle Chinese. These errors can be handled in a few ways, such as raising an exception (the default) or replacing the characters that cannot be encoded with something else.
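A small Python 3 sketch of these options, using the standard errors argument of str.encode. The sample text is just two Chinese characters chosen for illustration.

```python
text = "\u4e2d\u6587"  # two Chinese characters, outside the iso8859-1 repertoire

# Default behaviour (errors="strict"): an exception is raised.
try:
    text.encode("iso8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes "?" for each character that cannot be encoded.
replaced = text.encode("iso8859-1", errors="replace")

# errors="ignore" silently drops characters that cannot be encoded.
ignored = text.encode("iso8859-1", errors="ignore")

print(strict_failed, replaced, ignored)
```

Both "replace" and "ignore" lose information, so the original text cannot be recovered from the resulting bytes.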

The utf-8 encoding, however, is able to encode any Unicode character, and thus encoding to utf-8 never fails. So it doesn't make sense to ask how to know whether encoding text into utf-8 was done correctly, because it always succeeds (for utf-8).
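As an illustration, here is a Python 3 sketch that round-trips a few sample strings through utf-8. The samples are arbitrary, and the sketch only covers ordinary characters, not unusual input such as unpaired surrogate code points.

```python
samples = ["abc", "Citt\u00e0", "\u4e2d\u6587", "\U0001F600"]

# Each sample encodes without error and decodes back to itself.
round_trips = [s.encode("utf-8").decode("utf-8") == s for s in samples]

# utf-8 is a variable-width encoding: 1 to 4 bytes per character.
byte_lengths = [len(s.encode("utf-8")) for s in samples]

print(round_trips, byte_lengths)
```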

Decoding may also fail, because the sequence of bytes may make no sense in the specified encoding. For example, the byte sequence 0x43 0x69 0x74 0x74 0xE0 cannot be interpreted as utf-8, because the byte 0xE0 starts a three-byte sequence and must be followed by two continuation bytes, which are missing here.
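The failing case can be reproduced in Python 3 with a few lines. This is only a sketch; the variable names are illustrative.

```python
# The five bytes from the example: "Citt" followed by a lone 0xE0.
data = bytes([0x43, 0x69, 0x74, 0x74, 0xE0])

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc  # keep the exception so it can be inspected below

# The exception records the offset of the offending byte.
print(type(error).__name__, error.start)

# The same bytes are perfectly valid iso8859-1.
as_latin1 = data.decode("iso8859-1")
```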

There are, however, encodings like iso8859-1 where decoding cannot fail, because every byte value 0..255 has a meaning as a character. Most "local encodings" are of this type... they map all 256 possible 8-bit values to some character, but cover only a tiny fraction of the Unicode characters.

Decoding using iso8859-1 will never raise an error (any byte sequence is valid), but of course it can give you nonsense text if the bytes were using another encoding.
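A Python 3 sketch of both effects: decoding utf-8 bytes as iso8859-1 succeeds but produces the wrong text, and every one of the 256 byte values decodes to some character.

```python
utf8_bytes = "Citt\u00e0".encode("utf-8")  # b'Citt\xc3\xa0'

# Decoding with the wrong encoding succeeds but yields the wrong text:
# the two bytes of "à" become two separate characters.
wrong = utf8_bytes.decode("iso8859-1")

# Every possible byte value decodes to exactly one character.
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode("iso8859-1")

print(ascii(wrong), len(decoded_all))
```

This is why a successful iso8859-1 decode says nothing about whether the bytes were really written in that encoding.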

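First solution (Python 2; checks that u is text rather than bytes):

isinstance(u, unicode)

Second solution (checks that a byte string is valid UTF-8):

try:
    # string is a byte string: str in Python 2, bytes in Python 3
    string.decode('utf-8')
    print("string is UTF-8, length %d bytes" % len(string))
except UnicodeError:
    print("string is not UTF-8")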
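Both solutions can be wrapped into small helpers for Python 3. This is a sketch: the function names is_text and is_valid_utf8 are hypothetical and not part of any library, and the first helper assumes that str is the Python 3 counterpart of the Python 2 unicode type.

```python
def is_text(value):
    """Return True if value is text (str in Python 3), not bytes."""
    return isinstance(value, str)


def is_valid_utf8(data):
    """Return True if the bytes object decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_text("abc"), is_text(b"abc"))
print(is_valid_utf8(b"Citt\xc3\xa0"), is_valid_utf8(b"Citt\xe0"))
```

Note that a successful check only shows that the bytes are valid UTF-8, not that they were meant as UTF-8: plain ASCII data, for instance, passes as well.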
