
What is the difference between u'' prefix and unicode() in Python?

What is the difference between the u'' prefix and unicode()?

# -*- coding: utf-8 -*-
print u'上午'  # this works
print unicode('上午', errors='ignore')  # runs, but prints nothing
print unicode('上午')  # raises UnicodeDecodeError

For the third print, the error shows: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0

If I have a text file containing non-ASCII characters, such as "上午", how do I read it and print it out correctly?

  • u'..' is a string literal, and decodes the characters according to the source encoding declaration.

  • unicode() is a function that converts another type to a unicode object. You've given it a byte string literal, and it will decode a byte string according to the default ASCII codec.

So you created a byte string object using a different type of literal notation, then tried to convert it to a unicode object, which fails because the default codec for str -> unicode conversions is ASCII.
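A minimal sketch of that failure, written so that it should run on both Python 2.7 and Python 3. It assumes the default codec is ASCII, as described above, and spells the bytes of "上午" as escapes; the names raw and message are illustrative only.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for the two characters in the question, written as escapes
# so the example does not depend on the source encoding declaration.
raw = b'\xe4\xb8\x8a\xe5\x8d\x88'

try:
    # Decoding with the ASCII codec is what unicode(raw) attempts in
    # Python 2 when no codec is given and the default is ASCII.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first byte, 0xe4, is already outside the ASCII range, which is why the error points at position 0.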

The two are quite different beasts. If you want to use the latter, you need to give it an explicit codec:

print unicode('上午', 'utf8')

The two are related in the same way that 0xFF and int('0xFF', 0) are related: the former defines an integer of value 255 using hex notation, the latter uses the int() function to extract an integer from a string.
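The analogy can be checked side by side. This is only a sketch; the variable names are made up for illustration, and the text is written with escapes so the snippet should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Literal notation: the parser builds the value directly.
from_literal = 0xFF
# Conversion function: a string is parsed at run time
# (base 0 tells int() to honour the 0x prefix).
from_function = int('0xFF', 0)

# Same pattern for text: a unicode literal versus decoding a byte string.
text_literal = u'\u4e0a\u5348'
text_decoded = b'\xe4\xb8\x8a\xe5\x8d\x88'.decode('utf8')

print(from_literal == from_function == 255)
print(text_literal == text_decoded)
```

In both pairs the results are equal; what differs is whether the value comes from literal syntax or from converting another object.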

An alternative would be to use the str.decode() method:

print '上午'.decode('utf8')
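For the text-file case raised in the question, one possible approach is to let io.open do the decoding while reading. The sketch below assumes the file is UTF-8 encoded; the file name sample.txt and the temporary directory are hypothetical and exist only to make the example self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'\u4e0a\u5348\n')

# io.open decodes while reading, so the result is already Unicode text.
with io.open(path, 'r', encoding='utf-8') as handle:
    content = handle.read()

print(content == u'\u4e0a\u5348\n')
```

Whether the text then prints correctly also depends on the encoding of the terminal, which this sketch does not cover.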

Don't be tempted to use an error handler (such as 'ignore' or 'replace') unless you know what you are doing. 'ignore' especially can mask underlying issues, such as having picked the wrong codec.
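A short sketch of how the handlers hide the problem, using the same UTF-8 bytes as the question. It also suggests why the second print in the question produced no visible text. Variable names are illustrative.

```python
# -*- coding: utf-8 -*-
raw = b'\xe4\xb8\x8a\xe5\x8d\x88'  # UTF-8 bytes of the two characters

# Wrong codec plus 'ignore': every byte is non-ASCII, so all are dropped.
ignored = raw.decode('ascii', 'ignore')
# Wrong codec plus 'replace': each undecodable byte becomes U+FFFD.
replaced = raw.decode('ascii', 'replace')
# Right codec, no error handler needed.
correct = raw.decode('utf8')

print(len(ignored))
print(len(replaced))
print(len(correct))
```

Neither handler raises an error, yet only the last call recovers the original two characters.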

You may want to read up on Python and Unicode.

When a string literal is not prefixed with u in Python 2.7.x, what the interpreter sees is a byte string, without an explicit encoding.

If you do not tell the interpreter what to do with those bytes when executing unicode(), it will (as you saw) default to trying to decode the bytes it sees via the ascii encoding scheme.

It does so as a preliminary step in trying to turn the plain bytes of the str into a unicode object.

Using ascii to decode means: try to interpret each byte of the str using a hard-coded mapping that is defined only for byte values between 0 and 127.

The error you encountered was like a dict KeyError: the interpreter encountered a byte for which the ascii encoding scheme does not have a specified mapping.

Since the interpreter doesn't know what to do with the byte, it raises an error.
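The KeyError analogy can be made literal with a toy lookup table. This is not how the real ascii codec is implemented; ascii_table and decode_with_table are hypothetical names used only to illustrate the idea.

```python
# Hypothetical stand-in for the ASCII table: only byte values 0..127 have entries.
ascii_table = dict((value, chr(value)) for value in range(128))

def decode_with_table(data):
    # bytearray yields integers in both Python 2 and Python 3.
    return ''.join(ascii_table[byte] for byte in bytearray(data))

print(decode_with_table(b'AM'))

try:
    decode_with_table(b'\xe4\xb8\x8a')
    missing = None
except KeyError as exc:
    # The first byte with no entry in the table.
    missing = exc.args[0]

print(hex(missing))
```

The lookup fails on the first byte, 0xe4, which is the same byte the UnicodeDecodeError in the question reports.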

You can change that preliminary step by telling the interpreter to decode the bytes using another set of encoding/decoding mappings instead, one that goes beyond ascii, such as UTF-8, as elaborated in other answers.

If the interpreter finds a mapping in the chosen scheme for each byte (or sequence of bytes) in the str, it will decode successfully, and the interpreter will use the resulting mappings to produce a unicode object.

A Python unicode object is a series of Unicode code points. There are 1,112,064 valid code points in the Unicode code space.
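To make "a series of code points" concrete, the sketch below decodes the six UTF-8 bytes from the question and lists the resulting code points. It also shows one way to arrive at the 1,112,064 figure: the full code space minus the surrogate range. Variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# Six UTF-8 bytes decode to two code points (three bytes each).
text = b'\xe4\xb8\x8a\xe5\x8d\x88'.decode('utf8')
code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])

# Code space U+0000..U+10FFFF, minus the surrogates U+D800..U+DFFF.
valid = 0x110000 - (0xDFFF - 0xD800 + 1)
print(valid)
```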

And if the scheme you choose for decoding is the one with which your text (or code points) was encoded, then the output when printing should be identical to the original text.
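A small round-trip sketch of that point. latin-1 is picked here purely as an example of a mismatched codec; it maps every byte to some character, so it never raises and the mismatch goes unnoticed.

```python
# -*- coding: utf-8 -*-
original = u'\u4e0a\u5348'
encoded = original.encode('utf8')

# Decoding with the codec that was used for encoding restores the text.
same = encoded.decode('utf8')
# Decoding with a different codec yields six unrelated characters, with no error.
other = encoded.decode('latin-1')

print(same == original)
print(other == original)
```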

You can also consider trying Python 3. The relevant difference is explained in the first comment below.
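As a rough illustration for Python 3 only (this snippet is not meant to run on Python 2): a plain string literal there is already Unicode text, and bytes are a separate type that must be decoded or encoded explicitly. Variable names are illustrative.

```python
# Python 3 only: plain literals are Unicode text, bytes are a distinct type.
text = '\u4e0a\u5348'
data = text.encode('utf-8')

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

try:
    text + data  # mixing the two types is refused rather than silently decoded
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok)
```

Because there is no implicit ASCII decoding step, the kind of error shown in the question tends to surface as an explicit type mismatch instead.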

unicode is an object type, whereas the u prefix is literal notation used to denote that the object is a unicode object. It is similar to the L suffix used to denote a long int.

Try: '上午'.decode('utf8', 'ignore').encode('utf8')

