Python Unicode characters
I know the subject is not new, but I have tried a lot of solutions without success. I am using Python 2.7 (and I am not a very experienced user). My problem: I read a file:
my_file=open("file")
and then save one line (which contains the word "pitié") into a variable and print it:
line=my_file.readline()
print line
>> pitié
There I get "pitié" as the result. But if I want to manipulate it, I see that my variable (a string) contains raw bytes:
line
>> 'piti\xc3\xa9'
My problem is that when I need to manipulate this string, I need to have the "é" character, for example to put it in a Flask template. I tried some encode/decode operations, but I'm very confused. I get the usual
UnicodeDecodeError: 'ascii' codec can't decode byte 0x.. in position .: ordinal not in range(...)
What does print do to produce the right output?
Thanks!
Welcome to the world of Unicode! Your file is saved in UTF-8, a multibyte encoding, so characters outside the ASCII range of 0-127 require two or more bytes. Read the file using the codecs or io module and declare the encoding, so that it is read as a Unicode string in which each non-ASCII code point up to 65535 occupies a single position instead of several bytes. Switch to Python 3.3+ and every Unicode code point will occupy a single position.
Note that the first line of the example below declares the encoding of the source file. It does not have to match the encoding of the data file; it is there so Python knows the encoding of the literal Unicode string u'é' in the source.
#coding: utf8
import io
with io.open('file', encoding='utf8') as my_file:
    line = my_file.readline()
print line
print repr(line)
print line.index(u'é')
Output:
pitié
u'piti\xe9'
4
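If you already hold the byte string rather than the file, the same idea can be sketched as an explicit decode. This is only an illustration: the names raw, text and error_message are made up here, and the bytes are typed in by hand to stand in for what readline() returned in the question. It is written so that it should run on both Python 2.7 and Python 3. Decoding with the ascii codec reproduces the kind of error quoted in the question, which is what Python 2 attempts implicitly when a byte string is mixed with a Unicode string.

```python
# Bytes as stored in the UTF-8 file: 'é' is the two bytes 0xC3 0xA9.
raw = b'piti\xc3\xa9'

# Decoding with the wrong codec fails on the first non-ASCII byte.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)

# Decoding with the codec the file was saved in gives a Unicode string.
text = raw.decode('utf-8')

print(len(raw))               # 6 bytes
print(len(text))              # 5 code points
print(text.index(u'\xe9'))    # 4, same as the example above
```

The Unicode string in text is the form you would normally hand on to other code, since the "é" is now a single character rather than two bytes.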
You're seeing two different display methods: print shows you the pretty version, and just typing line gives you the raw "repr" version. Nothing is wrong with the string. If you write it to a file, it will be just as it was in your original input file.
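A quick way to convince yourself of this is a round trip through a temporary file. The sketch below assumes the bytes from the question; the temporary path and the variable names are only for illustration. It should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

# The byte string exactly as the interpreter displayed it.
raw = b'piti\xc3\xa9'

# Create a scratch file and write the bytes back out unchanged.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, 'wb') as out:
        out.write(raw)
    # Read the file once as raw bytes and once decoded as UTF-8.
    with io.open(path, 'rb') as as_bytes:
        same_bytes = as_bytes.read()
    with io.open(path, encoding='utf-8') as as_text:
        text = as_text.read()
finally:
    os.remove(path)

print(same_bytes == raw)       # True: the file holds the same bytes
print(text == u'piti\xe9')     # True: those bytes are valid UTF-8 for the word
```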