Python: working with unicode characters
I'm trying to learn how to work with Unicode in Python.
Let's say I have a file named test containing Unicode characters: áéíóúabcdefgçë
I want to write a Python script that prints out all the unique characters in the file. This is what I have:
#!/usr/bin/python
import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
        with open(filename, "r") as fp:
            string = fp.read().replace('\n', '')
        chars = set()
        for char in string:
            chars.add(char)
        for char in chars:
            sys.stdout.write(char)
        print("")

if __name__ == "__main__":
    main()
This doesn't print the Unicode characters properly:
$ ./unicode.py test
▒a▒bedgf▒▒▒▒c▒▒
What is the correct way to do this, so that the characters print properly?
Your data is encoded, most likely as UTF-8. UTF-8 uses more than one byte to encode non-ASCII characters, such as áéíóú. In Python 2, iterating over a string encoded as UTF-8 yields the individual bytes that make up the string, rather than the characters that you are expecting.
>>> s = 'áéíóúabcdefgçë'
# There are 14 characters in s, but it contains 21 bytes
>>> len(s)
21
>>> s
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xbaabcdefg\xc3\xa7\xc3\xab'
# The first "character" (actually, byte) is unprintable.
>>> print s[0]
�
# So is the second.
>>> print s[1]
�
# But together they make up a character.
>>> print s[0:2]
á
So printing individual bytes doesn't work as expected.
>>> for c in s:print c,
...
� � � � � � � � � � a b c d e f g � � � �
But decoding the string to unicode and then printing it does work.
>>> for c in s.decode('utf-8'):print c,
...
á é í ó ú a b c d e f g ç ë
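The session above is Python 2. As a rough sketch of the same idea under Python 3 (an assumption, since the question does not use it), text and encoded bytes are separate types, `str` and `bytes`, so the difference between characters and bytes can be checked directly. The variable names here are illustrative only.

```python
# Python 3 sketch: text (str) versus its UTF-8 encoded form (bytes).
s = "áéíóúabcdefgçë"
data = s.encode("utf-8")

char_count = len(s)      # counts characters (code points)
byte_count = len(data)   # counts bytes in the encoded form

# The first two bytes together decode to the first character.
first_char = data[:2].decode("utf-8")

# A single byte of a two-byte sequence is not valid UTF-8 on its own.
try:
    data[:1].decode("utf-8")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(char_count, byte_count, lone_byte_ok)  # 14 21 False
```

The seven accented characters take two bytes each and the seven ASCII letters take one, which is where the 21 comes from.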
To make your code work as you expect, you need to decode the string you read from the file. Change
string = fp.read().replace('\n', '')
to
string = fp.read().replace('\n', '').decode('utf-8')
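For comparison, here is a minimal sketch of the whole task assuming Python 3, where `open()` accepts an `encoding` argument and returns already-decoded text, so no separate `.decode()` call is needed. The helper names `unique_chars` and `unique_chars_in_file` are made up for this example, and the demo writes the sample text to a temporary file instead of reading the `test` file from the question.

```python
import os
import tempfile


def unique_chars(text):
    """Return the distinct characters in text, newlines excluded, sorted by code point."""
    return sorted(set(text.replace("\n", "")))


def unique_chars_in_file(path, encoding="utf-8"):
    """Read path as text in the given encoding and return its distinct characters."""
    with open(path, "r", encoding=encoding) as fp:
        return unique_chars(fp.read())


# Demo: write the sample text to a temporary file, then read it back.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("áéíóúabcdefgçë\n")
    result = unique_chars_in_file(path)
finally:
    os.remove(path)

print("".join(result))
```

Sorting is optional; it only makes the output order stable, since iterating over a set gives no guaranteed order.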
This depends on the version of Python you are using:
1. In Python 2, strings are byte strings by default and source files are assumed to be ASCII, so the encoding had to be declared explicitly, with a header such as:
# -*-coding:utf-8-*-
2. In Python 3, the support is native, as it says here. Strings are Unicode by default, so the UTF-8 encoding is already supported natively.
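To illustrate what the header does, here is a small sketch. The declaration tells the interpreter how the source file itself is encoded, which is what allows non-ASCII literals in Python 2; it does not decode data read from a file at runtime, so it would probably not fix the script in the question on its own. The `u` prefix is used so the literal is a Unicode string in both Python 2 and Python 3.3+; the variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# The declaration above must be on the first or second line of the file.
# Python 2 needs it for non-ASCII literals; Python 3 already assumes UTF-8.

text = u"áéíóúabcdefgçë"    # a Unicode string, not a byte string
unique = sorted(set(text))  # distinct characters, sorted by code point

print(len(text), len(unique))
```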