Python: Working with Unicode characters

Question

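I'm trying to learn how to work with Unicode in Python.

Suppose I have a file named test that contains Unicode characters: áéíóúabcdefgçë. I want to write a Python script that prints every unique character in the file. This is what I have so far: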

#!/usr/bin/python

import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
        with open(filename, "r") as fp:
            string = fp.read().replace('\n', '')
        chars = set()
        for char in string:
            chars.add(char)
        for char in chars:
            sys.stdout.write(char)
        print("")

if __name__ == "__main__":
    main()

This does not print the Unicode characters correctly:

$ ./unicode.py test
▒a▒bedgf▒▒▒▒c▒▒

What is the right way to get the characters to print correctly?

Answer 1

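Your data is encoded, most likely as UTF-8. UTF-8 uses multiple bytes to encode non-ASCII characters such as áéíóú. In Python 2, iterating over a UTF-8 encoded byte string yields the individual bytes that make up the string, not the characters you expect.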

>>> s = 'áéíóúabcdefgçë'
# There are 14 characters in s, but it contains 21 bytes
>>> len(s)
21
>>> s
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xbaabcdefg\xc3\xa7\xc3\xab'

# The first "character" (actually, byte) is unprintable.
>>> print s[0]
�
# So is the second.
>>> print s[1]
�
# But together they make up a character.
>>> print s[0:2]
á

So printing individual bytes does not work as expected.

>>> for c in s:print c,
... 
� � � � � � � � � � a b c d e f g � � � �

Decoding the string to unicode first and then printing it does work.

>>> for c in s.decode('utf-8'):print c,
... 
á é í ó ú a b c d e f g ç ë
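The same distinction can be sketched in Python 3, where encoded data is a bytes object and decoded text is a str. This is only an illustration, assuming the data is UTF-8; the variable names are made up for the example, and the sample text is written with escapes so the snippet does not depend on the editor's encoding.

```python
# Python 3 sketch: bytes (encoded data) versus str (decoded text).
# The escapes below spell the question's sample text: áéíóúabcdefgçë
text = '\u00e1\u00e9\u00ed\u00f3\u00faabcdefg\u00e7\u00eb'
data = text.encode('utf-8')  # the bytes as a UTF-8 file would store them

char_count = len(text)  # counts characters (code points)
byte_count = len(data)  # counts bytes
print(char_count, byte_count)

# The first character occupies two bytes in UTF-8.
first_two = data[:2]
print(first_two)                        # shown as a bytes literal
first_char = first_two.decode('utf-8')  # decoded together they form one character
print(first_char)

# Iterating over the decoded string yields whole characters.
spaced = ' '.join(data.decode('utf-8'))
print(spaced)
```

Decoding a single byte of a multi-byte sequence, such as `data[:1].decode('utf-8')`, raises UnicodeDecodeError, which is the Python 3 counterpart of the unprintable bytes shown above.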

To make your code work as expected, decode the string read from the file. Change

string = fp.read().replace('\n', '')

to

string = fp.read().replace('\n', '').decode('utf-8')
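On Python 3 the decoding step is usually done by passing an encoding to open(), so read() already returns decoded text. The following is a minimal sketch under that assumption; the function name `unique_chars` and its `encoding` parameter are hypothetical and not part of the original script.

```python
def unique_chars(filename, encoding='utf-8'):
    """Return the set of distinct characters in a text file, ignoring newlines.

    Assumes the file is encoded as `encoding` (UTF-8 unless stated otherwise).
    """
    # open() decodes the bytes while reading, so `text` is a str of characters.
    with open(filename, 'r', encoding=encoding) as fp:
        text = fp.read().replace('\n', '')
    # A set keeps one copy of each character; its iteration order is arbitrary.
    return set(text)
```

For the file in the question, `''.join(sorted(unique_chars('test')))` should give `abcdefgáçéëíóú`, since sorting orders the characters by code point. Whether the result displays correctly still depends on the terminal's encoding.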

Answer 2

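It depends on which Python version you are using:

1. In Python 2, source files are not assumed to be UTF-8 and the default str type holds bytes, so a script containing non-ASCII literals needs an explicit encoding declaration at the top of the file, for example:

# -*- coding: utf-8 -*-

2. In Python 3, Unicode support is built in: str holds Unicode text and source files are read as UTF-8 by default.

So in Python 3, UTF-8 is already supported natively.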
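One caveat worth checking on Python 3: the source-file default does not by itself fix the encoding used for file and terminal I/O. The sketch below only inspects the relevant settings; the values for the locale and for stdout vary by platform, so no particular output is assumed for them.

```python
import locale
import sys

# Encoding used for implicit str/bytes conversions; 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()
print(default_encoding)

# Encoding open() is likely to use when no encoding argument is given;
# this depends on the operating system and locale settings.
preferred = locale.getpreferredencoding(False)
print(preferred)

# Encoding used when printing to the terminal; may be None if stdout is replaced.
print(sys.stdout.encoding)
```

If the second value is not UTF-8, passing `encoding='utf-8'` to open() explicitly is the safer choice for a UTF-8 file.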


Answer 1: score 1, accepted, 2018-07-12 05:23:02

Answer 2: score -2, 2018-07-11 22:06:26
