
Python: working with unicode characters

I'm trying to learn how to work with Unicode in Python.

Let's say I have a file named test containing Unicode characters: áéíóúabcdefgçë. I want to write a Python script that prints out all the unique characters in the file. This is what I have:

#!/usr/bin/python

import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
        with open(filename, "r") as fp:
            string = fp.read().replace('\n', '')
        chars = set()
        for char in string:
            chars.add(char)
        for char in chars:
            sys.stdout.write(char)
        print("")

if __name__ == "__main__":
    main()

This doesn't print the Unicode characters properly:

$ ./unicode.py test
▒a▒bedgf▒▒▒▒c▒▒

What is the correct way to do this, so that the characters print properly?

Your data is encoded, most likely as UTF-8. UTF-8 uses more than one byte to encode each non-ASCII character, such as áéíóú. In Python 2, iterating over a str that holds UTF-8 data yields the individual bytes that make up the string, rather than the characters you are expecting.

>>> s = 'áéíóúabcdefgçë'
# There are 14 characters in s, but it contains 21 bytes
>>> len(s)
21
>>> s
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xbaabcdefg\xc3\xa7\xc3\xab'

# The first "character" (actually, byte) is unprintable.
>>> print s[0]
�
# So is the second.
>>> print s[1]
�
# But together they make up a character.
>>> print s[0:2]
á

So printing individual bytes doesn't work as expected.

>>> for c in s:print c,
... 
� � � � � � � � � � a b c d e f g � � � �

But decoding the string to unicode first, and then printing, does.

>>> for c in s.decode('utf-8'):print c,
... 
á é í ó ú a b c d e f g ç ë
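For comparison, here is a rough Python 3 sketch of the same experiment. It is an illustration and not part of the session above; the names text and data are made up for the example. In Python 3 the two views are separate types: str holds characters and bytes holds the encoded form.

```python
# Python 3 sketch: text (str) versus its UTF-8 encoding (bytes).
text = "áéíóúabcdefgçë"
data = text.encode("utf-8")

print(len(text))   # 14 characters
print(len(data))   # 21 bytes

# The first character occupies two bytes in UTF-8.
first_two = data[:2]
print(first_two)                   # b'\xc3\xa1'
print(first_two.decode("utf-8"))   # the first character of text

# A single byte taken from a multi-byte sequence is not valid UTF-8 by itself.
try:
    data[:1].decode("utf-8")
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False
print(partial_ok)   # False
```

The failed decode of a single byte is the Python 3 counterpart of the unprintable characters shown in the Python 2 session.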

To make your code work as you expect, you need to decode the string you read from the file. Change

string = fp.read().replace('\n', '')

to

string = fp.read().replace('\n', '').decode('utf-8')
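An alternative to calling decode after reading is to let the file object decode as it reads. The sketch below assumes the file is UTF-8 and uses io.open, which exists in Python 2.6+ and is the same function as the built-in open in Python 3. The helper name unique_chars is hypothetical and not taken from the question.

```python
import io


def unique_chars(path, encoding="utf-8"):
    """Return the set of distinct characters in a text file, ignoring newlines."""
    # io.open decodes while reading, so the result is text and
    # iterating over it yields characters, not bytes.
    with io.open(path, "r", encoding=encoding) as fp:
        text = fp.read().replace(u"\n", u"")
    return set(text)
```

In the script from the question, the loop that fills chars could then be replaced by chars = unique_chars(filename). Whether the characters display correctly still depends on the terminal's encoding.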

This depends on the version of Python you are using:

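1. In Python 2, str is a byte string and source files are assumed to be ASCII, so non-ASCII characters in the source must be declared explicitly with an encoding header such as the one below. Note that this header only governs literals in the source file; data read from a file still has to be decoded.

# -*- coding: utf-8 -*-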

2. In Python 3, the support is native: str holds Unicode text and source files are treated as UTF-8 by default.

So in Python 3, UTF-8 is already supported without any extra declaration.
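As a rough illustration of the Python 3 case, the sketch below writes the sample text to a temporary file and reads it back with no decode call. The temporary path and variable names are invented for the example. Passing encoding="utf-8" explicitly is a precaution, since the default that open uses can depend on the platform and locale.

```python
import os
import tempfile

text = "áéíóúabcdefgçë"   # in Python 3 a str literal is already Unicode text

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test")
    # Text mode encodes on write and decodes on read.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(text + "\n")
    with open(path, encoding="utf-8") as fp:
        round_trip = fp.read().rstrip("\n")

print(len(round_trip))        # 14
print(round_trip == text)     # True
```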
