
Converting Unicode to ASCII in Python 3

I have tried a number of solutions and read many websites, but I cannot seem to solve this. I have a file that contains message objects. Each message has a 4-byte value that is the message type, a 4-byte value that is the length, and then the message data, which is ASCII text. When I print to the screen it looks like ASCII, but when I redirect the output to a file I get Unicode, so something is not right with the way I am trying to decode all this. Here is the Python script:

import sys
import codecs
import encodings.idna
import unicodedata

def getHeader(fileObj):
    mstype_array = bytearray(4)
    mslen_array = bytearray(4)
    mstype = 0
    mslen = 0
    fileObj.seek(-1, 1)
    mstype_array = fileObj.read(4)
    mslen_array = fileObj.read(4)
    mstype = int.from_bytes(mstype_array, byteorder=sys.byteorder)
    mslen = int.from_bytes(mslen_array, byteorder=sys.byteorder)
    return mstype,mslen

def getMessage(fileObj, count):
    str = fileObj.read(count)#.decode("utf-8", "strict")
    return str

def getFields(msg):
    msg = codecs.decode(msg, 'utf-8')
    fields = msg.split(';')
    return fields

mstype = 0
mslen = 0
with open('../putty.log', 'rb') as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        if byte == b'\x1D':
            mstype, mslen = getHeader(f)
            print (f"Msg Type: {mstype} Msg Len: {mslen}")
            msg = getMessage(f, mslen)
            print(f"Message: {codecs.decode(msg, 'utf-8')}")
            #print(type(msg))
            fields = getFields(msg)
            print("Fields:")
            for field in fields:
                print(field)
        else:
            print (f"Char read: {byte}  {hex(ord(byte))}")

You can use this link to get the file to decode.

It appears that sys.stdout is behaving differently when writing to the console vs writing to a file. The manual ( https://docs.python.org/3/library/sys.html#sys.stdout ) says that this is expected, but only gives details for Windows.
In any case, you are writing Unicode to stdout (via print()), which is why you get Unicode in the file. You can avoid this by not decoding the message in getFields (so you could replace fields = getFields(msg) with fields = msg.split(b';')) and by writing to stdout with sys.stdout.buffer.write(field + b'\n').
There are apparently some issues mixing print() and sys.stdout.buffer.write() , so Python 3: write binary to stdout respecting buffering may be worth reading.

tl;dr - try writing the bytes without decoding to unicode at all.
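A rough sketch of that bytes-only approach (print_fields_raw is just an illustrative helper name; msg is assumed to be the bytes object returned by getMessage, never decoded):

import sys

def print_fields_raw(msg):
    # msg is still bytes; split on the byte delimiter rather than a str
    fields = msg.split(b';')
    sys.stdout.buffer.write(b"Fields:\n")
    for field in fields:
        # write the raw bytes plus a newline; no decode step anywhere
        sys.stdout.buffer.write(field + b'\n')
    # flush so these raw writes do not interleave badly with buffered print() output
    sys.stdout.buffer.flush()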

In short, define a custom function and use it everywhere you were calling print .

import sys

def ascii_print(txt):
    sys.stdout.buffer.write(txt.encode('ascii', errors='backslashreplace'))

ASCII is a subset of utf-8: ASCII characters are indistinguishable from the same characters encoded as utf-8. Internally, all Python strings are Unicode. However, Unicode text cannot be read in or written out directly; it must be encoded to some byte encoding first. By default, on most systems that encoding is utf-8, which is the most common standard for encoding Unicode.
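You can see this with an arbitrary example (not data from your file):

data = b'MsgType;42;hello'                                 # arbitrary all-ASCII bytes
assert data.decode('ascii') == data.decode('utf-8')        # same text either way
assert 'hello'.encode('ascii') == 'hello'.encode('utf-8')  # and the same bytes going back out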

If you want to write out using a different encoding, then you must specify that encoding. I'm assuming you need the ascii encoding for some reason.

Note that the documentation for print states:

Since printed arguments are converted to text strings, print() cannot be used with binary mode file objects. For these, use file.write(...) instead.

Now, if you are redirecting stdout, you can call write() on sys.stdout directly. However, as the docs explain there:

To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout , use sys.stdout.buffer.write(b'abc') .

Therefore, rather than the line print(f"Message: {codecs.decode(msg, 'utf-8')}") , you might do:

ascii_msg = f"Message: {codecs.decode(msg, 'utf-8')}".encode('ascii')
sys.stdout.buffer.write(ascii_msg)

Note that I specifically called str.encode on the string and explicitly set the ascii encoding. Also note that I encoded the entire string (including the Message: prefix), not just the variable passed in (which still needs to be decoded first). You then write that ASCII-encoded byte string directly to sys.stdout.buffer, as the second line demonstrates.

The one issue with this is that it's possible the input will contain some non-ASCII characters. As is, a UnicodeEncodeError would be raised and the program would crash. To avoid this, str.encode supports a few different options for handling errors:

Other possible values are 'ignore' , 'replace' , 'xmlcharrefreplace' , 'backslashreplace' and any other name registered via codecs.register_error() .

As the target output is plain text, 'backslashreplace' is probably the best way to maintain lossless output. However, 'ignore' would work too if you don't care about preserving the non-ASCII characters.
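For example, with a hypothetical string containing one non-ASCII character:

text = 'Temp: 25°C'   # made-up input; ° is not an ASCII character
print(text.encode('ascii', errors='backslashreplace'))   # b'Temp: 25\\xb0C' -- escaped, nothing lost
print(text.encode('ascii', errors='ignore'))             # b'Temp: 25C' -- the character is silently dropped
# text.encode('ascii') with no errors argument would raise UnicodeEncodeError

Applying 'backslashreplace' to the Message line from above gives: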

ascii_msg = f"Message: {codecs.decode(msg, 'utf-8')}".encode('ascii', errors='backslashreplace')
sys.stdout.buffer.write(ascii_msg)

And yes, you will need to do that for every string you send to print . It might make sense to define a custom print function which keeps the code more readable:

def ascii_print(txt):
    sys.stdout.buffer.write(txt.encode('ascii', errors='backslashreplace'))

And then in your code you could just call that rather than print :

ascii_print(f"Message: {codecs.decode(msg, 'utf-8')}")
