Converting Unicode to ASCII in Python 3

I have tried a number of solutions and read many websites, but I cannot seem to solve this. I have a file that contains message objects. Each message has a 4-byte value that is the message type, a 4-byte value that is the length, and then the message data, which is ASCII in Unicode. When I print to the screen it looks like ASCII. When I redirect the output to a file I get Unicode, so something is not right with the way I am trying to decode all this. Here is the Python script:

import sys
import codecs
import encodings.idna
import unicodedata

def getHeader(fileObj):
    # Step back one byte so the 0x1D marker that was just read is part of the type field
    fileObj.seek(-1, 1)
    mstype_array = fileObj.read(4)
    mslen_array = fileObj.read(4)
    # Both header fields are 4-byte integers in the machine's native byte order
    mstype = int.from_bytes(mstype_array, byteorder=sys.byteorder)
    mslen = int.from_bytes(mslen_array, byteorder=sys.byteorder)
    return mstype, mslen

def getMessage(fileObj, count):
    # Return the raw payload bytes; decoding happens later
    data = fileObj.read(count)#.decode("utf-8", "strict")
    return data

def getFields(msg):
    msg = codecs.decode(msg, 'utf-8')
    fields = msg.split(';')
    return fields

mstype = 0
mslen = 0
with open('../putty.log', 'rb') as f:
    while True:
        byte = f.read(1)
        if not byte:
            break
        if byte == b'\x1D':
            mstype, mslen = getHeader(f)
            print (f"Msg Type: {mstype} Msg Len: {mslen}")
            msg = getMessage(f, mslen)
            print(f"Message: {codecs.decode(msg, 'utf-8')}")
            #print(type(msg))
            fields = getFields(msg)
            print("Fields:")
            for field in fields:
                print(field)
        else:
            print (f"Char read: {byte}  {hex(ord(byte))}")

You can use this link to get the file to decode.

It appears that sys.stdout behaves differently when writing to the console than when writing to a file. The manual ( https://docs.python.org/3/library/sys.html#sys.stdout ) says that this is expected, but only gives details for Windows.
In any case, you are writing Unicode to stdout (via print()), which is why you get Unicode in the file. You can avoid this by not decoding the message in getFields (so you could replace fields = getFields(msg) with fields = msg.split(b';')) and by writing to stdout using sys.stdout.buffer.write(field+b'\n').
There are apparently some issues mixing print() and sys.stdout.buffer.write(), so "Python 3: write binary to stdout respecting buffering" may be worth reading.

tl;dr - try writing the bytes without decoding to Unicode at all.
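
A minimal sketch of that bytes-only approach, assuming the payload uses ';' as the field separator as in the question. The helper name write_fields and its out parameter are hypothetical, introduced here only so the function can be pointed at sys.stdout.buffer or at any other binary stream:

```python
import sys


def write_fields(msg: bytes, out) -> None:
    """Split a raw message on b';' and write one field per line, as bytes."""
    for field in msg.split(b';'):
        out.write(field + b'\n')


if __name__ == "__main__":
    # Flush the text layer first so earlier print() output is not
    # reordered relative to the raw bytes written below.
    sys.stdout.flush()
    write_fields(b'alpha;beta;gamma', sys.stdout.buffer)
    sys.stdout.buffer.flush()
```

Because nothing is decoded, the bytes from the log file reach the output unchanged, whatever encoding sys.stdout would otherwise apply.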

In short, define a custom function and use it everywhere you were calling print.

import sys

def ascii_print(txt):
    sys.stdout.buffer.write(txt.encode('ascii', errors='backslashreplace'))

ASCII is a subset of UTF-8: ASCII characters are indistinguishable from the same characters encoded as UTF-8. Internally, all Python strings are raw Unicode. However, raw Unicode cannot be read in or written out; it must first be encoded to some encoding. On most systems the default encoding is UTF-8, which is the most common standard for encoding Unicode.
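
The subset relationship can be checked directly. This is only an illustration; the sample strings are made up and are not taken from the log file in the question:

```python
text = "Msg Type: 29"

# Pure-ASCII text produces identical bytes under both encodings.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)

# A non-ASCII character takes more than one byte in UTF-8
# and cannot be represented in ASCII at all.
accented = "é"
accented_utf8 = accented.encode("utf-8")
print(len(accented_utf8))
try:
    accented.encode("ascii")
except UnicodeEncodeError as exc:
    print(f"ASCII cannot encode it: {exc.reason}")
```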

If you want to write out using a different encoding, then you must specify that encoding. I'm assuming you need the ascii encoding for some reason.

Note that the documentation for print states:

Since printed arguments are converted to text strings, print() cannot be used with binary mode file objects. For these, use file.write(...) instead.

Now if you are redirecting stdout, you can call write() on sys.stdout directly. However, as the docs explain there:

To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

Therefore, rather than the line print(f"Message: {codecs.decode(msg, 'utf-8')}"), you might do:

ascii_msg = f"Message: {codecs.decode(msg, 'utf-8')}".encode('ascii')
sys.stdout.buffer.write(ascii_msg)

Note that I specifically called str.encode on the string and explicitly set the ascii encoding. Also note that I encoded the entire string (including the Message: prefix), not just the variable passed in (which still needs to be decoded). You then need to write that ASCII-encoded byte string directly to sys.stdout.buffer, as demonstrated on the second line.

The one issue with this is that it's possible that the input will contain some non-ASCII characters. As is, a UnicodeEncodeError would be raised and the program would crash. To avoid this, str.encode supports a few different options for handling errors:

Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error().

As the target output is plain text, 'backslashreplace' is probably the best way to maintain lossless output. However, 'ignore' would work too if you don't care about preserving the non-ASCII characters.

ascii_msg = f"Message: {codecs.decode(msg, 'utf-8')}".encode('ascii', errors='backslashreplace')
sys.stdout.buffer.write(ascii_msg)
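
To see how the error handlers listed above differ, here is a small comparison on a made-up sample string (the word "café" is only an example, not data from the question):

```python
sample = "café"

# Encode the same text to ASCII with each error handler and collect the results.
results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = sample.encode("ascii", errors=handler)
    print(f"{handler:>18}: {results[handler]!r}")
```

'ignore' drops the character, 'replace' substitutes a question mark, and the other two keep enough information to recover the original code point.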

And yes, you will need to do that for every string you send to print. It might make sense to define a custom print function, which keeps the code more readable:

def ascii_print(txt):
    sys.stdout.buffer.write(txt.encode('ascii', errors='backslashreplace'))

And then in your code you could just call that rather than print:

ascii_print(f"Message: {codecs.decode(msg, 'utf-8')}")
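
One difference from print() is worth keeping in mind: ascii_print as written above does not append a newline. A possible variant is sketched below; the end and out parameters are hypothetical additions for illustration and are not part of the original answer:

```python
import sys


def ascii_print(txt, end="\n", out=None):
    """Write txt as ASCII bytes, escaping anything outside the ASCII range."""
    if out is None:
        # Assumption: stdout is a regular text stream with a binary buffer.
        out = sys.stdout.buffer
    data = (txt + end).encode("ascii", errors="backslashreplace")
    out.write(data)
    # Flushing on every call keeps output ordered, at some cost in speed.
    out.flush()
```

Passing a binary file object as out makes the function easy to try without touching the real stdout.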
