
How do you convert a string representation of a UTF-16 byte sequence to UTF-8 in Python?

I'm creating a program that will read .rtf files. .rtf files are encoded in ASCII, but represent non-ASCII characters with an escape sequence followed by two numbers representing a UTF-16 double-byte. For example, "これは日本語。" is represented as "\'82\'b1\'82\'ea\'82\'cd\'93\'fa\'96\'7b\'8c\'ea\'81\'42".

For the purposes of my program, the code page is always "cpg1252".

How do I convert the "\'xx" sequences to a UTF-8 string? I tried playing around with the codecs, but all I got was gibberish.

You appear to have Shift-JIS data inside code-page escapes, not UTF-16. Each \'xx escape is a single byte written as two hex digits, so you can extract the marked-up bytes and decode those:

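import re
from binascii import unhexlify

# Match a literal backslash, an apostrophe, and two hex digits, e.g. \'82
cp_escapes = re.compile(r"\\'([0-9a-fA-F]{2})")

def extract_cp_escapes(data):
    # Join the captured hex pairs and turn them into raw bytes
    return unhexlify(''.join(cp_escapes.findall(data)))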

Then decode the bytes; Shift-JIS is code page 932 on Windows:

>>> text = r"\'82\'b1\'82\'ea\'82\'cd\'93\'fa\'96\'7b\'8c\'ea\'81\'42"
>>> extract_cp_escapes(text)
b'\x82\xb1\x82\xea\x82\xcd\x93\xfa\x96{\x8c\xea\x81B'
>>> print(extract_cp_escapes(text).decode('cp932'))
これは日本語。

You can then encode the decoded text to another codec such as UTF-8 if you need to.
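As a rough sketch of the whole round trip (escapes to bytes, bytes to text, text to UTF-8), something like the following should work on Python 3. The names `CP_ESCAPE` and `escapes_to_utf8` are made up for this illustration, and the `cp932` default is an assumption based on the sample in the question, not something read from the file:

```python
import re
from binascii import unhexlify

# Matches a literal backslash, an apostrophe, and two hex digits (e.g. \'82).
CP_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")

def escapes_to_utf8(data, codepage="cp932"):
    """Hypothetical helper: turn \\'xx escapes into UTF-8 encoded bytes.

    `codepage` is assumed to be the encoding of the escaped bytes.
    """
    raw = unhexlify("".join(CP_ESCAPE.findall(data)))  # hex pairs -> raw bytes
    text = raw.decode(codepage)                        # raw bytes -> str
    return text.encode("utf-8")                        # str -> UTF-8 bytes

sample = r"\'82\'b1\'82\'ea\'82\'cd\'93\'fa\'96\'7b\'8c\'ea\'81\'42"
utf8_bytes = escapes_to_utf8(sample)
print(utf8_bytes.decode("utf-8"))
```

Note that this only collects the escapes and ignores any plain ASCII text between them, so it is only suitable for runs that consist entirely of escapes, like the sample.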

The exact code page used may also be recorded in the RTF document itself, but I am out of time to research that.
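For what it's worth, RTF headers can carry an `\ansicpgN` control word naming the document's default ANSI code page. Below is a minimal sketch of reading it; `declared_codec` is a hypothetical helper, and mapping the number to a Python codec named `cp<N>` is an assumption that will not hold for every code page. Individual fonts can also declare their own character set (via `\fcharset`), so the document-wide value may not match the bytes in a given run. That would explain a file that declares 1252 but contains Shift-JIS text.

```python
import re

# \ansicpg followed by the decimal code page number, e.g. \ansicpg1252
ANSICPG = re.compile(r"\\ansicpg(\d+)")

def declared_codec(rtf_source, default="cp1252"):
    """Hypothetical helper: guess a Python codec name from the RTF header.

    Assumes the code page number maps to a codec called 'cp<N>'.
    """
    match = ANSICPG.search(rtf_source)
    if match is None:
        return default
    return "cp" + match.group(1)

header = r"{\rtf1\ansi\ansicpg932\deff0 ...}"
print(declared_codec(header))
```

Treat the result as a starting guess and fall back to trying likely codecs if decoding raises `UnicodeDecodeError` or produces gibberish.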
