I'm creating a program that will read .rtf files. .rtf files are encoded in ASCII, but represent non-ASCII characters with an escape sequence followed by two numbers representing a UTF-16 double-byte. For example, "これは日本語。" is represented as "\\'82\\'b1\\'82\\'ea\\'82\\'cd\\'93\\'fa\\'96\\'7b\\'8c\\'ea\\'81\\'42".
For the purposes of my program, the code page is always "cpg1252".
How do I convert the "\\'xx" sequences to a UTF-8 string? I tried playing around with the codecs, but all I got was gibberish.
You appear to have Shift-JIS data inside code-page escapes; you can extract the escaped bytes and decode those:
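import re
from binascii import unhexlify

# Match a literal backslash, an apostrophe, and two hex digits, e.g. \'82
cp_escapes = re.compile(r"\\'([0-9a-fA-F]{2})")

def extract_cp_escapes(data):
    # Join the captured hex pairs and convert them to raw bytes
    return unhexlify(''.join(cp_escapes.findall(data)))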
Then decode the resulting bytes; Shift-JIS is code page 932 on Windows:
>>> text = r"\'82\'b1\'82\'ea\'82\'cd\'93\'fa\'96\'7b\'8c\'ea\'81\'42"
>>> extract_cp_escapes(text)
b'\x82\xb1\x82\xea\x82\xcd\x93\xfa\x96{\x8c\xea\x81B'
>>> print(extract_cp_escapes(text).decode('cp932'))
これは日本語。
Decoding gives you a Unicode string; you can then encode that to another codec such as UTF-8 if you need to.
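If the escapes are mixed in with ordinary ASCII text, one option is to decode each run of consecutive escapes in place and then encode the whole string as UTF-8. The following is only a sketch: the names `ESCAPE_RUN` and `decode_escape_runs` are made up for illustration, the sample string is invented, and it assumes every byte of a double-byte character is written as a `\'xx` escape, as in the example above.

```python
import re

# One or more consecutive \'xx escapes. Runs are decoded as a unit so that
# the two bytes of a double-byte character are never split apart.
ESCAPE_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")

def decode_escape_runs(data, encoding="cp932"):
    """Replace each run of \\'xx escapes with its decoded text (hypothetical helper)."""
    def _decode(match):
        # Strip the backslash-apostrophe prefixes, leaving only hex digits
        hex_digits = match.group(0).replace("\\'", "")
        return bytes.fromhex(hex_digits).decode(encoding)
    return ESCAPE_RUN.sub(_decode, data)

sample = r"abc \'82\'b1\'82\'ea def"   # invented sample input
decoded = decode_escape_runs(sample)    # text with the escapes replaced
utf8_bytes = decoded.encode("utf-8")    # UTF-8 encoded bytes
```

Text outside the escapes is passed through untouched, so RTF control words would still need to be handled separately.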
The exact code page used may also be recorded in the RTF document itself, but I am out of time to research that.
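As a starting point for that research: RTF headers normally carry an `\ansicpgN` control word for the document's default code page, and font table entries can carry `\fcharsetN`, where 128 denotes Shift-JIS. A document can therefore declare code page 1252 and still contain Shift-JIS bytes in text set in a Japanese font. The sketch below only scans for those two control words; the function name, the small charset table, and the sample header are assumptions for illustration, and it does not track which font applies to which run of text.

```python
import re

# Assumed mapping from a few RTF \fcharsetN values to Python codec names.
FCHARSET_TO_CODEC = {
    0: "cp1252",   # ANSI
    128: "cp932",  # Shift-JIS
    129: "cp949",  # Hangul
    134: "cp936",  # GB2312
    136: "cp950",  # Big5
}

def declared_codecs(rtf_text):
    # Document-wide code page from \ansicpgN, if present
    cpg = re.search(r"\\ansicpg(\d+)", rtf_text)
    doc_codec = "cp" + cpg.group(1) if cpg else None
    # Codecs implied by any \fcharsetN entries found in the text
    font_codecs = {
        FCHARSET_TO_CODEC[int(n)]
        for n in re.findall(r"\\fcharset(\d+)", rtf_text)
        if int(n) in FCHARSET_TO_CODEC
    }
    return doc_codec, font_codecs

# Invented header for illustration only
header = r"{\rtf1\ansi\ansicpg1252\deff0{\fonttbl{\f0\fnil\fcharset128 MS Mincho;}}}"
doc_codec, font_codecs = declared_codecs(header)
```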