Process decimal escape in string

I have a file of strings, one per line, in which non-ASCII characters have been escaped as decimal byte values (the UTF-8 bytes of each character). One example line is:

mj\\195\\164ger

(The double backslashes are in the file exactly as printed.)

I would like to process this string to produce:

mjäger

Conventionally, Python uses hexadecimal escapes rather than decimal escapes (e.g., the above string would be written as mj\xc3\xa4ger), which Python can decode:

>>> by=b'mj\xc3\xa4ger'
>>> by.decode('utf-8')
'mjäger'

Python, however, doesn't recognize the decimal escape right away.

I have written a method that correctly manipulates the strings to produce hexadecimal escapes, but these escapes are themselves escaped. How can I get Python to process these hexadecimal escapes to create the final string?

import re

hexconst=["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f"]
escapes=re.compile(r"\\[0-9]{3}")
def dec2hex(matchobj):
    dec=matchobj.group(0)
    dec=int(dec[1:])
    digit1=dec//16 #integer division
    digit2=dec%16 
    hex="\\x" + hexconst[digit1] + hexconst[digit2]
    return hex

line=r'mj\195\164ger'
print(escapes.sub(dec2hex,line)) #Outputs mj\xc3\xa4ger

What is the final step I'm missing to convert the output of the above from mj\xc3\xa4ger to mjäger? Thanks!

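It's much easier than that. re.sub() can take a callback function instead of a replacement string as an argument, so each decimal escape can be converted straight to the byte it stands for. The session below is Python 2, where chr() returns a one-byte string: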

>>> import re
>>> line=r'mj\195\164ger'
>>> def replace(match):
...     return chr(int(match.group(1)))

>>> regex = re.compile(r"\\(\d{1,3})")
>>> new = regex.sub(replace, line)
>>> new
'mj\xc3\xa4ger'
>>> print new
mjäger

In Python 3, strings are Unicode strings, so if you're working with encoded input (such as UTF-8 encoded content), you need to use the proper type, which is bytes:

>>> line = rb'mj\195\164ger'
>>> regex = re.compile(rb"\\(\d{1,3})")
>>> def replace(match):
...     return int(match.group(1)).to_bytes(1, byteorder="big")

>>> new = regex.sub(replace, line)
>>> new
b'mj\xc3\xa4ger'
>>> print(new.decode("utf-8"))
mjäger
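
For lines read from a file as text, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions that are not in the original answer: the name decode_decimal_escapes is hypothetical, the pattern uses \\+ so that it also absorbs the doubled backslashes the question says are in the file, and numbers above 255 are left untouched because they cannot be a single byte.

```python
import re

# Assumption: one or more backslashes followed by 1-3 decimal digits, e.g. \195 or \\195.
DECIMAL_ESCAPE = re.compile(rb"\\+(\d{1,3})")


def _to_byte(match):
    """Return the single byte named by a decimal escape."""
    value = int(match.group(1))
    if value > 255:
        # Not a valid byte value, so keep the original text unchanged.
        return match.group(0)
    return bytes([value])


def decode_decimal_escapes(line, encoding="utf-8"):
    """Hypothetical helper: replace decimal escapes in a text line and decode the result."""
    raw = line.encode(encoding)
    # A callback's return value is inserted literally, with no further escape processing.
    return DECIMAL_ESCAPE.sub(_to_byte, raw).decode(encoding)


print(decode_decimal_escapes(r"mj\195\164ger"))
print(decode_decimal_escapes(r"mj\\195\\164ger"))
```

If the escaped bytes do not form valid UTF-8, the final decode() raises UnicodeDecodeError, which is probably the behaviour you want for spotting damaged lines.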
