Processing decimal escapes in a string

Question

I have a file of strings, one per line, in which non-ASCII characters have been escaped with decimal code points. One example line is:

mj\\195\\164ger

(The double backslashes are in the file exactly as printed.)

I would like to process this string to produce:

mjäger

Conventionally, Python uses hexadecimal escapes rather than decimal escapes (e.g., the above string would be written as mj\xc3\xa4ger), which Python can decode:

>>> by=b'mj\xc3\xa4ger'
>>> by.decode('utf-8')
'mjäger'

Python, however, doesn't recognize the decimal escapes directly.

I have written a method that correctly manipulates the strings to produce hexadecimal escapes, but these escapes are themselves escaped. How can I get Python to process these hexadecimal escapes to create the final string?

import re

hexconst=["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f"]
escapes=re.compile(r"\\[0-9]{3}")
def dec2hex(matchobj):
    dec=matchobj.group(0)
    dec=int(dec[1:])
    digit1=dec//16 #integer division
    digit2=dec%16 
    hex="\\x" + hexconst[digit1] + hexconst[digit2]
    return hex

line=r'mj\195\164ger'
print(escapes.sub(dec2hex,line)) #Outputs mj\xc3\xa4ger

What is the final step I'm missing to convert the output of the above from mj\xc3\xa4ger to mjäger? Thanks!
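For reference, the literal step asked about here (turning a string that contains backslash-x text into the decoded characters) can be sketched in Python 3 as below. This is only an illustration under the assumption that the line is otherwise pure ASCII; the function name unescape_hex is hypothetical. Note that the unicode_escape codec also interprets other escapes such as \n or \t if they occur in the data.

```python
def unescape_hex(escaped: str) -> str:
    """Turn literal \\xNN sequences into text, treating them as UTF-8 bytes.

    Assumes the input contains only ASCII characters (hypothetical helper).
    """
    # unicode_escape maps each \xNN to the code point U+00NN.
    as_codepoints = escaped.encode("ascii").decode("unicode_escape")
    # latin-1 maps code points 0-255 back to identical byte values.
    raw_bytes = as_codepoints.encode("latin-1")
    # Those bytes are the UTF-8 encoding of the intended text.
    return raw_bytes.decode("utf-8")


text = unescape_hex(r"mj\xc3\xa4ger")
print(text)
```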

Answer 1

It's much easier. re.sub() can take a callback function instead of a replacement string as an argument (the session below is Python 2):

>>> import re
>>> line=r'mj\195\164ger'
>>> def replace(match):
...     return chr(int(match.group(1)))

>>> regex = re.compile(r"\\(\d{1,3})")
>>> new = regex.sub(replace, line)
>>> new
'mj\xc3\xa4ger'
>>> print new
mjäger

In Python 3, strings are Unicode strings, so if you're working with encoded input (like UTF-8 encoded content), then you need to use the proper type, which is bytes:

>>> line = rb'mj\195\164ger'
>>> regex = re.compile(rb"\\(\d{1,3})")
>>> def replace(match):
...     return int(match.group(1)).to_bytes(1, byteorder="big")

>>> new = regex.sub(replace, line)
>>> new
b'mj\xc3\xa4ger'
>>> print(new.decode("utf-8"))
mjäger
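The bytes-based approach can be wrapped in a small helper for use on each line of the file. The sketch below is one possible arrangement, not part of the original answer: the names DECIMAL_ESCAPE, _to_byte and decode_decimal_escapes are hypothetical, and accepting either one or two backslashes before the digits is an assumption based on the question's remark that the file contains doubled backslashes. Values above 255 cannot be a single byte, so they are left untouched.

```python
import re

# One or two literal backslashes followed by one to three decimal digits.
# Tolerating the doubled backslash is an assumption about the input file.
DECIMAL_ESCAPE = re.compile(rb"\\{1,2}(\d{1,3})")


def _to_byte(match):
    """Replace a decimal escape with the single byte it denotes."""
    value = int(match.group(1))
    if value > 255:
        # Not a valid byte value: return the matched text unchanged.
        return match.group(0)
    return bytes([value])


def decode_decimal_escapes(line: bytes) -> str:
    """Expand decimal escapes in a bytes line, then decode it as UTF-8."""
    return DECIMAL_ESCAPE.sub(_to_byte, line).decode("utf-8")


print(decode_decimal_escapes(rb"mj\195\164ger"))
print(decode_decimal_escapes(rb"mj\\195\\164ger"))
```

To process a whole file, open it in binary mode ("rb"), strip the trailing newline from each line, and pass the line to decode_decimal_escapes.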

Answer 1: accepted 2014-01-23 06:58:39
