[英]python replace unicode characters
I wrote a program to read in Windows DNS debugging log, but inside always got some funny characters in the domain field. 我写了一个程序来读取Windows DNS调试日志,但内部总是在域字段中有一些有趣的字符。
Below is one of the example: 以下是其中一个示例:
(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
I want to replace all the \\x..
with a ?
我想,以取代所有
\\x..
用?
I explicitly type \\xc2 as follows works 我明确地输入\\ xc2如下工作
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'
But its not working if I write as follow: 但如果我写如下,它就无法工作:
re.sub('\\\\\\x..', '?', line)
How I can write a regular expression to replace them all? 我如何编写正则表达式来替换它们?
There are better tools for this job than regex, you could try for example: 这个工作有比正则表达式更好的工具,你可以尝试例如:
>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'
That skips non-ascii characters. 这会跳过非ascii字符。 Or with replace, you can swap them for a '?'
或者使用替换,您可以将它们换成'?' placeholder:
占位符:
>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
But the best solution is to find out what erroneous encoding/decoding caused the mojibake to happen in the first place, so you can recover data by using the correct code pages. 但最好的解决方案是找出错误的编码/解码首先导致mojibake发生的情况,这样您就可以使用正确的代码页来恢复数据。
There is an excellent answer about unbaking emojibake here . 有一个关于unbaking emojibake一个优秀的答案在这里 。 Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.
请注意,这是一个不精确的科学,很多关键信息实际上都在该答案下的评论主题中。
what about this? 那这个呢?
line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
pattern = r'\\x.+'
re.sub(pattern, r'?', line)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.