Python: replace Unicode characters

Question

I wrote a program to read a Windows DNS debugging log, but the domain field always contains some strange characters.

Below is one example:

'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

I want to replace every \x.. sequence with a ?

If I type \xc2 explicitly, as follows, it works:

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But it does not work if I write it as follows:

re.sub('\\\\\\x..', '?', line)

How can I write a regular expression to replace them all?

Answer 1

There are better tools for this job than regex. For example, you could try:

>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'

That skips non-ASCII characters. Alternatively, with 'replace' you can swap each one for a placeholder, the Unicode replacement character U+FFFD (displayed as �):

>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)
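
The transcripts above are Python 2, where a plain string is a byte string. As a minimal sketch of the same idea in Python 3 (an assumption, since the original post uses Python 2), the line has to be a `bytes` object. Re-encoding the 'replace' result to ASCII turns each U+FFFD into a literal ?, which is what the question asks for:

```python
# Python 3 sketch: the log line is assumed to be available as raw bytes.
line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# 'ignore' drops every byte that is not valid ASCII.
skipped = line.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD for each such byte; encoding back to ASCII
# with 'replace' then turns each U+FFFD into a plain '?'.
marked = line.decode('ascii', 'replace')
question_marks = marked.encode('ascii', 'replace').decode('ascii')

print(skipped)
print(question_marks)
```

Both variables hold ordinary text strings; only `question_marks` keeps one placeholder per discarded byte.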

But the best solution is to find out which erroneous encoding/decoding step caused the mojibake in the first place, so that you can recover the data by using the correct code pages.

There is an excellent answer about unbaking mojibake here. Note that it is an inexact science, and a lot of the crucial information is actually in the comment thread under that answer.
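
As an illustration of what recovering the data might look like, here is a Python 3 sketch that undoes one layer of mis-decoding. It assumes, without any confirmation from the log itself, that the original bytes were wrongly read as Latin-1 and then written out as UTF-8. The real cause may well be different.

```python
# Python 3 sketch; the Latin-1/UTF-8 mix-up is an assumption, not a diagnosis.
label = b'\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d'

# Undo the assumed mistake: read the bytes as UTF-8, then map each
# resulting character back to the single byte it came from.
recovered = label.decode('utf-8').encode('latin-1')

print(recovered)
```

Even if that assumption holds, the recovered bytes would still need to be decoded with whichever code page the DNS client actually used, and the log does not state that.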

Answer 2

What about this?

import re

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# Matches a literal backslash, an 'x', and then the rest of the line.
pattern = r'\\x.+'
re.sub(pattern, r'?', line)
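
The failing pattern in the question and the one above both look for a literal backslash followed by x, but the string holds the raw bytes themselves. `\xc2` is only how Python displays a byte, so there is no backslash in the data to match. If a regular expression is still preferred, a character class covering the non-ASCII byte range matches those bytes directly. A minimal sketch in Python 3 (an assumption; the original post uses Python 2), with the line held as `bytes`:

```python
import re

line = b'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

# Match any single byte outside the ASCII range (0x80-0xFF) and
# replace each one with '?'.
cleaned = re.sub(rb'[\x80-\xff]', b'?', line)

print(cleaned)
```

This replaces byte by byte, so a multi-byte character becomes several question marks.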

2 answers

Answer 1
Score 3, accepted, 2016-09-28 15:32:04

Answer 2
Score -2, 2016-09-28 15:46:44
