简体   繁体   English

python替换unicode字符

[英]python replace unicode characters

I wrote a program to read in Windows DNS debugging log, but inside always got some funny characters in the domain field. 我写了一个程序来读取Windows DNS调试日志,但内部总是在域字段中有一些有趣的字符。

Below is one of the example: 以下是其中一个示例:

(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

I want to replace all the \\x.. with a ? 我想,以取代所有\\x..?

I explicitly type \\xc2 as follows works 我明确地输入\\ xc2如下工作

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
re.sub('\\\xc2', '?', line)
result: '(13)?\xb5?\xb1?\xbe\xc3\xa2p\xc3\xb4?\x8d(5)example(3)com(0)'

But its not working if I write as follow: 但如果我写如下,它就无法工作:

re.sub('\\\\\\x..', '?', line)

How I can write a regular expression to replace them all? 我如何编写正则表达式来替换它们?

There are better tools for this job than regex, you could try for example: 这个工作有比正则表达式更好的工具,你可以尝试例如:

>>> line
'(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'
>>> line.decode('ascii', 'ignore')
u'(13)p(5)example(3)com(0)'

That skips non-ascii characters. 这会跳过非ascii字符。 Or with replace, you can swap them for a '?' 或者使用替换,您可以将它们换成'?' placeholder: 占位符:

>>> print line.decode('ascii', 'replace')
(13)��������p����(5)example(3)com(0)

But the best solution is to find out what erroneous encoding/decoding caused the mojibake to happen in the first place, so you can recover data by using the correct code pages. 但最好的解决方案是找出错误的编码/解码首先导致mojibake发生的情况,这样您就可以使用正确的代码页来恢复数据。

There is an excellent answer about unbaking emojibake here . 有一个关于unbaking emojibake一个优秀的答案在这里 Note that it's an inexact science, and a lot of the crucial information is actually in the comment thread under that answer. 请注意,这是一个不精确的科学,很多关键信息实际上都在该答案下的评论主题中。

what about this? 那这个呢?

line = '(13)\xc2\xb5\xc2\xb1\xc2\xbe\xc3\xa2p\xc3\xb4\xc2\x8d(5)example(3)com(0)'

pattern = r'\\x.+'
re.sub(pattern, r'?', line)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM