How to replace invalid unicode characters in a string in Python?

As far as I know, Python's design assumes that a string contains only valid characters, but in my case the OS delivers path names with invalid encodings, and I have to deal with them. So I end up with strings that contain characters that are not valid Unicode.

To correct these problems I need to display these strings somehow. Unfortunately I cannot print them because they contain non-Unicode characters. Is there an elegant way to replace these characters, so that I at least get some idea of the content of the string?

My idea would be to process these strings character by character and check whether each stored character is actually valid Unicode. An invalid character would be replaced with a certain Unicode symbol. But how can I do this? Using codecs does not seem suitable for that purpose: I already have a string, returned by the operating system, and not a byte array. Converting a string to a byte array seems to involve decoding, which will of course fail in my case. So it seems that I'm stuck.

Do you have any tips on how I can create such a replacement string?

If you have a bytestring (undecoded data), use the 'replace' error handler. For example, if your data is (mostly) UTF-8 encoded, then you could use:

decoded_unicode = bytestring.decode('utf-8', 'replace')

and U+FFFD REPLACEMENT CHARACTER characters will be inserted for any bytes that can't be decoded.

If you wanted to use a different replacement character, it is easy enough to replace these afterwards:

decoded_unicode = decoded_unicode.replace(u'\ufffd', '#')

Demo:

>>> bytestring = 'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'
>>> bytestring.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 5: invalid start byte
>>> bytestring.decode('utf8', 'replace')
u'F\xf8\xf6\ufffdB\xe5r'
>>> print bytestring.decode('utf8', 'replace')
Føö�Bår
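The demo above is a Python 2 session. As a minimal sketch of the same idea under Python 3 (an assumption, since the original session is Python 2), where undecoded data is a bytes object, the helper below decodes with the 'replace' error handler and optionally swaps U+FFFD for another placeholder. The name decode_lossy and its placeholder parameter are hypothetical, chosen only for illustration.

```python
# Python 3 sketch: decode bytes with the 'replace' error handler.
# decode_lossy and its placeholder parameter are illustrative names only.

def decode_lossy(data, placeholder='\ufffd'):
    """Decode UTF-8 bytes, substituting `placeholder` for undecodable bytes."""
    text = data.decode('utf-8', 'replace')  # bad bytes become U+FFFD
    if placeholder != '\ufffd':
        # Swap the standard replacement character for the requested one.
        text = text.replace('\ufffd', placeholder)
    return text

raw = b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'  # same bytes as the demo above

# ascii() is used so the result can be shown on any terminal.
print(ascii(decode_lossy(raw)))       # 'F\xf8\xf6\ufffdB\xe5r'
print(ascii(decode_lossy(raw, '#')))  # 'F\xf8\xf6#B\xe5r'
```

Note that a genuine U+FFFD already present in the data would also be swapped when a custom placeholder is requested.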

Thanks for your comments. This way I was able to implement a better solution:

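import codecs

def check_string(s):  # hypothetical wrapper name; the original shows only the body
    try:
        codecs.encode(s, "utf-8")  # raises if s contains unencodable characters
        return (True, s, None)
    except UnicodeError as e:
        # Re-encode with 'replace' so unencodable characters become '?'.
        ret = codecs.decode(codecs.encode(s, "utf-8", "replace"), "utf-8")
        return (False, ret, e)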

Please share any improvements on that solution. Thank you!
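One possible refinement, sketched here for Python 3 only: when a file name cannot be decoded, Python 3 typically hands it back as a str in which each undecodable byte is represented by a lone surrogate in the range U+DC80 to U+DCFF (the 'surrogateescape' error handler). Under that assumption, and assuming the names are meant to be UTF-8, the string can be checked character by character and turned into a printable copy. The function names has_undecodable and printable_name are hypothetical.

```python
# Python 3 sketch. Assumes the name was decoded with 'surrogateescape'
# and that the underlying bytes are meant to be UTF-8.

def has_undecodable(name):
    """True if `name` contains a lone surrogate produced by surrogateescape."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)

def printable_name(name, encoding='utf-8'):
    """Return a printable copy of `name` with U+FFFD in place of bad bytes."""
    raw = name.encode(encoding, 'surrogateescape')  # back to the original bytes
    return raw.decode(encoding, 'replace')          # decode again, lossy this time

# Simulate a name the OS might hand back: 0xbb is not valid UTF-8 here.
name = b'F\xc3\xb8\xbbr'.decode('utf-8', 'surrogateescape')
print(has_undecodable(name))        # True
print(ascii(printable_name(name)))  # 'F\xf8\ufffdr'
```

Unlike encoding with 'replace', which turns each bad character into '?', this keeps the original bytes recoverable until the final lossy decode.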

You have not given an example, so I have used one of my own to answer your question.

x='This is a cat which looks good 😊'
print x
x.replace('😊','')

The output is:

This is a cat which looks good 😊
'This is a cat which looks good '

The right way to do it (at least in Python 2) is to use unicodedata.normalize:

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

Calling decode('utf-8', 'ignore') on a unicode string will just raise an exception: Python 2 first implicitly encodes the string to ASCII, which fails on any non-ASCII character.
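To make the effect of the normalize call concrete, here is a small Python 3 sketch (an assumption, since the answer targets Python 2). NFKD splits a character such as å into a base letter plus a combining mark; what 'ignore' then drops depends on the target encoding. The function name normalize_and_encode is hypothetical, and the 'ascii' variant is an illustration that does not appear in the answer above.

```python
import unicodedata

def normalize_and_encode(text, encoding='utf-8'):
    """Decompose `text` with NFKD, then encode, silently dropping
    any character the target encoding cannot represent."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode(encoding, 'ignore')

# With UTF-8 the combining ring (U+030A) survives as two bytes.
print(normalize_and_encode('B\xe5r'))           # b'Ba\xcc\x8ar'
# With ASCII the combining ring is dropped, leaving the base letter.
print(normalize_and_encode('B\xe5r', 'ascii'))  # b'Bar'
```

This drops characters instead of marking them, so it does not show where a problem character was.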
