Decode bad escape characters in Python

I have a database with a lot of names, and some of the names contain bad characters. For example, a name in one record is JosÃ© Florés, and I want to clean it to get José Florés.

I tried the following:

name = "    JosÃ©     Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))

The output mangles the last name, giving ' José Flor\xe9s '.

What is the best way to solve this? The names can contain any kind of Unicode or hex escape sequences.

ftfy is a Python library that fixes Unicode text broken in different ways, using a function named fix_text.

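from ftfy import fix_text

def convert_iso_name_to_string(name):
    # Fix each whitespace-separated word on its own, so a correctly
    # encoded word cannot interfere with repairing a garbled one.
    result = []

    for word in name.split():
        result.append(fix_text(word))
    # split() already dropped the extra whitespace, so rejoin with single spaces.
    return ' '.join(result)

name = "JosÃ© Florés"
assert convert_iso_name_to_string(name) == "José Florés"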

With fix_text the names can be standardized word by word, which is an alternative way to solve the problem.
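If adding a dependency is not an option, the same per-word idea can be sketched with the standard library alone. This is only a rough sketch: the helper names repair_word and repair_name are made up for illustration, and it assumes the damage is specifically UTF-8 bytes that were decoded as ISO-8859-1. Other kinds of breakage (which ftfy is designed to handle) are left untouched.

```python
def repair_word(word):
    """Undo 'UTF-8 read as ISO-8859-1' damage for one word.

    Hypothetical helper: returns the word unchanged when the
    round trip does not apply to it.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return word.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the word has characters outside ISO-8859-1, or its
        # bytes are not valid UTF-8 (it was probably fine already).
        return word


def repair_name(name):
    # Hypothetical helper: repair each word independently and
    # normalize whitespace to single spaces.
    return " ".join(repair_word(word) for word in name.split())


# "Jos\u00c3\u00a9" displays as the garbled "JosÃ©";
# "Flor\u00e9s" is the correctly encoded "Florés".
name = "    Jos\u00c3\u00a9     Flor\u00e9s "
print(repair_name(name))
```

Handling each word separately matters here because the correctly encoded last name is not valid UTF-8 once encoded to ISO-8859-1, so a whole-string round trip fails on it; with the per-word fallback it is simply left as it is.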

We'll start with an example string containing a non-ASCII character (here, “é”, an e with an acute accent):

s = 'Florés'

Now if we reference and print the string, it gives us essentially the same result:

>>> s
'Florés'
>>> print(s)
Florés

In contrast to the same string s in Python 2.x, in this case s is already a Unicode string, and all strings in Python 3.x are automatically Unicode. The visible difference is that s wasn't changed after we instantiated it.
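To connect this to the question, here is a small illustration (a sketch, with variable names chosen only for this example) of how a Python 3 string turns into garbled text when its UTF-8 bytes are decoded as ISO-8859-1, how that can be reversed, and why applying the same round trip to an already correct string produces the \xe9 escape seen in the question.

```python
s = "Flor\u00e9s"  # the correctly encoded 'Florés'

# Encoding to UTF-8 turns the single character é into two bytes.
utf8_bytes = s.encode("utf-8")

# Decoding those bytes with the wrong codec yields 'FlorÃ©s'.
garbled = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong step recovers the original string.
restored = garbled.encode("iso-8859-1").decode("utf-8")

# Applying the same round trip to the already correct string fails:
# the lone byte 0xE9 is not valid UTF-8, so backslashreplace
# substitutes the literal text \xe9 for it.
mangled = s.encode("iso-8859-1").decode("utf-8", errors="backslashreplace")

print(utf8_bytes)
print(garbled)
print(restored)
print(mangled)
```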

The same explanation can be found in "Encoding and Decoding Strings".
