So I have a database with a lot of names, and the names contain badly encoded characters. For example, a name in a record is JosÃ© FlorÃ©s
I want to clean this up to get José Florés
I tried the following
name = " José Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))
The output mangles the last name to ' José Flor\xe9s '
What is the best way to solve this? The names can have any kind of unicode or hex escape sequences.
ftfy is a Python library that fixes Unicode text broken in various ways; its main entry point is a function named fix_text.
from ftfy import fix_text

def convert_iso_name_to_string(name):
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)

name = "JosÃ© FlorÃ©s"
assert convert_iso_name_to_string(name) == "José Florés"
Using the fix_text function, the names can be standardized, which is an alternate way to solve the problem.
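If installing ftfy is not an option, the most common form of this breakage (UTF-8 bytes that were decoded as Latin-1) can often be reversed with the standard library alone. The helper below, fix_mojibake, is a hypothetical name for illustration, not part of ftfy: it re-encodes the text to Latin-1 to recover the original bytes, then decodes them as UTF-8, leaving the text untouched if either step fails.

```python
def fix_mojibake(text):
    """Repair UTF-8 text that was mistakenly decoded as Latin-1.

    Hypothetical stdlib-only helper (not part of ftfy): re-encoding to
    Latin-1 recovers the original byte sequence, which is then decoded
    with the correct codec. If either step fails, the input was not
    this kind of mojibake, so it is returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_mojibake('JosÃ© FlorÃ©s'))  # José Florés
print(fix_mojibake('José Florés'))    # already clean, left unchanged
```

Note that this only covers the Latin-1/UTF-8 round trip; ftfy's fix_text handles many more kinds of breakage, which is why it is the safer choice when the corruption is mixed or unknown.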
We'll start with an example string containing a non-ASCII character (i.e., "é", an e with an acute accent):
s = 'Florés'
Now if we reference and print the string, it gives us essentially the same result:
>>> s
'Florés'
>>> print(s)
Florés
In contrast to Python 2.x, where the same literal would produce a byte string, here s is already a Unicode string: all strings in Python 3.x are Unicode by default. The visible difference is that s wasn't changed after we instantiated it.
You can find more on this in Encoding and Decoding Strings.
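To see what encoding and decoding actually do, it helps to inspect the bytes directly. A minimal sketch: 'é' becomes the two-byte UTF-8 sequence 0xC3 0xA9, and decoding those bytes with the wrong codec (Latin-1) produces exactly the kind of mojibake the question describes.

```python
s = 'Florés'

# Encoding turns the Unicode string into bytes; 'é' becomes the
# two-byte UTF-8 sequence 0xC3 0xA9.
b = s.encode('utf-8')
print(b)                    # b'Flor\xc3\xa9s'

# Decoding with the matching codec round-trips cleanly...
print(b.decode('utf-8'))    # Florés

# ...while decoding the same bytes as Latin-1 maps each byte to one
# character, producing the classic mojibake.
print(b.decode('latin-1'))  # FlorÃ©s
```

This asymmetry is why the question's encode('iso-8859-1')/decode('utf8') attempt can only repair text that was broken in exactly the opposite direction.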