Replacing special characters in pandas dataframe

So, I have this huge DataFrame, which is encoded in iso8859_15.

I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".

I have the replacement mapping in a dictionary: {'í':'i', 'á':'a', ...}

I tried replacing it a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex and inplace

Also:

df.update(pd.Series(dic))

None of them had the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".

Help?

The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, and for each column you provide a second dictionary with the substitution pairs.

So, this should work:

>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit: It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably the character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file's characters properly (as true Unicode code points), make sure your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are byte strings, in which each accented character may be stored as several bytes. Which bytes they are depends on the character encoding of the source file, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those characters: the byte-string keys never match the Unicode text in the DataFrame. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
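The str/bytes distinction described above can be checked directly in Python 3. This is only a small sketch using the standard library; the variable names are made up for illustration.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
text = 'í'

utf8_bytes = text.encode('utf-8')          # two bytes in UTF-8
latin9_bytes = text.encode('iso8859_15')   # one byte in ISO-8859-15
print(utf8_bytes, latin9_bytes)

# Decoding with the right codec gives back the same one-character string.
print(utf8_bytes.decode('utf-8') == text)

# Decoding UTF-8 bytes with the wrong codec yields two unrelated characters,
# which is why a mismatched dictionary key never matches the data.
mismatched = utf8_bytes.decode('iso8859_15')
print(len(mismatched), mismatched == text)
```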

In Python 3, replace works out of the box without specifying a column.

Load Data:

df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes

Create Dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
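For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same steps. The in-memory bytes are a hypothetical stand-in for test.csv, and the result is assigned to a new variable instead of using inplace=True.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: the same rows, encoded as ISO-8859-15.
raw = 'col1,col2\nhe,hello\nNícolas,shárk\nwelcome,yes\n'.encode('iso8859_15')

# Decode with the matching encoding so the cells hold real Unicode text.
df = pd.read_csv(io.BytesIO(raw), encoding='iso8859_15')

dictionary = {'í': 'i', 'á': 'a'}

# regex=True makes each key match inside the cell text, not only whole cells.
cleaned = df.replace(dictionary, regex=True)
print(cleaned)
```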

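If you get the following error message:

multiple repeat at position 2

then one of the dictionary keys probably contains regex metacharacters (such as * or +) that do not form a valid pattern. Try this:

df.replace(dictionary, regex=False, inplace=True)

instead of:

df.replace(dictionary, regex=True, inplace=True)

Note that with regex=False the keys are compared against whole cell values, so characters inside a longer string are not replaced.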
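If substring replacement is still needed, one possible alternative is to escape the keys with re.escape and keep regex=True. This is a sketch only; the '**' key and the sample data are hypothetical, chosen to trigger a pattern error.

```python
import re

import pandas as pd

# '**' is a hypothetical key containing regex metacharacters.
dictionary = {'**': '', 'í': 'i'}

# Used as a raw pattern, '**' is not a valid regular expression.
try:
    re.compile('**')
    compiled = True
except re.error as exc:
    compiled = False
    print('invalid pattern:', exc)

# Escaping every key turns it into a literal-text pattern.
escaped = {re.escape(key): value for key, value in dictionary.items()}

df = pd.DataFrame({'col1': ['**Nícolas**', 'he']})
cleaned = df.replace(escaped, regex=True)
print(cleaned)
```

The escaped keys still match inside longer strings, which regex=False would not do.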

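To replace | with , in a single column (here my_specific_column), pass regex=False so that | is treated as a literal character rather than as regex alternation:

df.my_specific_column = df.my_specific_column.str.replace('|', ',', regex=False)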
