So, I have this huge DF, which is encoded in iso8859_15.
I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".
I have the key to replace them in a dictionary {'í':'i', 'á':'a', ...}
I tried replacing it a couple of ways (below), but none of them worked.
df.replace(dictionary, regex=True, inplace=True) ###BOTH WITH AND WITHOUT REGEX AND REPLACE
Also:
df.update(pd.Series(dic))
None of them had the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".
Help?
The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, for which you provide a second dictionary with the substitution pairs.
So, this should work:
>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
a b
0 NÍCOLAS 3
1 asdč 4
>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
a b
0 NICOLAS 3
1 asdc 4
Edit: It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:
dictionary = {u'í': 'i', u'á': 'a'}
If you have a definition like this (and using Python 2):
dictionary = {'í': 'i', 'á': 'a'}
then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the actual source file character encoding used, but presuming you use UTF-8, you'll get:
dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}
And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.
On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
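To see concretely what those byte-string keys look like, here is a small Python 3 sketch (the variable names are only illustrative). It encodes the accented character by hand, which is roughly what a Python 2 source file saved as UTF-8 does implicitly to a plain 'í' literal:

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
key = "í"

utf8_key = key.encode("utf-8")          # two bytes in UTF-8
latin9_key = key.encode("iso8859_15")   # one byte in ISO 8859-15

print(utf8_key)    # b'\xc3\xad'
print(latin9_key)  # b'\xed'

# Decoding the ISO 8859-15 byte gives back the original character,
# which is the step the encoding= argument of read_csv performs on the file.
assert latin9_key.decode("iso8859_15") == key
assert len(utf8_key) == 2 and len(latin9_key) == 1
```

A dictionary keyed on the two-byte sequence can never match the single decoded character in the DataFrame, which is why the u prefix matters in Python 2.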
replace works out of the box in Python 3, without specifying a particular column.
Load Data:
df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df
Result:
col1 col2
0 he hello
1 Nícolas shárk
2 welcome yes
Create Dictionary:
dictionary = {'í':'i', 'á':'a'}
Replace:
df.replace(dictionary, regex=True, inplace=True)
Result:
col1 col2
0 he hello
1 Nicolas shark
2 welcome yes
If you get the following error message:
multiple repeat at position 2
try df.replace(dictionary, regex=False, inplace=True)
instead of df.replace(dictionary, regex=True, inplace=True).
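One caveat: with regex=False, replace only swaps cells whose whole value equals a key, so it no longer replaces characters inside longer strings. If the error comes from a key that contains regex metacharacters, an alternative is to escape the keys and keep regex=True. A minimal sketch, where the data and the mapping are made up for illustration:

```python
import re
import pandas as pd

# Hypothetical data and mapping; "a**" is not a valid regular expression
# because of the repeated quantifier.
df = pd.DataFrame({"col1": ["Nícolas", "xa**b"]})
mapping = {"í": "i", "a**": "-"}

# regex=False avoids the error, but only replaces cells equal to a key,
# so the "í" inside "Nícolas" is left alone.
whole_cell = df.replace(mapping, regex=False)

# Escaping the keys keeps substring replacement and treats them literally.
escaped = {re.escape(k): v for k, v in mapping.items()}
substrings = df.replace(escaped, regex=True)

print(whole_cell["col1"].tolist())
print(substrings["col1"].tolist())
```

For a plain accent-to-letter dictionary, none of the keys need escaping, so regex=True works as is.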
To replace | with , in the my_specific_column column of a DataFrame (regex=False makes | a literal character rather than a regex operator):
df.my_specific_column = df.my_specific_column.str.replace('|', ',', regex=False)
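Whether the pattern is treated as a regular expression matters here because | is the alternation operator. The sketch below uses only Python's re module on a made-up string to show the difference. It assumes the pattern would be handed to re when regex matching is enabled:

```python
import re

text = "a|b|c"

# As a regular expression, "|" means "empty OR empty", so it matches the
# empty string at every position rather than the pipe character.
as_regex = re.sub("|", ",", text)

# Escaping the pattern, or using plain str.replace, treats "|" literally.
escaped = re.sub(re.escape("|"), ",", text)
literal = text.replace("|", ",")

print(as_regex)
print(escaped, literal)
```

The same idea applies to other metacharacters such as ., + or (: either escape them or turn regex matching off.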