Replacing special characters in a pandas DataFrame

Question

So, I have this huge DataFrame, which is encoded in iso8859_15.

A few columns contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".

I have the key to replace them in a dictionary: {'í':'i', 'á':'a', ...}

I tried replacing them a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex and inplace

Also:

df.update(pd.Series(dic))

Neither produced the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".

Help?

Answer 1

The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, for which you provide a second dictionary of substitution pairs.

So, this should work:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['NÍCOLAS', 'asdč'], 'b': [3, 4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit. It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and are using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are multibyte byte strings. Which bytes they contain depends on the character encoding of the source file, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those characters. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
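To make the str/bytes distinction concrete, here is a minimal Python 3 sketch. The variable names are illustrative and not part of the original answer; it only shows that the same character becomes different byte sequences depending on the encoding, which is why a byte-string key cannot match decoded text.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
key = "í"

# The same character encodes to different byte sequences.
utf8_key = key.encode("utf-8")          # two bytes in UTF-8
latin9_key = key.encode("iso8859_15")   # one byte in ISO 8859-15

print(utf8_key, latin9_key)

# Decoding with the matching codec recovers the original character.
decoded = latin9_key.decode("iso8859_15")
print(decoded == key)

# Decoding with the wrong codec yields different text, so a key built
# this way would never match the properly decoded DataFrame contents.
mismatched = utf8_key.decode("iso8859_15")
print(mismatched == key)
```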

Answer 2

replace works out of the box in Python 3, without specifying a particular column.

Load data:

df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes

Create dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
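For readers without a test.csv at hand, the same steps can be sketched end to end with an in-memory file. The CSV content below is an assumption reconstructed from the output shown above, and io.BytesIO stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv, encoded the same way as the original file.
raw = "col1,col2\nhe,hello\nNícolas,shárk\nwelcome,yes\n".encode("iso8859_15")

# Decode the bytes with the matching codec while loading.
df = pd.read_csv(io.BytesIO(raw), sep=",", encoding="iso8859_15")

# Flat (non-nested) dictionary: applied to every string column.
dictionary = {"í": "i", "á": "a"}

# regex=True makes each key match inside a cell, not only the whole cell.
cleaned = df.replace(dictionary, regex=True)
print(cleaned)
```

Returning a new frame instead of using inplace=True keeps the original data available for comparison.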

Answer 3

If someone gets the following error message:

multiple repeat at position 2

try this:

df.replace(dictionary, regex=False, inplace=True)

instead of:

df.replace(dictionary, regex=True, inplace=True)
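One caveat worth noting: with regex=False, DataFrame.replace only matches whole cell values, so it will not touch a character inside a longer string. Errors like the one above come from the regular-expression engine when a key contains metacharacters such as * or +. A possible workaround, sketched here with made-up sample data, is to keep regex=True and escape the keys with re.escape.

```python
import re

import pandas as pd

# Illustrative data: one value contains regex metacharacters.
df = pd.DataFrame({"col1": ["2**3", "Nícolas"]})

# "**" is not a valid regular expression on its own.
dictionary = {"**": "^", "í": "i"}

# regex=False avoids the error but only replaces cells equal to a key,
# so nothing changes here.
unchanged = df.replace(dictionary, regex=False)

# Escaping the keys keeps substring matching and treats them literally.
escaped = {re.escape(key): value for key, value in dictionary.items()}
result = df.replace(escaped, regex=True)
print(result)
```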

Answer 4

Replace | with , in the my_specific_column DataFrame column:

df.my_specific_column = df.my_specific_column.str.replace('|', ',', regex=False)  # treat '|' as a literal character

Answer 1
12 accepted 2017-08-09 17:01:27

Answer 2
7 2017-08-09 17:27:57

Answer 3
1 2019-02-01 16:14:00

Answer 4
0 2022-08-31 15:13:40
