Replacing special characters in pandas dataframe

So, I have this huge DF which is encoded in iso8859_15.

I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".

I have the key to replace them in a dictionary {'í':'i', 'á':'a', ...}

I tried replacing them a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True) ###BOTH WITH AND WITHOUT REGEX AND REPLACE

Also:

df.update(pd.Series(dic))

Neither of them gave the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".

Help?

The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, for which you have to provide a second dictionary with the substitution pairs.

So, this should work:

>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit. It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and are using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the character encoding of the source file, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
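To make the str/bytes distinction concrete, here is a small Python 3 sketch. It only uses the standard library; the variable names are illustrative and not from the original answer. It shows that the character 'í' becomes the two bytes seen in the UTF-8 dictionary above, while it is a single byte in iso8859_15, and that a plain str key matches the decoded text.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
key = 'í'

utf8_bytes = key.encode('utf-8')         # two bytes in UTF-8
latin9_bytes = key.encode('iso8859_15')  # one byte in ISO 8859-15

print(utf8_bytes)    # b'\xc3\xad'
print(latin9_bytes)  # b'\xed'

# Once text is decoded to str, a str key matches regardless of the
# encoding the file was stored in.
name = b'N\xcdCOLAS'.decode('iso8859_15')
fixed = name.replace('Í', 'I')
print(fixed)         # NICOLAS
```

In Python 2, a literal written without the u prefix would correspond to the bytes values here, which is why it never matches properly decoded text.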

In Python 3, replace works out of the box without specifying a particular column.

Load data:

df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes

Create dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
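The steps above can be reproduced without a file on disk. The sketch below is an assumption-laden stand-in: instead of test.csv it builds the same three rows in memory, encodes them as iso8859_15 bytes, and reads them back through io.BytesIO. The names raw and cleaned are illustrative only.

```python
import io

import pandas as pd

# In-memory stand-in for test.csv, stored as ISO 8859-15 bytes.
raw = 'col1,col2\nhe,hello\nNícolas,shárk\nwelcome,yes\n'.encode('iso8859_15')

# Decode with the same encoding the data was written in.
df = pd.read_csv(io.BytesIO(raw), sep=',', encoding='iso8859_15')

dictionary = {'í': 'i', 'á': 'a'}

# regex=True makes each key match as a substring of the cell values.
cleaned = df.replace(dictionary, regex=True)
print(cleaned)
```

Returning a new frame rather than using inplace=True keeps the original df available for comparison.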

If you get the following error message:

multiple repeat at position 2

try df.replace(dictionary, regex=False, inplace=True)

instead of df.replace(dictionary, regex=True, inplace=True)
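One caveat with switching to regex=False: DataFrame.replace then only replaces cells whose whole value equals a key, so substrings such as the "í" inside "Nícolas" are left alone. Errors like the one above come from the regex engine when a key contains regex metacharacters (for example * or +). A possible alternative, sketched here with made-up sample data, is to keep regex=True and escape the keys with re.escape so they are matched literally.

```python
import re

import pandas as pd

# Hypothetical sample data; '+' is a regex metacharacter.
df = pd.DataFrame({'col1': ['Nícolas', 'a+b']})
dictionary = {'í': 'i', '+': ' plus '}

# regex=False compares whole cell values, so nothing changes here.
literal = df.replace(dictionary, regex=False)

# Escape each key so it is treated as literal text inside a regex.
escaped = {re.escape(k): v for k, v in dictionary.items()}
result = df.replace(escaped, regex=True)
print(result)
```

This keeps substring replacement working while avoiding pattern errors from special characters in the keys.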

To replace | with , in the my_specific_column column of the dataframe (regex=False makes sure | is treated as a literal character and not as regex alternation):

df.my_specific_column = df.my_specific_column.str.replace('|', ',', regex=False)
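A minimal, self-contained sketch of the same idea on a standalone Series follows. The sample values are invented for illustration, and regex=False is passed explicitly so that the pipe character is matched literally.

```python
import pandas as pd

# Hypothetical column with pipe-separated values.
s = pd.Series(['a|b', 'c|d|e'])

# Literal replacement of every '|' with ','.
out = s.str.replace('|', ',', regex=False)
print(out.tolist())  # ['a,b', 'c,d,e']
```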
