Replacing special characters in pandas dataframe
So, I have this huge DataFrame, which is encoded in iso8859_15.
I have a few columns that contain names and places in Brazil, so some of them contain special characters such as "í" or "Ô".
I have the key to replace them in a dictionary: {'í':'i', 'á':'a', ...}
I tried replacing them in a couple of ways (below), but none of them worked.
df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex and inplace
Also:
df.update(pd.Series(dic))
None of them produced the expected output, which would be for strings such as "NÍCOLAS" to become "NICOLAS".
Help?
The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, for which you provide a second dictionary with substitution pairs.
So, this should work:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['NÍCOLAS', 'asdč'], 'b': [3, 4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4
>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4
Edit. It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:
dictionary = {u'í': 'i', u'á': 'a'}
If you have a definition like this (and are using Python 2):
dictionary = {'í': 'i', 'á': 'a'}
then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the source file's character encoding, but presuming you use UTF-8, you'll get:
dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}
And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.
On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
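The difference between a Unicode key and its encoded bytes can be seen in Python 3 by encoding the key by hand. This is only a sketch: the names unicode_key and byte_key are made up for illustration, and it assumes UTF-8 as the source encoding, as in the example above.

```python
# Hypothetical illustration of a Unicode key versus its UTF-8 bytes.
unicode_key = "Í"                        # one Unicode code point
byte_key = unicode_key.encode("utf-8")   # the bytes a plain Python 2 literal would hold (UTF-8 source)

print(byte_key)                          # b'\xc3\x8d'
print(len(unicode_key), len(byte_key))   # 1 2

# Replacement works when the key is a real Unicode string.
replaced = "NÍCOLAS".replace(unicode_key, "I")
print(replaced)                          # NICOLAS
```

A two-byte key cannot match the single decoded character in the column, which is consistent with the replacement silently doing nothing.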
replace works out of the box in Python 3, without specifying a specific column.
Load Data:
df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df
Result:
      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes
Create Dictionary:
dictionary = {'í':'i', 'á':'a'}
Replace:
df.replace(dictionary, regex=True, inplace=True)
Result:
      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
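For single-character substitutions like the ones in this dictionary, another option is to build a translation table with str.maketrans and apply it through Series.str.translate. This is a swapped-in technique, not what the answer above uses, and the sample frame is made up to mirror the output shown rather than loaded from test.csv.

```python
import pandas as pd

# Made-up sample mirroring the frame above (not read from test.csv).
df = pd.DataFrame({
    "col1": ["he", "Nícolas", "welcome"],
    "col2": ["hello", "shárk", "yes"],
})
dictionary = {"í": "i", "á": "a"}

# str.maketrans accepts a dict only if every key is a single character,
# which is the case for this dictionary.
table = str.maketrans(dictionary)

# Translate each text column; columns are named explicitly so that
# numeric columns in a larger frame would be left alone.
for col in ["col1", "col2"]:
    df[col] = df[col].str.translate(table)

print(df)
```

Because no regular expressions are involved, keys never need escaping with this approach, but it cannot handle multi-character keys.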
If someone gets the following error message:
multiple repeat at position 2
try this:
df.replace(dictionary, regex=False, inplace=True)
instead of:
df.replace(dictionary, regex=True, inplace=True)
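Errors of this kind come from Python's re module: with regex=True each dictionary key is compiled as a pattern, so a key containing characters such as * or + may be invalid. As far as I can tell, regex=False makes replace match whole cell values rather than substrings, so if substring replacement is still needed, one option is to escape the keys with re.escape. A minimal sketch, where the frame and the "**" key are invented for illustration:

```python
import re

import pandas as pd

# Invented data: a key made of regex metacharacters.
df = pd.DataFrame({"a": ["x**y", "plain"]})
dictionary = {"**": "^"}

# Compiling the raw key fails with re.error (the exact message depends on the key).
try:
    re.compile("**")
    raw_key_compiles = True
except re.error:
    raw_key_compiles = False

# Escape every key so it is matched literally, then keep regex=True
# so that substrings inside a cell are still replaced.
escaped = {re.escape(key): value for key, value in dictionary.items()}
result = df.replace(escaped, regex=True)

# For comparison: without regex, only cells equal to the whole key would change.
whole_cell_only = df.replace(dictionary, regex=False)

print(result)
print(whole_cell_only)
```

If the regex=False call appears to do nothing on your data, that whole-cell matching is a likely reason.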
To replace | with , in the my_specific_column dataframe column:
df.my_specific_column = df.my_specific_column.str.replace('|', ',', regex=False)  # regex=False treats '|' as a literal character