df.replace is not working with separator in pandas column
I have a df:

import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark,ABCD", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
print(df)
I'm trying to replace the column values using a dict:
dict = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2=df.replace({"Courses": dict})
print(df2)
but the rows containing the separator , are not getting replaced even though the values are present. I'm getting this as output:
Courses Fee Duration Discount
0 Spark,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
but the output should be:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
It's probably worth learning how the regex parameter works so that you can leverage it in the future. Nonetheless, it is possible to split on the , and explode so that you have one word per row. Then you can replace, group by the original index, and join back to a comma-separated string.
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark,ABCD", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
    'Discount': [1000, 2300, 1000, 1200, 2500]
}
df = pd.DataFrame(technologies)
d = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = (df.Courses.str.split(',').explode().replace(d)
.groupby(level=0).agg(','.join))
Output:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
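To see why this works, it helps to look at the intermediate Series: explode repeats the original index for each word, which is what lets groupby(level=0) stitch the rows back together. A small sketch using a subset of the same data:

```python
import pandas as pd

s = pd.Series(["Spark,ABCD", "PySpark", "Hadoop"], name="Courses")
d = {"Spark": 'S', "PySpark": 'P', "Hadoop": 'H'}

exploded = s.str.split(',').explode()
# explode() repeats the original index for each list element,
# so row 0 appears twice: once for "Spark", once for "ABCD"
print(exploded.index.tolist())   # [0, 0, 1, 2]

# replace() now matches whole cell values, so "Spark" is replaced
# even though it originally sat next to "ABCD"
replaced = exploded.replace(d)

# grouping on the (repeated) index joins each row back together
result = replaced.groupby(level=0).agg(','.join)
print(result.tolist())           # ['S,ABCD', 'P', 'H']
```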
Method 1: Ensure all the compound words come before the single words in the dictionary, i.e. PySpark is before Spark:
d = {"PySpark" : 'P', "Spark" : 'S', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2 = df.replace({"Courses": d}, regex = True)
print(df2)
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
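The ordering matters because with regex=True the replacements are applied one after another as substring substitutions, in the dictionary's insertion order. A small sketch of what goes wrong when Spark comes first:

```python
import pandas as pd

df = pd.DataFrame({'Courses': ["Spark,ABCD", "PySpark"]})

# wrong order: "Spark" is substituted first, mangling "PySpark" into "PyS",
# so the later "PySpark" pattern no longer matches anything
bad = df.replace({"Courses": {"Spark": 'S', "PySpark": 'P'}}, regex=True)
print(bad.Courses.tolist())    # ['S,ABCD', 'PyS']

# right order: the longer key "PySpark" is tried before "Spark"
good = df.replace({"Courses": {"PySpark": 'P', "Spark": 'S'}}, regex=True)
print(good.Courses.tolist())   # ['S,ABCD', 'P']
```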
Method 2: Wrap the words in word boundaries (\b):
new_dict = pd.DataFrame(d.items(), columns = ['keys', 'values'])
new_dict['keys'] = '\\b' + new_dict['keys'] + '\\b'
new_dict = new_dict.set_index('keys').to_dict()['values']
df3 = df.replace({"Courses": new_dict}, regex = True)
df3
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
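The boundary-wrapped dictionary can also be built with a plain dict comprehension instead of a round trip through a DataFrame; this sketch additionally uses re.escape in case a key contains regex metacharacters:

```python
import re
import pandas as pd

d = {"Spark": 'S', "PySpark": 'P', "Hadoop": 'H', "Python": 'P', "Pandas": 'P'}

# \b...\b makes each pattern match only whole words, so the order
# of the keys no longer matters
new_dict = {rf'\b{re.escape(k)}\b': v for k, v in d.items()}

df = pd.DataFrame({'Courses': ["Spark,ABCD", "PySpark", "Hadoop"]})
df3 = df.replace({"Courses": new_dict}, regex=True)
print(df3.Courses.tolist())   # ['S,ABCD', 'P', 'H']
```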
Here's a way to do it that focuses on the column you want to change (Courses):
dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
lambda x: x.str.split(',')).transform(
lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
Explanation:
- transform to replace each csv string value in the column with a list
- transform again, this time to replace each item in a value's list using the dictionary dct
- Series.str.join to convert each value's list back to a csv string.
Full test code:
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
lambda x: x.str.split(',')).transform(
lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
print(df)
Input:
Courses Fee Duration Discount
0 Spark,ABCD 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
4 Pandas 26000 NaN 2500
Output:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
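As a side note, the two transform calls can be collapsed into a single apply with dict.get, which falls back to the original word when there is no mapping; this is a sketch equivalent to the code above:

```python
import pandas as pd

dct = {"Spark": 'S', "PySpark": 'P', "Hadoop": 'H', "Python": 'P', "Pandas": 'P'}
df = pd.DataFrame({'Courses': ["Spark,ABCD", "PySpark", "Hadoop"]})

# split each cell, map each word through dct (falling back to the
# word itself), and join back into a comma-separated string
df.Courses = df.Courses.apply(
    lambda s: ','.join(dct.get(w, w) for w in s.split(',')))
print(df.Courses.tolist())   # ['S,ABCD', 'P', 'H']
```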