df.replace is not working with a separator in a pandas column
I have a df:
import numpy as np
import pandas as pd
technologies = {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
I am trying to replace the column values with the values from a dict:
dict = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2=df.replace({"Courses": dict})
print(df2)
But rows that contain a separator are not replaced, even though the value is present. This is the output I get:
Courses Fee Duration Discount
0 Spark,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
But the output should be:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
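It is probably worth learning how the regex parameter works so that you can use it in the future. That said, you can also split on , and explode so that each row holds a single word. You can then replace, group by the original index, and join the words back into a comma-separated string.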
import numpy as np  # needed for np.nan below
import pandas as pd
technologies= {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
d = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = (df.Courses.str.split(',').explode().replace(d)
.groupby(level=0).agg(','.join))
Output:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
Method 1: Make sure every compound word comes before the shorter word it contains. In the dictionary, PySpark comes before Spark:
d = {"PySpark" : 'P', "Spark" : 'S', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df2 = df.replace({"Courses": d}, regex = True)
print(df2)
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
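To see why the order matters here, it may help to compare the default exact matching with regex=True on a tiny Series. This is only an illustrative sketch; the Series s and the single-key dict are made up for the example and are not part of the question.

```python
import pandas as pd

s = pd.Series(["Spark,ABCD", "PySpark"])

# Default behaviour: a key has to equal the whole cell value,
# so neither "Spark,ABCD" nor "PySpark" is touched.
exact = s.replace({"Spark": "S"})

# regex=True: the key is treated as a pattern and searched for
# inside each string, so "Spark" is also found inside "PySpark".
partial = s.replace({"Spark": "S"}, regex=True)

print(exact.tolist())    # ['Spark,ABCD', 'PySpark']
print(partial.tolist())  # ['S,ABCD', 'PyS']
```

The 'PyS' result is the reason the longer key is listed first: once PySpark has become P, there is nothing left for the Spark pattern to match.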
Method 2: Wrap each word in word boundaries:
new_dict = pd.DataFrame(d.items(), columns = ['keys', 'values'])
new_dict['keys'] = '\\b' + new_dict['keys'] + '\\b'
new_dict = new_dict.set_index('keys').to_dict()['values']
df3 = df.replace({"Courses": new_dict}, regex = True)
df3
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
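A variation on the same idea, in case it is useful: build one alternation pattern from the keys and pass a function as the replacement to Series.str.replace. This is a sketch rather than part of the answer above; the names courses, keys and pattern are my own, and sorting the keys longest-first is an extra precaution I added.

```python
import re
import pandas as pd

d = {"Spark": "S", "PySpark": "P", "Hadoop": "H", "Python": "P", "Pandas": "P"}
courses = pd.Series(["Spark,ABCD", "PySpark", "Hadoop", "Python", "Pandas"])

# Build a single pattern of the form \b(?:PySpark|Spark|...)\b.
# Keys are escaped and sorted longest first (an extra precaution).
keys = sorted(d, key=len, reverse=True)
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b"

# Each match is looked up in d. The text is scanned only once,
# so a value that was already replaced is never matched again.
result = courses.str.replace(pattern, lambda m: d[m.group(0)], regex=True)

print(result.tolist())  # ['S,ABCD', 'P', 'H', 'P', 'P']
```

Because everything happens in one pass, the order of the keys in the dictionary no longer affects the result.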
Here is an approach that focuses on the column you want to change (Courses):
dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
lambda x: x.str.split(',')).transform(
lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
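Explanation:
- transform replaces each csv string value in the column with a list
- transform again, this time to replace each item in a value's list using the dictionary dct
- Series.str.join converts each value's list back to a csv string.
Full test code: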
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark,ABCD","PySpark","Hadoop","Python","Pandas"],
'Fee' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', None,np.nan],
'Discount':[1000,2300,1000,1200,2500]
}
df = pd.DataFrame(technologies)
print(df)
dct = {"Spark" : 'S', "PySpark" : 'P', "Hadoop": 'H', "Python" : 'P', "Pandas": 'P'}
df.Courses = df.Courses.transform(
lambda x: x.str.split(',')).transform(
lambda x: [dct[y] if y in dct else y for y in x]).str.join(',')
print(df)
Input:
Courses Fee Duration Discount
0 Spark,ABCD 22000 30days 1000
1 PySpark 25000 50days 2300
2 Hadoop 23000 30days 1000
3 Python 24000 None 1200
4 Pandas 26000 NaN 2500
Output:
Courses Fee Duration Discount
0 S,ABCD 22000 30days 1000
1 P 25000 50days 2300
2 H 23000 30days 1000
3 P 24000 None 1200
4 P 26000 NaN 2500
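If the column might contain missing values, the same split / look up / join steps can also be written as a small plain-Python helper applied with Series.map. This is a sketch under that assumption; replace_tokens is a hypothetical helper name, and the NaN row is added only to show the guard.

```python
import numpy as np
import pandas as pd

def replace_tokens(value, mapping, sep=","):
    """Replace each sep-separated token found in mapping; keep the rest."""
    if not isinstance(value, str):
        return value  # leave NaN / None untouched
    return sep.join(mapping.get(tok, tok) for tok in value.split(sep))

dct = {"Spark": "S", "PySpark": "P", "Hadoop": "H", "Python": "P", "Pandas": "P"}
courses = pd.Series(["Spark,ABCD", "PySpark", np.nan])

# Apply the helper element by element; tokens not in dct pass through.
result = courses.map(lambda v: replace_tokens(v, dct))

print(result.tolist())
```

Using mapping.get(tok, tok) does the same job as the "dct[y] if y in dct else y" expression above, with a single lookup per token.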