Replace values in a pandas dataframe
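I have a pandas DataFrame that is generated from events. Each event has a unique ID, and it produces duplicate rows in the DataFrame.
The problem is that some of these duplicate rows contain random values that differ from one another.
I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value for each event_id.
The Salary column also has a trailing underscore (as in 1975.42_) that needs to be removed.
Thanks in advance.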
Input data:
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 13414.12
2 1_a Feb NaN teacher 13414.12
3 1_a Mar ___ 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42
Expected output:
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 24 principle 25000
2 1_a Feb andrew 24 principle 25000
3 1_a Mar andrew 24 principle 25000
4 1_a Apr andrew 24 principle 25000
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42
7 1_b Feb Ash 45 scientist 1975.42
8 1_b Mar Ash 45 scientist 1975.42
9 1_b Apr Ash 45 scientist 1975.42
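First I had to create the DataFrame. Unfortunately I could not split the values out of a raw string that contains blanks, so the empty cells are filled with placeholders such as - in the code; with your own DataFrame this should not be a problem.
OK, now the logic:
The code builds a collection of the unique event values, then iterates over the columns for each event. With collections.Counter I get a dictionary that counts how often each value occurs in the column filtered by event, and every value in that column is then set to the most frequent one.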
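As a minimal sketch of the counting step described above, assume the values of one column for one event have already been pulled out into a plain list. The list `ages` is made up for illustration.

```python
import collections

# Hypothetical values of one column for a single event
ages = ["23", "NA", "-", "23", "24"]

# Counter maps each distinct value to the number of times it occurs
counter = collections.Counter(ages)

# Pick the key with the highest count
most_frequent = max(counter, key=counter.get)
print(most_frequent)
```

The same `max(counter, key=counter.get)` call is what selects the replacement value in the full code.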
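This only fails when a repeated garbage value outnumbers the good value in your table. For example, if the column filtered by event contains 30 different garbage values but the good value is repeated 2 times, the good value will be the replacement.
If the column filtered by event contains 30 garbage values and the good value appears only once, one of the garbage values may end up as the replacement.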
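The two cases above can be checked with a small sketch. The helper name `most_frequent` and the sample lists are assumptions made for illustration. When every value occurs exactly once, `max` returns the first value it encountered, so the result depends on row order.

```python
import collections

def most_frequent(values):
    """Return the most common value; on a tie, the first one encountered wins."""
    counter = collections.Counter(values)
    return max(counter, key=counter.get)

# Hypothetical column: distinct junk values plus a good value that appears twice
junk = ["#1", "#2", "#3", "#4"]
repeated_good = junk[:2] + ["Ash"] + junk[2:] + ["Ash"]
print(most_frequent(repeated_good))

# Same junk, but the good value appears only once: every count is 1,
# so the first value in the list is returned
single_good = junk + ["Ash"]
print(most_frequent(single_good))
```

If ties like the second case are likely in real data, it may be safer to leave such groups unchanged than to accept whichever value happens to come first.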
Here is the code:
import pandas as pd
import collections
data = """ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 - 13414.12
2 1_a Feb - NA teacher 13414.12
3 1_a Mar ___ - z 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 - scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42"""
# Drop the header line, then split each remaining row on whitespace
data = data.split('\n')[1:]
for i in range(len(data)):
    data[i] = data[i].split()
df = pd.DataFrame(data, columns=['ID', 'event_id', 'Month', 'Name', 'Age', 'Occupation', 'Salary'])
print(df)
print('\n')
events = set(df['event_id'])
columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in events:
    # Boolean mask selecting the rows that belong to this event
    mask = df['event_id'] == event
    print(df.loc[mask])
    for column in columns:
        # Count how often each value occurs in this column for this event
        counter = collections.Counter(df.loc[mask, column])
        print(df.loc[mask, column])
        print()
        # Most frequent value; on a tie the value seen first wins
        new_value = max(counter, key=counter.get)
        # Assign through .loc instead of chained indexing (df[column][i] = ...)
        df.loc[mask, column] = new_value
print(df)
Output:
ID event_id Month Name Age Occupation Salary
0 1 1_a Jan andrew 23 teacher 13414.12
1 2 1_a Feb andrew 23 teacher 13414.12
2 3 1_a Mar andrew 23 teacher 13414.12
3 4 1_a Apr andrew 23 teacher 13414.12
4 5 1_a May andrew 23 teacher 13414.12
5 6 1_b Jan Ash 45 scientist 1975.42
6 7 1_b Feb Ash 45 scientist 1975.42
7 8 1_b Mar Ash 45 scientist 1975.42
8 9 1_b Apr Ash 45 scientist 1975.42
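As an alternative sketch, not the approach used in the code above, the same per-event replacement can be written with `groupby` and `transform`. The frame below is a cut-down, made-up sample, and the helper `most_frequent` is a hypothetical name. The trailing underscore in Salary is removed with `str.rstrip` before counting.

```python
import pandas as pd

# Small hypothetical frame in the same shape as the question's data
df = pd.DataFrame({
    "event_id": ["1_a", "1_a", "1_a", "1_b", "1_b", "1_b"],
    "Name": ["andrew", "___", "andrew", "Ash", "#$%6", "Ash"],
    "Salary": ["13414.12", "13414.12", "13414.12",
               "1975.42_", "1975.42", "1975.42"],
})

# Strip the trailing underscore before counting
df["Salary"] = df["Salary"].str.rstrip("_")

def most_frequent(series):
    # value_counts sorts by count in descending order; take the top label
    return series.value_counts().index[0]

for column in ["Name", "Salary"]:
    # transform broadcasts the per-group result back to every row of the group
    df[column] = df.groupby("event_id")[column].transform(most_frequent)

print(df)
```

This avoids the explicit loop over events, but it carries the same caveat: the most frequent value is only correct when the good value actually repeats more often than any garbage value.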