Replace values in a pandas dataframe
I have a pandas dataframe which is generated based on events. Each event has a unique ID, and it generates repeated rows in the dataframe.
The problem is that some of these repeated rows contain random values that differ from each other.
I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value per event_id.
Also, the Salary column sometimes has a trailing character (for example 1975.42_) that needs to be removed as well.
Thanks in advance.
Input data:
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 13414.12
2 1_a Feb NaN teacher 13414.12
3 1_a Mar ___ 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42
Desired output:
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 24 principle 25000
2 1_a Feb andrew 24 principle 25000
3 1_a Mar andrew 24 principle 25000
4 1_a Apr andrew 24 principle 25000
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42
7 1_b Feb Ash 45 scientist 1975.42
8 1_b Mar Ash 45 scientist 1975.42
9 1_b Apr Ash 45 scientist 1975.42
First I had to create the DataFrame. Unfortunately, I couldn't split the raw string on blank spaces while some cells were empty, so the sample below uses placeholder values in those cells. With your actual dataframe that shouldn't be a problem.
OK, now the logic:
The code builds the list of unique event IDs, then iterates over the columns for each event. With collections.Counter I get a dictionary that counts how often each value occurs in the column filtered by event, and I overwrite every row of that event with the most frequent value.
This only fails if a junk value is repeated at least as often as the good value. For example: if a column filtered by event has 30 different junk values, and the good value is the only one repeated (2x), then the good value will be the replacement.
If the column has 30 different junk values and the good value appears only once, every value is tied, and an arbitrary junk value (whichever tied value comes first) will be the replacement.
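To make the two cases above concrete, here is a minimal sketch using collections.Counter on made-up lists. The helper name most_frequent and the junk values are hypothetical, not part of the question's data, and the tie behaviour assumes Python 3.7+, where Counter keeps insertion order.

```python
import collections


def most_frequent(values):
    """Return the most frequent value; ties go to the value seen first."""
    counter = collections.Counter(values)
    # max() returns the first key with the highest count, and Counter
    # iterates in insertion order (Python 3.7+).
    return max(counter, key=counter.get)


# Hypothetical column: 30 distinct junk values.
junk = [f"junk{i}" for i in range(30)]

# Case 1: the good value is the only repeated one, so it wins.
case_repeated = most_frequent(junk + ["andrew", "andrew"])

# Case 2: the good value appears once, so every count is 1
# and the first value in the column wins.
case_single = most_frequent(junk + ["andrew"])

print(case_repeated)  # andrew
print(case_single)    # junk0
```

In the second case the result depends only on row order, so it is worth filtering out obvious junk before counting if that situation can occur in your data.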
Here is the code:
import collections

import pandas as pd

# Sample data. Empty cells are filled with placeholders ('-', 'NA', 'z')
# so that split() yields exactly 7 fields per row.
data = """ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 - 13414.12
2 1_a Feb - NA teacher 13414.12
3 1_a Mar ___ - z 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 - scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42"""

# Skip the header line and split each row on whitespace.
rows = [line.split() for line in data.split('\n')[1:]]
df = pd.DataFrame(rows, columns=['ID', 'event_id', 'Month', 'Name', 'Age', 'Occupation', 'Salary'])

columns = ['Name', 'Age', 'Occupation', 'Salary']

for event in df['event_id'].unique():
    # Boolean mask selecting the rows of the current event.
    mask = df['event_id'] == event
    for column in columns:
        # Count how often each value occurs in this event's column.
        counter = collections.Counter(df.loc[mask, column])
        # Most frequent value; ties go to the value that appears first.
        new_value = max(counter, key=counter.get)
        # Overwrite the whole group at once (avoids chained assignment).
        df.loc[mask, column] = new_value

print(df)
Output:
ID event_id Month Name Age Occupation Salary
0 1 1_a Jan andrew 23 teacher 13414.12
1 2 1_a Feb andrew 23 teacher 13414.12
2 3 1_a Mar andrew 23 teacher 13414.12
3 4 1_a Apr andrew 23 teacher 13414.12
4 5 1_a May andrew 23 teacher 13414.12
5 6 1_b Jan Ash 45 scientist 1975.42
6 7 1_b Feb Ash 45 scientist 1975.42
7 8 1_b Mar Ash 45 scientist 1975.42
8 9 1_b Apr Ash 45 scientist 1975.42
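For larger dataframes, the same idea can be sketched without explicit loops over events by using groupby with transform and Series.mode. This is an alternative under stated assumptions, not the method used above: the sample frame is rebuilt by hand from the data in the answer, the Salary cleanup assumes the unwanted trailing characters are underscores or hyphens, and ties are resolved differently (mode() returns its results sorted, so the smallest tied value wins rather than the first one seen).

```python
import pandas as pd

# Sample frame rebuilt by hand from the answer's data (all values as strings).
df = pd.DataFrame({
    "event_id": ["1_a"] * 5 + ["1_b"] * 4,
    "Name": ["andrew", "-", "___", "andrew", "andrew",
             "Ash", "#$%6", "Ash", "Ash"],
    "Age": ["23", "NA", "-", "23", "24", "45", "-", "45", "45"],
    "Occupation": ["-", "teacher", "z", "teacher", "principle",
                   "scientist", "scientist", "^#3a2g4", "scientist"],
    "Salary": ["13414.12", "13414.12", "13414.12", "13414.12", "25000",
               "1975.42_", "1975.42", "1975.42", "1975.42"],
})

# Assumption: the stray trailing characters are '_' or '-'.
df["Salary"] = df["Salary"].str.rstrip("_-")

# Replace each column with its per-event most frequent value.
# mode() may return several values on a tie; iloc[0] takes the first (sorted).
for column in ["Name", "Age", "Occupation", "Salary"]:
    df[column] = df.groupby("event_id")[column].transform(
        lambda s: s.mode().iloc[0]
    )

print(df)
```

On this sample the result matches the output above: andrew / 23 / teacher / 13414.12 for event 1_a and Ash / 45 / scientist / 1975.42 for event 1_b.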