
Replace values in a pandas dataframe

I have a pandas DataFrame that is generated from events. Each event has a unique ID and produces repeated rows in the DataFrame.

The problem is that some of these repeated rows contain random values that differ from one another.

I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value per event_id.

The Salary column also has a trailing hyphen (it appears as an underscore in the sample below) that needs to be removed as well.

Thanks in advance.

Input data:



print(df)

ID  event_id   Month    Name    Age Occupation Salary  
1   1_a        Jan      andrew  23             13414.12
2   1_a        Feb              NaN teacher    13414.12
3   1_a        Mar       ___                   13414.12
4   1_a        Apr      andrew  23  teacher    13414.12
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42_
7   1_b        Feb      #$%6        scientist  1975.42
8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

Desired output:

print(df)

ID  event_id   Month    Name    Age Occupation Salary
1   1_a        Jan      andrew  24  principle  25000
2   1_a        Feb      andrew  24  principle  25000
3   1_a        Mar      andrew  24  principle  25000
4   1_a        Apr      andrew  24  principle  25000
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42
7   1_b        Feb      Ash     45  scientist  1975.42
8   1_b        Mar      Ash     45  scientist  1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

First I had to create the DataFrame. Unfortunately, I couldn't split the values out of a raw string that contains blank cells, so I filled the gaps with placeholder values. With your existing DataFrame, that shouldn't be a problem.
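If the table only exists as aligned text, one possible way around the blank-cell problem is `pandas.read_fwf`, which reads fixed-width columns. This is a sketch under assumptions, not what the code further down does: the `raw` text here is a shortened, hypothetical version of the question's table, and the approach only works if the columns really are aligned by character position.

```python
import io

import pandas as pd

# Hypothetical fixed-width text in the style of the question's table.
# Row 1 has a blank Occupation cell, row 2 has a blank Name cell.
raw = (
    "ID  event_id  Name    Age  Occupation\n"
    "1   1_a       andrew  23\n"
    "2   1_a               NaN  teacher\n"
    "3   1_a       andrew  23   teacher\n"
)

# read_fwf infers the column boundaries from character positions that are
# blank on every line, so an empty cell becomes a missing value instead of
# shifting the later columns to the left.
df = pd.read_fwf(io.StringIO(raw))
print(df)
```

A plain `str.split()` on each line would return four fields for rows 1 and 2 and five for row 3, which is why the placeholders were needed in the hand-built version.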

OK, now the logic:

The code collects the unique event IDs, then iterates over the columns for each event. `collections.Counter` gives a dictionary-like count of the values in the event's column, and the most frequent value is then written over all the others.

This only breaks down when a junk value is repeated more often than the good one within an event. For example, if a column filtered by event holds 30 different junk values but the good value appears twice, the good value is still the one used as the replacement.

If the column holds 30 different junk values and the good value appears only once, every value is tied, and an arbitrary one (with `max` over a `Counter`, the first one encountered) becomes the replacement.
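The three cases can be checked in isolation with a small helper. `most_frequent` is a name made up for this illustration, and the value lists are invented; the selection logic is the same `Counter` plus `max` combination described above.

```python
import collections


def most_frequent(values):
    """Return the most frequent value; ties go to the value seen first."""
    counter = collections.Counter(values)
    # max() returns the first key with the highest count, and a Counter
    # keeps its keys in first-seen order (Python 3.7+).
    return max(counter, key=counter.get)


# Good value repeated twice among distinct junk: the good value wins.
print(most_frequent(['#$%6', 'Ash', '___', 'Ash', 'zz']))  # Ash

# Every value appears once: the first value seen is returned, junk or not.
print(most_frequent(['#$%6', 'Ash', '___']))  # #$%6

# Junk repeated more often than the good value: the junk wins.
print(most_frequent(['-', '-', '-', 'Ash', 'Ash']))  # -
```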

Here is the code:

import pandas as pd
import collections

data =   """ID  event_id   Month    Name    Age Occupation Salary  
            1   1_a        Jan      andrew  23     -       13414.12
            2   1_a        Feb        -     NA  teacher    13414.12
            3   1_a        Mar       ___     -     z       13414.12
            4   1_a        Apr      andrew  23  teacher    13414.12
            5   1_a        May      andrew  24  principle  25000
            6   1_b        Jan      Ash     45  scientist  1975.42_
            7   1_b        Feb      #$%6     -  scientist  1975.42
            8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
            9   1_b        Apr      Ash     45  scientist  1975.42"""

data = data.split('\n')[1:]

for i in range(len(data)):
    data[i] = data[i].split()

df = pd.DataFrame(data, columns=['ID', 'event_id','Month', 'Name', 'Age', 'Occupation', 'Salary'])

events = df['event_id'].unique()
columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in events:
    # Boolean mask selecting the rows that belong to this event
    mask = df['event_id'] == event
    for column in columns:
        # Count how often each value occurs in this event's column
        counter = collections.Counter(df.loc[mask, column])
        # Most frequent value; ties go to the value seen first
        new_value = max(counter, key=counter.get)
        # Assign through .loc so the original DataFrame is updated
        # (df[column][i] = ... is chained assignment and may not stick)
        df.loc[mask, column] = new_value

print(df)

Output:

  ID event_id Month    Name Age Occupation    Salary
0  1      1_a   Jan  andrew  23    teacher  13414.12
1  2      1_a   Feb  andrew  23    teacher  13414.12
2  3      1_a   Mar  andrew  23    teacher  13414.12
3  4      1_a   Apr  andrew  23    teacher  13414.12
4  5      1_a   May  andrew  23    teacher  13414.12
5  6      1_b   Jan     Ash  45  scientist   1975.42
6  7      1_b   Feb     Ash  45  scientist   1975.42
7  8      1_b   Mar     Ash  45  scientist   1975.42
8  9      1_b   Apr     Ash  45  scientist   1975.42
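For larger frames, the same idea can probably be written without the explicit loops by using `groupby` with `transform`. The following is only a sketch of that alternative, not the code that produced the output above: the small DataFrame and the helper name `group_mode` are made up for the example, and the `rstrip` call assumes the stray trailing character in Salary is an underscore or a hyphen.

```python
import pandas as pd

# Small frame mirroring the question's layout (values are illustrative).
df = pd.DataFrame({
    'event_id':   ['1_a', '1_a', '1_a', '1_b', '1_b', '1_b'],
    'Name':       ['andrew', '___', 'andrew', 'Ash', '#$%6', 'Ash'],
    'Occupation': ['teacher', None, 'teacher',
                   'scientist', 'scientist', '^#3a2g4'],
    'Salary':     ['13414.12', '13414.12', '13414.12',
                   '1975.42_', '1975.42', '1975.42'],
})

# Strip trailing underscores or hyphens from Salary before counting.
df['Salary'] = df['Salary'].str.rstrip('_-')


def group_mode(series):
    """Most frequent non-missing value in the group, or pd.NA if none."""
    modes = series.mode()  # sorted; missing values are ignored by default
    return modes.iloc[0] if not modes.empty else pd.NA


for column in ['Name', 'Occupation', 'Salary']:
    # transform broadcasts the single value returned per group to every row
    df[column] = df.groupby('event_id')[column].transform(group_mode)

print(df)
```

Two differences from the loop version are worth keeping in mind: `Series.mode` ignores missing values and breaks ties by sort order rather than by first occurrence, and Salary is still a string column here, so `pd.to_numeric` would be needed to get numbers.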

