繁体   English   中英

替换熊猫数据框中的值

[英]Replace values in a pandas dataframe

我有一个基于事件生成的熊猫dataframe 每个事件都有一个唯一的 ID,它会在数据框中生成重复的行。

问题是这些重复行中的一些包含随机值,但它们彼此不同。

我需要根据每个 event_id 最常见的值替换列( Name, Age Occupation)中的值。

工资列也有尾随连字符需要删除它

提前致谢

输入数据



print(df)

ID  event_id   Month    Name    Age Occupation Salary  
1   1_a        Jan      andrew  23             13414.12
2   1_a        Feb              NaN teacher    13414.12
3   1_a        Mar       ___                   13414.12
4   1_a        Apr      andrew  23  teacher    13414.12
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42_
7   1_b        Feb      #$%6        scientist  1975.42
8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

期望的输出:

print(df)

ID  event_id   Month    Name    Age Occupation Salary
1   1_a        Jan      andrew  24  principle  25000
2   1_a        Feb      andrew  24  principle  25000
3   1_a        Mar      andrew  24  principle  25000
4   1_a        Apr      andrew  24  principle  25000
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42
7   1_b        Feb      Ash     45  scientist  1975.42
8   1_b        Mar      Ash     45  scientist  1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

首先,我必须创建 DataFrame,不幸的是,我无法从带有空格的 raw_string 中拆分值,但是在您的数据框中,这应该不是问题。

好的,现在逻辑:

该代码创建了一个包含事件唯一值的列表,然后我对每个事件的列进行迭代。 使用集合,我可以得到一个字典来计算过滤事件列中值的频率,并且最频繁地设置其他值。

仅当您的表中重复的垃圾多于良好的值时,这才行不通。 例如:如果您在按事件过滤的列中有 30 个垃圾值,但只有好的那个被重复了 2 次,那么好的那个将是替换值。

如果按事件过滤的列中有 30 个垃圾值,但好的值只出现一次,那么随机垃圾将是您的替换值。

这是代码:

import pandas as pd
import collections

data =   """ID  event_id   Month    Name    Age Occupation Salary  
            1   1_a        Jan      andrew  23     -       13414.12
            2   1_a        Feb        -     NA  teacher    13414.12
            3   1_a        Mar       ___     -     z       13414.12
            4   1_a        Apr      andrew  23  teacher    13414.12
            5   1_a        May      andrew  24  principle  25000
            6   1_b        Jan      Ash     45  scientist  1975.42_
            7   1_b        Feb      #$%6     -  scientist  1975.42
            8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
            9   1_b        Apr      Ash     45  scientist  1975.42"""

data = data.split('\n')[1:]

for i in range(len(data)):
    data[i] = data[i].split()

df = pd.DataFrame(data, columns=['ID', 'event_id','Month', 'Name', 'Age', 'Occupation', 'Salary'])

print(df)
print('\n')
events = set([x for x in df['event_id']])
columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in events:
    print(df.loc[df['event_id'] == event])
    for column in columns:
        counter = collections.Counter(df.loc[df['event_id'] == event][column])
        print(df.loc[df['event_id'] == event][column])
        print()
        new_value = max(counter, key=counter.get)
        for i in df.loc[df['event_id'] == event][column].index.tolist():
            df[column][i] = new_value

print(df)

输出:

  ID event_id Month    Name Age Occupation    Salary
0  1      1_a   Jan  andrew  23    teacher  13414.12
1  2      1_a   Feb  andrew  23    teacher  13414.12
2  3      1_a   Mar  andrew  23    teacher  13414.12
3  4      1_a   Apr  andrew  23    teacher  13414.12
4  5      1_a   May  andrew  23    teacher  13414.12
5  6      1_b   Jan     Ash  45  scientist   1975.42
6  7      1_b   Feb     Ash  45  scientist   1975.42
7  8      1_b   Mar     Ash  45  scientist   1975.42
8  9      1_b   Apr     Ash  45  scientist   1975.42

Process finished with exit code 0

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM