
Replace values in a pandas dataframe

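I have a pandas dataframe that is generated from events. Each event has a unique ID, and each event produces duplicate rows in the dataframe.

The problem is that some of these duplicate rows contain random values that differ from one another.

I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value for each event_id.

The Salary column also has a trailing character (an underscore in the sample below, as in 1975.42_) that needs to be removed.

Thanks in advance.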

Input data:



print(df)

ID  event_id   Month    Name    Age Occupation Salary  
1   1_a        Jan      andrew  23             13414.12
2   1_a        Feb              NaN teacher    13414.12
3   1_a        Mar       ___                   13414.12
4   1_a        Apr      andrew  23  teacher    13414.12
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42_
7   1_b        Feb      #$%6        scientist  1975.42
8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

Expected output:

print(df)

ID  event_id   Month    Name    Age Occupation Salary
1   1_a        Jan      andrew  24  principle  25000
2   1_a        Feb      andrew  24  principle  25000
3   1_a        Mar      andrew  24  principle  25000
4   1_a        Apr      andrew  24  principle  25000
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42
7   1_b        Feb      Ash     45  scientist  1975.42
8   1_b        Mar      Ash     45  scientist  1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

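First I had to create the DataFrame. Unfortunately I could not split the values out of the raw string where cells were blank, so the sample in the code below uses placeholders such as - in the empty cells. With your own dataframe this should not be a problem.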
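If the raw text really is laid out in fixed-width columns, one possible way around the blank-cell problem is pandas.read_fwf, which slices each line by character position instead of splitting on whitespace. The sketch below is only an illustration under that assumption: the shortened sample, the variable names raw and df_fwf, and the column positions in colspecs are all made up for this example and would need to be adjusted to the real layout.

```python
import io

import pandas as pd

# Hypothetical, shortened sample in a fixed-width layout with blank cells.
raw = (
    "ID  event_id Name    Age\n"
    "1   1_a      andrew  23\n"
    "2   1_a              23\n"
    "3   1_b      Ash\n"
)

# Assumed (start, end) character positions of each column in the sample above.
colspecs = [(0, 4), (4, 13), (13, 21), (21, 24)]

# dtype=str keeps every value as text; blank cells are read as NaN.
df_fwf = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, dtype=str)

print(df_fwf)
```

Blank cells come back as NaN rather than as placeholder strings, so they can be told apart from real values afterwards.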

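OK, now the logic:

The code collects the unique event values, then iterates over the columns for each event. Using collections.Counter, I get a dictionary with the frequency of each value in the column filtered by event, and then set every value in that column to the most frequent one.

This only fails when a junk value occurs at least as often as the good value. For example, if a column filtered by event contains 30 different junk values but the good value appears 2 times, the good value will be the replacement.

If the column filtered by event contains 30 different junk values and the good value appears only once, every value is tied, and the replacement will be whichever value comes first in that column, which may well be junk.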
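The two cases above can be checked in isolation with a small sketch. The helper name most_frequent and the sample lists are made up for illustration; the selection itself is the same Counter plus max combination used in the answer's code.

```python
import collections


def most_frequent(values):
    """Return the most frequent value; on a tie, the first value seen wins."""
    counter = collections.Counter(values)
    # Counter keeps insertion order and max() returns the first maximal key.
    return max(counter, key=counter.get)


# The good value appears twice, each junk value once: the good value wins.
print(most_frequent(["x1", "good", "x2", "good", "x3"]))  # good

# Every value appears once: the first value in the column wins, here junk.
print(most_frequent(["x1", "good", "x2"]))  # x1
```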

Here is the code:

import collections

import pandas as pd

# Sample data; "-", "NA" and "z" are placeholders standing in for blank cells.
data =   """ID  event_id   Month    Name    Age Occupation Salary  
            1   1_a        Jan      andrew  23     -       13414.12
            2   1_a        Feb        -     NA  teacher    13414.12
            3   1_a        Mar       ___     -     z       13414.12
            4   1_a        Apr      andrew  23  teacher    13414.12
            5   1_a        May      andrew  24  principle  25000
            6   1_b        Jan      Ash     45  scientist  1975.42_
            7   1_b        Feb      #$%6     -  scientist  1975.42
            8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
            9   1_b        Apr      Ash     45  scientist  1975.42"""

# Skip the header line and split every remaining row on whitespace.
rows = [line.split() for line in data.split('\n')[1:]]

df = pd.DataFrame(rows, columns=['ID', 'event_id','Month', 'Name', 'Age', 'Occupation', 'Salary'])

columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in df['event_id'].unique():
    # Boolean mask selecting the rows that belong to this event.
    mask = df['event_id'] == event
    for column in columns:
        # Count how often each value occurs in this event's column.
        counter = collections.Counter(df.loc[mask, column])
        # Most frequent value; on a tie, the first value seen wins.
        new_value = max(counter, key=counter.get)
        # Assign through .loc instead of chained indexing (df[column][i] = ...),
        # which may silently write to a copy.
        df.loc[mask, column] = new_value

print(df)

Output:

  ID event_id Month    Name Age Occupation    Salary
0  1      1_a   Jan  andrew  23    teacher  13414.12
1  2      1_a   Feb  andrew  23    teacher  13414.12
2  3      1_a   Mar  andrew  23    teacher  13414.12
3  4      1_a   Apr  andrew  23    teacher  13414.12
4  5      1_a   May  andrew  23    teacher  13414.12
5  6      1_b   Jan     Ash  45  scientist   1975.42
6  7      1_b   Feb     Ash  45  scientist   1975.42
7  8      1_b   Mar     Ash  45  scientist   1975.42
8  9      1_b   Apr     Ash  45  scientist   1975.42

Process finished with exit code 0
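As an alternative to the explicit loop, the same replacement can probably be written with groupby and transform, and the trailing underscore in Salary can be stripped first with str.rstrip. This is a sketch, not the answer's method: the helper name most_common is made up, blank cells are represented as None instead of placeholder strings, and Series.mode breaks ties by sorted order rather than by first occurrence, so results can differ from the loop when values are tied.

```python
import pandas as pd

# Sample rows from the question; blank cells are represented as None (assumption).
df = pd.DataFrame({
    "event_id": ["1_a"] * 5 + ["1_b"] * 4,
    "Name": ["andrew", None, "___", "andrew", "andrew",
             "Ash", "#$%6", "Ash", "Ash"],
    "Age": ["23", None, None, "23", "24", "45", None, "45", "45"],
    "Occupation": [None, "teacher", None, "teacher", "principle",
                   "scientist", "scientist", "^#3a2g4", "scientist"],
    "Salary": ["13414.12"] * 4 + ["25000", "1975.42_", "1975.42",
                                  "1975.42", "1975.42"],
})

# Remove trailing underscores such as the one in "1975.42_".
df["Salary"] = df["Salary"].str.rstrip("_")


def most_common(series):
    # mode() ignores missing values and returns the most frequent value(s)
    # in sorted order; this raises IndexError if the whole group is missing.
    return series.mode().iloc[0]


for column in ["Name", "Age", "Occupation", "Salary"]:
    # transform broadcasts the per-event result back onto every row of the event.
    df[column] = df.groupby("event_id")[column].transform(most_common)

print(df)
```

Junk strings such as ___ are still counted like any other value here, so this only works while the good value is the most frequent one in its event, the same limitation as the loop.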

