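Replace values in a pandas dataframe
I have a pandas DataFrame generated from events.
Each event has a unique ID, and each event produces duplicate rows in the DataFrame.
The problem is that some of these duplicate rows contain random values that differ from one another.
I need to replace the values in the columns (Name, Age, Occupation) with the most common value for each event_id.
The Salary column also has a trailing underscore (e.g. 1975.42_) that needs to be removed.
Thanks in advance.
Input data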
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 13414.12
2 1_a Feb NaN teacher 13414.12
3 1_a Mar ___ 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42
Expected output:
print(df)
ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 24 principle 25000
2 1_a Feb andrew 24 principle 25000
3 1_a Mar andrew 24 principle 25000
4 1_a Apr andrew 24 principle 25000
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42
7 1_b Feb Ash 45 scientist 1975.42
8 1_b Mar Ash 45 scientist 1975.42
9 1_b Apr Ash 45 scientist 1975.42
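First, I had to build the DataFrame. Unfortunately, I could not split the values out of a raw string that contains blank cells, so I filled the blanks with placeholders; with your real DataFrame this should not be a problem.
OK, now the logic:
The code collects the unique event values, then iterates over the columns for each event. Using collections.Counter, I get a dictionary that counts how often each value occurs in the event-filtered column, and then overwrite every value in that column with the most frequent one.
This only breaks down when a garbage value occurs more often than the good value. For example: if an event-filtered column holds 30 garbage values, all different, and the good value is repeated just twice, the good value is still the replacement.
If the column holds 30 garbage values and the good value appears only once, the replacement may be a garbage value (ties go to whichever value appears first).
Here is the full code: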
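As a side note on the splitting step: once blank cells are filled with a placeholder such as "-", pandas can parse the whitespace-separated text directly. The sketch below is only an illustration under that assumption; the variable name `raw` and the three-row sample are made up for the example, and `dtype=str` with `keep_default_na=False` are chosen so that every cell stays as literal text, the way `str.split()` would leave it.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; "-" stands in for cells that were blank.
raw = """ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 - 13414.12
2 1_a Feb - NA teacher 13414.12
6 1_b Jan Ash 45 scientist 1975.42_"""

# sep=r"\s+" splits on runs of whitespace. dtype=str and keep_default_na=False
# keep each cell as the text that was typed, so "NA" is not turned into NaN.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype=str, keep_default_na=False)
print(df)
```

This does not help with the original input, where blank cells leave rows with fewer than seven fields; it only replaces the manual split once placeholders are in place.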
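The two cases above can be checked in isolation with a small sketch. The helper name `most_common_value` and the sample lists are hypothetical; the selection logic is the same `Counter` plus `max` combination the answer relies on.

```python
from collections import Counter


def most_common_value(values):
    """Return the most frequent value; ties go to the first value seen."""
    counts = Counter(values)
    # max() returns the first key with the highest count, and Counter keeps
    # keys in first-seen order, so a tie resolves to the earliest value.
    return max(counts, key=counts.get)


# Good value repeated twice among distinct garbage values: the good value wins.
repeated = most_common_value(["#$%6", "Ash", "^#3a", "Ash", "___"])
print(repeated)  # Ash

# Good value appears once: every count is 1, so the first value seen wins.
tied = most_common_value(["#$%6", "Ash", "^#3a"])
print(tied)  # #$%6
```

So the failure is deterministic rather than random: with all counts equal, whichever value comes first in the column is used.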
import collections

import pandas as pd

# Sample data; "-" marks cells that were blank in the question so that
# every row splits into exactly seven whitespace-separated fields.
data = """ID event_id Month Name Age Occupation Salary
1 1_a Jan andrew 23 - 13414.12
2 1_a Feb - NA teacher 13414.12
3 1_a Mar ___ - z 13414.12
4 1_a Apr andrew 23 teacher 13414.12
5 1_a May andrew 24 principle 25000
6 1_b Jan Ash 45 scientist 1975.42_
7 1_b Feb #$%6 - scientist 1975.42
8 1_b Mar Ash 45 ^#3a2g4 1975.42
9 1_b Apr Ash 45 scientist 1975.42"""

# Drop the header line and split each row on whitespace.
rows = [line.split() for line in data.split('\n')[1:]]
df = pd.DataFrame(rows, columns=['ID', 'event_id', 'Month', 'Name', 'Age', 'Occupation', 'Salary'])

columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in df['event_id'].unique():
    mask = df['event_id'] == event  # rows belonging to this event
    for column in columns:
        # Count how often each value occurs in this event's column.
        counter = collections.Counter(df.loc[mask, column])
        # Most frequent value; ties go to the first value encountered.
        new_value = max(counter, key=counter.get)
        # Assign through .loc instead of df[column][i] to avoid chained assignment.
        df.loc[mask, column] = new_value

print(df)
Output:
ID event_id Month Name Age Occupation Salary
0 1 1_a Jan andrew 23 teacher 13414.12
1 2 1_a Feb andrew 23 teacher 13414.12
2 3 1_a Mar andrew 23 teacher 13414.12
3 4 1_a Apr andrew 23 teacher 13414.12
4 5 1_a May andrew 23 teacher 13414.12
5 6 1_b Jan Ash 45 scientist 1975.42
6 7 1_b Feb Ash 45 scientist 1975.42
7 8 1_b Mar Ash 45 scientist 1975.42
8 9 1_b Apr Ash 45 scientist 1975.42
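For comparison, the same per-event replacement can be written without explicit loops over rows, using `groupby` and `transform`. This is a sketch of an alternative, not the method used above: the helper `most_common` is hypothetical, the data is a reduced excerpt of the question's values, and Salary is assumed to be stored as strings. It also strips the trailing underscore from Salary first, which the loop-based code only handles indirectly by outvoting it.

```python
import pandas as pd

# Reduced excerpt of the question's data (assumed to be strings throughout).
df = pd.DataFrame({
    "event_id": ["1_a", "1_a", "1_a", "1_b", "1_b", "1_b"],
    "Name": ["andrew", "___", "andrew", "Ash", "#$%6", "Ash"],
    "Salary": ["13414.12", "13414.12", "25000", "1975.42_", "1975.42", "1975.42"],
})

# Remove trailing underscores so "1975.42_" and "1975.42" count as one value.
df["Salary"] = df["Salary"].str.rstrip("_")


def most_common(series):
    # value_counts() tallies each distinct value; idxmax() returns the
    # label with the highest count.
    return series.value_counts().idxmax()


for column in ["Name", "Salary"]:
    # transform() broadcasts the single result back to every row of the group.
    df[column] = df.groupby("event_id")[column].transform(most_common)

print(df)
```

Note that `value_counts()` ignores NaN by default, so real missing values never win the vote; placeholder strings such as "-" still count as ordinary values.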