Pandas: add percentage column
I have a pandas DataFrame:
print(df)
call_id calling_number call_status
1 123 BUSY
2 456 BUSY
3 789 BUSY
4 123 NO_ANSWERED
5 456 NO_ANSWERED
6 789 NO_ANSWERED
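Records with other call_status values (for example "ERROR", or something else I cannot predict) may also appear in the dataframe, and I need a new column to be added dynamically for each such value. I applied the pivot_table() function and got the result I wanted: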
df1 = df.pivot_table(index='calling_number', columns='call_status', values='call_id', aggfunc='count').fillna(0).astype('int64')
calling_number ANSWERED BUSY NO_ANSWER
123 0 1 1
456 0 1 1
789 0 1 1
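Now I need to add one more column containing the percentage of answered calls for each calling_number, calculated as the ratio of ANSWERED to the total. The source dataframe 'df' may contain no entries with call_status = 'ANSWERED', in which case the percentage column should naturally be zero.
Expected result: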
calling_number ANSWERED BUSY NO_ANSWER ANS_PERC(%)
123 0 1 1 0
456 0 1 1 0
789 0 1 1 0
Use crosstab:
df1 = pd.crosstab(df['calling_number'], df['call_status'])
Or, if you need the count function so that NaN values are excluded, use pivot_table with the added parameter fill_value=0:
df1 = df.pivot_table(index='calling_number',
                     columns='call_status',
                     values='call_id',
                     aggfunc='count',
                     fill_value=0)
Then, to get the ratios, divide by the sum of each row:
df1 = df1.div(df1.sum(axis=1), axis=0)
print (df1)
ANSWERED BUSY NO_ANSWER
calling_number
123 0.333333 0.333333 0.333333
456 0.333333 0.333333 0.333333
789 0.333333 0.333333 0.333333
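As a self-contained sketch of the two steps above, the snippet below rebuilds the sample frame from the question (the values are copied from it) and divides each row by its total. It also shows that `pd.crosstab` with `normalize='index'` appears to give the same row-wise ratios in one call; that shortcut is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (it contains no ANSWERED rows)
df = pd.DataFrame({
    'call_id': [1, 2, 3, 4, 5, 6],
    'calling_number': [123, 456, 789, 123, 456, 789],
    'call_status': ['BUSY', 'BUSY', 'BUSY',
                    'NO_ANSWERED', 'NO_ANSWERED', 'NO_ANSWERED'],
})

# Step 1: count calls per number and status
counts = pd.crosstab(df['calling_number'], df['call_status'])

# Step 2: divide every row by its own total
ratios = counts.div(counts.sum(axis=1), axis=0)

# Alternative: let crosstab normalise each row directly
ratios_alt = pd.crosstab(df['calling_number'], df['call_status'],
                         normalize='index')
print(ratios)
```

Note that with this data there is no ANSWERED column at all, because crosstab only creates columns for statuses that actually occur.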
EDIT: To add categories that may not exist in the data, use DataFrame.reindex:
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
         .reindex(columns=['ANSWERED','BUSY','NO_ANSWERED'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1['ANSWERED'].sum()).fillna(0)
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
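If you need the ratio to each row's total instead:
# sum only the count columns, so an existing ANS_PERC(%) column is not included in the total
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1[['ANSWERED','BUSY','NO_ANSWERED']].sum(axis=1))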
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ANS_PERC(%)
calling_number
123 0 1 1 0.0
456 0 1 1 0.0
789 0 1 1 0.0
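Because the question's data has no ANSWERED rows, every percentage above is zero. The sketch below wraps the reindex approach in a helper and runs it on made-up data that does contain answered calls. The function name `answered_share`, the sample values, and the choice to scale by 100 (to match the "(%)" in the column name) are assumptions for illustration, not part of the answer.

```python
import pandas as pd

def answered_share(df, statuses=('ANSWERED', 'BUSY', 'NO_ANSWERED')):
    """Count calls per number and status, then add the answered percentage."""
    statuses = list(statuses)
    # reindex guarantees one column per expected status, filled with 0 if absent
    # (statuses outside the list are dropped and do not count towards the total)
    counts = (pd.crosstab(df['calling_number'], df['call_status'])
                .reindex(columns=statuses, fill_value=0))
    total = counts[statuses].sum(axis=1)
    # fillna(0) covers a 0/0 division for numbers with no expected statuses
    counts['ANS_PERC(%)'] = counts['ANSWERED'].div(total).fillna(0) * 100
    return counts

# Hypothetical data: 123 answered 1 of 2 calls, 456 none, 789 both
sample = pd.DataFrame({
    'call_id': [1, 2, 3, 4, 5, 6],
    'calling_number': [123, 123, 456, 456, 789, 789],
    'call_status': ['ANSWERED', 'BUSY', 'BUSY',
                    'NO_ANSWERED', 'ANSWERED', 'ANSWERED'],
})
result = answered_share(sample)
print(result)
```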
EDIT 1:
A solution that replaces any unexpected values with ERROR:
print (df)
call_id calling_number call_status
0 1 123 ttt
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
L = ['ANSWERED', 'BUSY', 'NO_ANSWERED']
df['call_status'] = df['call_status'].where(df['call_status'].isin(L), 'ERROR')
print (df)
call_id calling_number call_status
0 1 123 ERROR
1 2 456 BUSY
2 3 789 BUSY
3 4 123 NO_ANSWERED
4 5 456 NO_ANSWERED
5 6 789 NO_ANSWERED
df1 = (pd.crosstab(df['calling_number'], df['call_status'])
         .reindex(columns=L + ['ERROR'], fill_value=0))
df1['ANS_PERC(%)'] = df1['ANSWERED'].div(df1.sum(axis=1))
print (df1)
call_status ANSWERED BUSY NO_ANSWERED ERROR ANS_PERC(%)
calling_number
123 0 0 1 1 0.0
456 0 1 1 0 0.0
789 0 1 1 0 0.0
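The question also asks for columns to appear dynamically for statuses that cannot be predicted. If collapsing them into ERROR is not wanted, one possible variation (a sketch, not part of the answer above) is to keep each unexpected status as its own column by appending whatever extra statuses crosstab finds to the expected list. The names `expected` and `extra` are introduced here for illustration.

```python
import pandas as pd

# Data from EDIT 1 above, including the unexpected status 'ttt'
df = pd.DataFrame({
    'call_id': [1, 2, 3, 4, 5, 6],
    'calling_number': [123, 456, 789, 123, 456, 789],
    'call_status': ['ttt', 'BUSY', 'BUSY',
                    'NO_ANSWERED', 'NO_ANSWERED', 'NO_ANSWERED'],
})

expected = ['ANSWERED', 'BUSY', 'NO_ANSWERED']
counts = pd.crosstab(df['calling_number'], df['call_status'])

# Statuses present in the data but not in the expected list, in sorted order
extra = sorted(set(counts.columns) - set(expected))

# Expected statuses first (0 if absent), then one column per unexpected status
counts = counts.reindex(columns=expected + extra, fill_value=0)

# Compute the total before adding the ratio column, so it only sums counts
total = counts.sum(axis=1)
counts['ANS_PERC(%)'] = counts['ANSWERED'].div(total)
print(counts)
```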
I like the crosstab idea, but I'm a fan of column operations, since they are easy to refer back to:
# Define a function that collects all other call statuses into one bucket
def tester(x):
    if x not in ['ANSWERED', 'BUSY', 'NO_ANSWERED']:
        return 'OTHER'
    else:
        return x

# Capture the simplified status in a new column
df['refined_status'] = df['call_status'].apply(tester)

# Do the pivot (or crosstab) to capture the counts; reindex so that every
# expected column exists even when a status is absent from the data
statuses = ['ANSWERED', 'BUSY', 'NO_ANSWERED', 'OTHER']
df1 = (df.pivot_table(values='call_id', index='calling_number',
                      columns='refined_status', aggfunc='count', fill_value=0)
         .reindex(columns=statuses, fill_value=0))

# Apply a division to get the percentages
df1['TOTAL'] = df1[statuses].sum(axis=1)
df1['ANS_PERC'] = df1['ANSWERED'] / df1['TOTAL'] * 100
print(df1)
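In the same column-oriented spirit, the percentage can also be computed without pivoting first: the mean of a boolean "is answered" column per calling_number is the answered fraction. This is an alternative sketch rather than something either answer proposes, and the sample data below is made up so that the result is not all zeros.

```python
import pandas as pd

# Hypothetical data: 123 answered 1 of 2 calls, 456 none, 789 both
df = pd.DataFrame({
    'call_id': [1, 2, 3, 4, 5, 6],
    'calling_number': [123, 123, 456, 456, 789, 789],
    'call_status': ['ANSWERED', 'BUSY', 'BUSY',
                    'NO_ANSWERED', 'ANSWERED', 'ANSWERED'],
})

# True where the call was answered; the group mean of booleans is the fraction
answered_pct = (df['call_status'].eq('ANSWERED')
                  .groupby(df['calling_number'])
                  .mean()
                  .mul(100)
                  .rename('ANS_PERC'))
print(answered_pct)
```

Since this never looks up an ANSWERED column, it yields 0 for every number when no call was answered, without needing reindex.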