Applying calculations to filtered values in Pandas DataFrame
I am new to pandas.
Consider my DataFrame:
df
Search             Impressions  Clicks  Transactions  ContainsBest  ContainsFree  Country
Best phone                  10       5             1          True         False       UK
Best free phone             15       4             2          True          True       UK
free phone                  20       3             4         False          True       UK
good phone                  13       1             5         False         False       US
just a free phone           12       3             4         False          True       US
I have the columns ContainsBest and ContainsFree. I want to sum all Impressions, Clicks and Transactions where ContainsBest is True, then sum Impressions, Clicks and Transactions where ContainsFree is True, and do the same for each unique value of the Country column. So the new DataFrame would look like this:
output_df
Country       Impressions  Clicks  Transactions
UK                     45      12             7
ContainsBest           25       9             3
ContainsFree           35       7             6
US                     25       4             9
ContainsBest            0       0             0
ContainsFree           12       3             4
For this, my understanding is that I would need to use something like the following:
uk_toal_impressions = df['Impressions'].sum().where(df['Country']=='UK')
uk_best_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsBest'])
uk_free_impressions = df['Impressions'].sum().where(df['Country']=='UK' & df['ContainsFree'])
Then I would apply the same logic to Clicks and Transactions, and repeat the same code for Country US.
The second thing I want to achieve is to add TopCategories columns per Country for Impressions, Clicks and Transactions, so that my final_output_df looks like this:
final_output_df
Country       Impressions  Clicks  Transactions  TopCategoriesForImpressions  TopCategoriesForClicks  TopCategoriesForTransactions
UK                     45      12             7  ContainsFree                 ContainsBest            ContainsFree
ContainsBest           25       9             3  ContainsBest                 ContainsFree            ContainsBest
ContainsFree           35       7             6
US                     25       4             9  ContainsFree                 ContainsFree            ContainsFree
ContainsBest            0       0             0
ContainsFree           12       3             4
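The logic for the TopCategoriesForxx columns is a simple sort of the ContainsBest and ContainsFree rows under each Country. So TopCategoriesForImpressions for the UK is ContainsFree (35) followed by ContainsBest (25), and TopCategoriesForClicks for the UK is ContainsBest (9) followed by ContainsFree (7).
I know I need to use something like this: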
TopCategoriesForImpressions = output_df['Impressions'].sort_values(by='Impressions', ascending=False).where(output_df['Country']=='UK')
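I just find it hard to put everything together into my final final_output_df. Also, I assume I don't need to create output_df; I only included it to make the steps toward final_output_df easier to follow.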
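So my questions are:
How do I compute the per-country sums for ContainsBest and ContainsFree?
How do I add the TopCategoriesForImpressions column (and the matching Clicks and Transactions columns)?
My real data has many Containsxxx columns. Is there a way to achieve this without adding a condition for each of 70 countries and 20 Containsxxx columns? Any suggestions are much appreciated.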
The first part of the solution could be:
import pandas as pd

# drop the unneeded Search column and add a ContainAll column filled with True
df1 = df.drop(columns='Search').assign(ContainAll=True)
#columns for tests
cols1 = ['Impressions','Clicks','Transactions']
cols2 = ['ContainsBest','ContainsFree','ContainAll']
print (df1[cols2].dtypes)
ContainsBest    bool
ContainsFree    bool
ContainAll      bool
dtype: object
print (df1[cols1].dtypes)
Impressions     int64
Clicks          int64
Transactions    int64
dtype: object
print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask'))
    Country  Impressions  Clicks  Transactions          Type   mask
0        UK           10       5             1  ContainsBest   True
1        UK           15       4             2  ContainsBest   True
2        UK           20       3             4  ContainsBest  False
3        US           13       1             5  ContainsBest  False
4        US           12       3             4  ContainsBest  False
5        UK           10       5             1  ContainsFree  False
6        UK           15       4             2  ContainsFree   True
7        UK           20       3             4  ContainsFree   True
8        US           13       1             5  ContainsFree  False
9        US           12       3             4  ContainsFree   True
10       UK           10       5             1    ContainAll   True
11       UK           15       4             2    ContainAll   True
12       UK           20       3             4    ContainAll   True
13       US           13       1             5    ContainAll   True
14       US           12       3             4    ContainAll   True
print (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask').query('mask'))
    Country  Impressions  Clicks  Transactions          Type   mask
0        UK           10       5             1  ContainsBest   True
1        UK           15       4             2  ContainsBest   True
6        UK           15       4             2  ContainsFree   True
7        UK           20       3             4  ContainsFree   True
9        US           12       3             4  ContainsFree   True
10       UK           10       5             1    ContainAll   True
11       UK           15       4             2    ContainAll   True
12       UK           20       3             4    ContainAll   True
13       US           13       1             5    ContainAll   True
14       US           12       3             4    ContainAll   True
# all possible combinations of Country and the boolean columns
mux = pd.MultiIndex.from_product([df['Country'].unique(), cols2],
                                 names=['Country', 'Type'])

# reshape with melt so all boolean columns become a single mask column,
# keep only the True rows with query and aggregate with sum,
# then add the missing combinations as 0 rows with reindex
df1 = (df1.melt(['Country'] + cols1, var_name='Type', value_name='mask')
          .query('mask')
          .drop('mask', axis=1)
          .groupby(['Country', 'Type'])
          .sum()
          .reindex(mux, fill_value=0)
          .reset_index())
print (df1)
   Country          Type  Impressions  Clicks  Transactions
0       UK  ContainsBest           25       9             3
1       UK  ContainsFree           35       7             6
2       UK    ContainAll           45      12             7
3       US  ContainsBest            0       0             0
4       US  ContainsFree           12       3             4
5       US    ContainAll           25       4             9
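As a cross-check of the totals above, here is a minimal sketch of the filter-then-sum idea from the question, written so that it runs. The DataFrame is rebuilt from the question's sample data; the names `metrics`, `flags`, `uk_best` and `check` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Search': ['Best phone', 'Best free phone', 'free phone',
               'good phone', 'just a free phone'],
    'Impressions': [10, 15, 20, 13, 12],
    'Clicks': [5, 4, 3, 1, 3],
    'Transactions': [1, 2, 4, 5, 4],
    'ContainsBest': [True, True, False, False, False],
    'ContainsFree': [False, True, True, False, True],
    'Country': ['UK', 'UK', 'UK', 'US', 'US'],
})

metrics = ['Impressions', 'Clicks', 'Transactions']  # illustrative name
flags = ['ContainsBest', 'ContainsFree']             # illustrative name

# Filter first, then sum. The comparison needs its own parentheses
# because & binds more tightly than == in Python.
uk_best = df.loc[(df['Country'] == 'UK') & df['ContainsBest'], metrics].sum()

# Same idea for every flag: multiplying by the boolean flag zeroes out the
# rows where it is False, so countries with no matching rows still get a 0 row.
check = pd.concat(
    {flag: df[metrics].mul(df[flag], axis=0).groupby(df['Country']).sum()
     for flag in flags},
    names=['Type', 'Country'],
)

print(uk_best)
print(check)
```

Because the loop runs over `flags` and `groupby` handles the countries, nothing has to be written per country or per Containsxxx column.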
Second, it is possible to filter the rows and use numpy.argsort to get the descending order within each group:
def f(x):
    # category labels (the index) ordered by descending value, column by column
    i = x.index.to_numpy()
    a = i[(-x.to_numpy()).argsort(axis=0)]
    return pd.DataFrame(a, columns=x.columns)

# keep only the category rows and skip rows where every metric is 0
df2 = (df1[df1['Type'].isin(['ContainsBest', 'ContainsFree']) &
           ~df1[cols1].eq(0).all(axis=1)]
       .set_index('Type')
       .groupby('Country')[cols1]
       .apply(f)
       .add_prefix('TopCategoriesFor')
       .rename_axis(['Country', 'Type'])
       .rename({0: 'ContainsBest', 1: 'ContainsFree'})
)
print (df2)
                     TopCategoriesForImpressions TopCategoriesForClicks  \
Country Type
UK      ContainsBest                ContainsFree           ContainsBest
        ContainsFree                ContainsBest           ContainsFree
US      ContainsBest                ContainsFree           ContainsFree

                     TopCategoriesForTransactions
Country Type
UK      ContainsBest                 ContainsFree
        ContainsFree                 ContainsBest
US      ContainsBest                 ContainsFree
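To make the argsort step in f easier to follow, here is a small standalone sketch applied only to the UK rows. The values are taken from df1 above; the names `uk`, `labels`, `order` and `ranked` are illustrative.

```python
import pandas as pd

# UK category rows from the aggregated table, indexed by category.
uk = pd.DataFrame(
    {'Impressions': [25, 35], 'Clicks': [9, 7], 'Transactions': [3, 6]},
    index=['ContainsBest', 'ContainsFree'],
)

labels = uk.index.to_numpy()

# Negating the values turns argsort's ascending order into a descending one;
# axis=0 orders each column independently.
order = (-uk.to_numpy()).argsort(axis=0)

# Indexing the labels with the positions yields the category names,
# best first, for every metric column.
ranked = pd.DataFrame(labels[order], columns=uk.columns)
print(ranked)
```

Row 0 of `ranked` holds the top category for each metric and row 1 the runner-up, which matches the UK block printed above.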
df3 = df1.join(df2, on=['Country','Type'])
print (df3)
   Country          Type  Impressions  Clicks  Transactions  \
0       UK  ContainsBest           25       9             3
1       UK  ContainsFree           35       7             6
2       UK    ContainAll           45      12             7
3       US  ContainsBest            0       0             0
4       US  ContainsFree           12       3             4
5       US    ContainAll           25       4             9

  TopCategoriesForImpressions TopCategoriesForClicks  \
0                ContainsFree           ContainsBest
1                ContainsBest           ContainsFree
2                         NaN                    NaN
3                ContainsFree           ContainsFree
4                         NaN                    NaN
5                         NaN                    NaN

  TopCategoriesForTransactions
0                 ContainsFree
1                 ContainsBest
2                          NaN
3                 ContainsFree
4                          NaN
5                          NaN
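With 20 Containsxxx columns, the hard-coded rename of positions 0 and 1 above would need one entry per category. One possible alternative, sketched here under assumptions rather than taken from the original answer, is to rank in long form and label each row by its rank. The names `agg`, `long`, `Rank` and `ranking` are illustrative, and this sketch drops individual zero values instead of rows where every metric is zero.

```python
import pandas as pd

# Aggregated sums per country and category (same values as df1 above).
agg = pd.DataFrame({
    'Country': ['UK', 'UK', 'UK', 'US', 'US', 'US'],
    'Type': ['ContainsBest', 'ContainsFree', 'ContainAll'] * 2,
    'Impressions': [25, 35, 45, 0, 12, 25],
    'Clicks': [9, 7, 12, 0, 3, 4],
    'Transactions': [3, 6, 7, 0, 4, 9],
})
metrics = ['Impressions', 'Clicks', 'Transactions']

# Long form: one row per (Country, Type, Metric), without the ContainAll totals.
long = (agg[agg['Type'] != 'ContainAll']
        .melt(['Country', 'Type'], value_vars=metrics,
              var_name='Metric', value_name='Value'))

# Assumption: categories with a zero value are left out of the ranking.
long = long[long['Value'] > 0].sort_values('Value', ascending=False, kind='stable')

# After the descending sort, cumcount numbers the rows within each
# (Country, Metric) group, which gives the rank.
long['Rank'] = long.groupby(['Country', 'Metric']).cumcount() + 1

# One row per (Country, Rank), one column per metric, holding category names.
ranking = (long.pivot(index=['Country', 'Rank'], columns='Metric', values='Type')
               .reindex(columns=metrics)
               .add_prefix('TopCategoriesFor'))
print(ranking)
```

The result is indexed by Country and Rank rather than by category name, so it does not need to know how many Containsxxx columns exist. Passing a list to `pivot(index=...)` requires pandas 1.1 or later.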