Pandas pivot_table() aggfunc aggregation conditional on multiple columns?
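I want to aggregate a column with a Pandas pivot table, but the custom aggregation should be conditional on a different column in the dataframe.
See the example below: suppose I want to sum the "Number_mentions" column for each value in the "Newspaper" column, counting only values of "Number_mentions" that are above a threshold. That is easy to do with a custom aggfunc. But what if, in addition, I only want to include those "Number_mentions" that are not in the same row as the value "RU" in the "Country" column? It seems that aggfunc only receives one column in isolation from the others, and I don't know how to get the entire dataframe into the aggfunc so that I can do the conditional subsetting there.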
import pandas as pd

df = pd.DataFrame({"Number_mentions": [1, 5, 2, 3, 6, 5],
                   "Newspaper": ["Newspaper1", "Newspaper1", "Newspaper2", "Newspaper3", "Newspaper4", "Newspaper5"],
                   "Country": ["US", "US", "CN", "CN", "RU", "RU"]})

def articles_above_thresh_with_condition(input_series, thresh=2):
    # True for every value in the group that is above the threshold
    series_bool = input_series > thresh
    # ! add some if condition based on additional column in df:
    # ! only aggregate those values where column "Country" is not "RU".
    # ? code ?
    # Summing booleans counts the True entries
    n_articles_above_thresh = sum(series_bool)
    return n_articles_above_thresh

df_piv = pd.pivot_table(df, values=["Number_mentions"],
                        index="Newspaper", columns=None, margins=False,
                        aggfunc=articles_above_thresh_with_condition)
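You need a different approach, because pivot_table passes each values column to aggfunc on its own, so the function cannot work with 2 columns at once.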
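A quick way to see this limitation is to pass a probe function as aggfunc and record what it is called with. This is only a sketch: `probe` and `seen` are hypothetical names introduced for the illustration, and the data is the sample frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Number_mentions": [1, 5, 2, 3, 6, 5],
    "Newspaper": ["Newspaper1", "Newspaper1", "Newspaper2",
                  "Newspaper3", "Newspaper4", "Newspaper5"],
    "Country": ["US", "US", "CN", "CN", "RU", "RU"],
})

seen = []  # hypothetical helper: type names of the objects aggfunc receives

def probe(values):
    # Record the type of the argument, then return any scalar.
    seen.append(type(values).__name__)
    return len(values)

pd.pivot_table(df, values=["Number_mentions"], index="Newspaper", aggfunc=probe)

# Each call gets one group's "Number_mentions" values; "Country" is not visible.
print(set(seen))
```

Because the function only ever sees the values of one column for one group, any condition on "Country" has to be applied before the pivot.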
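So first use Series.where to replace the values that do not match the condition with missing values, then aggregate this new column. Comparisons with NaN evaluate to False, so the masked rows are never counted: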
df["Number_mentions1"] = df["Number_mentions"].where(df["Country"].ne('RU'))
print(df)
   Number_mentions   Newspaper Country  Number_mentions1
0                1  Newspaper1      US               1.0
1                5  Newspaper1      US               5.0
2                2  Newspaper2      CN               2.0
3                3  Newspaper3      CN               3.0
4                6  Newspaper4      RU               NaN
5                5  Newspaper5      RU               NaN
df_piv = pd.pivot_table(df, values=["Number_mentions1"],
                        index="Newspaper", columns=None, margins=False,
                        aggfunc=articles_above_thresh_with_condition)
print(df_piv)
            Number_mentions1
Newspaper
Newspaper1               1.0
Newspaper2               0.0
Newspaper3               1.0
Newspaper4               0.0
Newspaper5               0.0
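If a pivot table is not strictly required, the same counts can probably be obtained with groupby, which makes it easier to combine both columns. The following is a minimal sketch rather than part of the answer above: it assumes the sample data from the question, and `count_qualifying`, `via_mask`, and `via_apply` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "Number_mentions": [1, 5, 2, 3, 6, 5],
    "Newspaper": ["Newspaper1", "Newspaper1", "Newspaper2",
                  "Newspaper3", "Newspaper4", "Newspaper5"],
    "Country": ["US", "US", "CN", "CN", "RU", "RU"],
})

thresh = 2  # same threshold as the default in the question

# Option 1: build one boolean mask from both columns, then count per newspaper.
qualifies = (df["Number_mentions"] > thresh) & df["Country"].ne("RU")
via_mask = qualifies.groupby(df["Newspaper"]).sum()

# Option 2: groupby.apply hands the function a sub-DataFrame per group,
# so both columns are available inside it.
def count_qualifying(group, thresh=2):
    keep = (group["Number_mentions"] > thresh) & group["Country"].ne("RU")
    return int(keep.sum())

via_apply = (
    df.groupby("Newspaper")[["Number_mentions", "Country"]]
      .apply(count_qualifying)
)

print(via_mask)
print(via_apply)
```

Both options should give 1 for Newspaper1 and Newspaper3 and 0 for the others, matching the pivot table above. The mask version is vectorized; the apply version is closer to the "condition inside the function" shape the question asks for, at the cost of one Python call per group.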