Selecting top % of rows in pandas

I have a sample dataframe as below (the actual dataset has roughly 300k entries):


        user_id   revenue  
 ----- --------- --------- 
    0       234       100  
    1      2873       200  
    2       827       489  
    3        12       237  
    4      8942     28934  
  ...       ...       ...  
   96       498    892384  
   97      2345        92  
   98       239      2803  
   99      4985     98332  
  100       947      4588  

which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).

The method that comes closest to mind is calculating the total number of users, working out 20% of this, sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and more elegant way.
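For reference, the count-based approach described above can be sketched as follows. This is only a minimal illustration: the ten-row sample data and the names `fraction`, `n_rows` and `top_by_count` are assumptions made for the example, not part of the real dataset.

```python
import pandas as pd

# Small illustrative sample (assumed data, not the real 300k-row dataset)
df = pd.DataFrame({
    "user_id": [234, 2873, 827, 12, 8942, 498, 2345, 239, 4985, 947],
    "revenue": [21, 20, 23, 23, 28, 22, 20, 24, 21, 25],
})

fraction = 0.2
# Number of rows that make up the top 20% of users (at least one row)
n_rows = max(1, int(len(df) * fraction))

# nlargest sorts by revenue and keeps the first n_rows in one step
top_by_count = df.nlargest(n_rows, "revenue")
print(top_by_count)
```

Note that this selects the top 20% of users by count, which is not the same as the users who together account for the first 20% of total revenue.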

Can anybody propose a way to do this? Thank you!

Suppose you have a dataframe df:

user_id revenue
234     21  
2873    20  
827     23  
12      23  
8942    28  
498     22  
2345    20  
239     24  
4985    21  
947     25

I've flattened the revenue distribution to show the idea. Now calculating step by step:

import pandas as pd

# Copy the table above to the clipboard before running this
df = pd.read_clipboard()
df = df.sort_values(by='revenue', ascending=False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum'] / df['revenue'].sum()
df

Result:

   user_id  revenue  revenue_cum  %revenue_cum
4     8942       28           28      0.123348
9      947       25           53      0.233480
7      239       24           77      0.339207
2      827       23          100      0.440529
3       12       23          123      0.541850
5      498       22          145      0.638767
0      234       21          166      0.731278
8     4985       21          187      0.823789
1     2873       20          207      0.911894
6     2345       20          227      1.000000

The top 2 users alone generate 23.3% of total revenue.

This seems to be a case for df.quantile. According to the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you want.

An example using your dataset:

import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})
df.quantile([0.8, 1], interpolation='nearest')

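This prints the 0.8 and 1.0 quantiles of each column. Note that the quantiles are computed for each column independently, so a printed row does not necessarily match a row of the original dataframe (here user 2873 actually has revenue 200, not 489):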

     user_id  revenue
0.8     2873      489
1.0     8942    28934
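If the goal is to get the actual rows for the top 20% of users by revenue, one option is to compute the quantile of the revenue column only and use it as a cut-off in a boolean filter. The following is a minimal sketch of that idea; the names `cutoff` and `top_users` are illustrative, and the default linear interpolation is assumed.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [234, 2873, 827, 12, 8942],
                   'revenue': [100, 200, 489, 237, 28934]})

# 80th percentile of the revenue column (default linear interpolation)
cutoff = df['revenue'].quantile(0.8)

# Keep the original rows whose revenue is at or above the cut-off
top_users = df[df['revenue'] >= cutoff]
print(top_users)
```

Because the filter is applied to whole rows, each user_id stays paired with its own revenue. With only five rows, a single user (8942) is at or above the cut-off.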

I usually find it useful to use sort_values to see the cumulative effect of every row and then keep the rows up to some threshold:

# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)

# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()

# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]

The original df will be nicely sorted, with a clear indication of the top contributing rows, and the new 'top_percent' df will contain the rows that need to be analyzed in particular.
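One caveat of a `<=` filter on the cumulative percentage is that the row which crosses the threshold is left out, so the result is empty whenever the top row alone exceeds the threshold. A minimal sketch of a variant that keeps the crossing row is shown below; the function name `top_revenue_rows` and the sample data are assumptions made for illustration.

```python
import pandas as pd

# Illustrative sample data (assumed for this example)
df = pd.DataFrame({
    "user_id": [234, 2873, 827, 12, 8942, 498, 2345, 239, 4985, 947],
    "revenue": [21, 20, 23, 23, 28, 22, 20, 24, 21, 25],
})

def top_revenue_rows(frame, col, percent):
    """Return the highest rows that together reach `percent` of the column total."""
    ordered = frame.sort_values(by=col, ascending=False)
    cum_pct = 100 * ordered[col].cumsum() / ordered[col].sum()
    # Cumulative percentage accumulated *before* each row
    before = cum_pct.shift(fill_value=0)
    # A row is kept while the rows above it have not yet reached the target
    return ordered[before < percent]

top = top_revenue_rows(df, "revenue", 20)
print(top)
```

Here the first row covers about 12.3% of revenue and the first two rows about 23.3%, so both are kept, whereas a plain `<= 20` filter would keep only the first.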

I am assuming you are looking for the cumulative top 20% revenue generating users. Here is a function that will help you get the expected output and even more. Just specify your dataframe, the column name of the revenue and the n_percent you are looking for:

import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    # Sort into a new frame so the caller's dataframe is not modified
    df = df.sort_values(by=[col], ascending=False)
    # Cumulative sum and cumulative percentage of the column
    df[f'{col}_cs'] = df[col].cumsum()
    df[f'{col}_csp'] = 100 * df[f'{col}_cs'] / df[col].sum()
    # Among the rows past n_percent, find the one closest to it
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp'] - n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    # Keep every row at or above that revenue (ties included)
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)
