
Calculating Percent of Total using groupby

I am having trouble finding a simple way to get the market share of products out of the total market.

For example, I have a dataframe like the one below. Let's say products A, B and C belong to a market called 1, and D, E and F belong to markets 2, 3 and 4 respectively. For each unique quarter (i.e. 1/6/2020 is in quarter 2 of 2020), I want to find the market share of A, B and C out of the total market. For example, if we want the market share of A, B and C (the USA market) in quarter 2 of 2020, we take 100+200+300 and divide by 100+200+300+400+500+600, which gives 600/2100 = 28.57%.

I am not sure what the right approach is. So far I have had to turn the whole dataframe into a 2D list and use for loops. I hope there is a neater and cleaner way to solve this.

  Product   Date       Value   
0   A        1/6/2020   100
1   B        1/6/2020   200
2   C        1/6/2020   300
3   D        1/6/2020   400
4   E        1/6/2020   500
5   F        1/6/2020   600
6   A        1/9/2020   600
7   B        1/9/2020   500
8   C        1/9/2020   400
9   D        1/9/2020   300
10  E        1/9/2020   200
11  F        1/9/2020   100

You're on the right track in considering groupby!

Your dataframe needs to have the dimensions you mentioned, though: the market and the quarter. In addition, you probably want your Date column to be a datetime64.

Here is a code block that constructs a dataframe similar to what you currently have:

import pandas as pd

df = pd.DataFrame()
df["Product"] = ["A", "B", "C", "D", "E", "F"] * 2
df["Date"] = ["1/6/2020"] * 6 + ["1/9/2020"] * 6
# The question treats 1/6/2020 as quarter 2, so parse the dates as day/month/year
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Value"] = [100, 200, 300, 400, 500, 600] * 2
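A date string such as 1/6/2020 is ambiguous, and the question reads it as 1 June (quarter 2). The sketch below uses only the two date strings from the question and shows how the two readings land in different quarters. The variable names are just for illustration.

```python
import pandas as pd

raw = pd.Series(["1/6/2020", "1/9/2020"])

# Day/month/year reading: 1 June 2020 and 1 September 2020.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")

# Month/day/year reading: 6 January 2020 and 9 January 2020.
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

print(day_first.dt.to_period("Q").astype(str).tolist())    # ['2020Q2', '2020Q3']
print(month_first.dt.to_period("Q").astype(str).tolist())  # ['2020Q1', '2020Q1']
```

Passing an explicit format avoids relying on pandas to guess which reading you meant.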

You might want to add a "Market" column, perhaps by defining a mapping from product to market and applying it to your dataframe. Similarly, you can compute the quarter for each entry (although in your example, you seem to want to treat the date itself as the quarter).

products_to_markets = {
    "A": "USA", "B": "USA", "C": "USA",
    "D": "Canada", "E": "Canada", "F": "Canada"
}
df["Market"] = df["Product"].map(products_to_markets)
df["Quarter"] = df["Date"].dt.to_period("Q")
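One thing to watch with a dictionary mapping: Series.map leaves NaN for any product that is missing from the dictionary, and groupby drops NaN keys by default, so such rows would silently vanish from the per-market totals. Here is a minimal check, using a made-up product "Z" that is deliberately left out of the mapping (both "Z" and the `unmapped` name are only for illustration):

```python
import pandas as pd

# Hypothetical data: product "Z" has no entry in the mapping below.
df = pd.DataFrame({"Product": ["A", "B", "Z"], "Value": [100, 200, 300]})
products_to_markets = {"A": "USA", "B": "USA"}

df["Market"] = df["Product"].map(products_to_markets)

# Rows whose product was not found in the dictionary have NaN in "Market".
unmapped = df.loc[df["Market"].isna(), "Product"].unique().tolist()
print(unmapped)  # ['Z']
```

If `unmapped` is not empty, the market shares within a quarter will no longer add up to 1, so it is worth checking before doing the groupby.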

Now you can begin to perform some of the other calculations you're interested in. For instance, you can see the total value per market per quarter:

# Select "Value" explicitly so the text and datetime columns are not summed
df.groupby(["Quarter", "Market"])[["Value"]].sum()

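I think what you're looking for is something like this:

value_per_quarter = df.groupby("Quarter")[["Value"]].sum()
df.groupby(["Quarter", "Market"])[["Value"]].sum() / value_per_quarter

With the dates parsed as day/month/year, so that 1/6/2020 falls in 2020Q2 and 1/9/2020 in 2020Q3, this yields:

                   Value
Quarter Market
2020Q2  Canada  0.714286
        USA     0.285714
2020Q3  Canada  0.714286
        USA     0.285714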
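If you would rather keep the share next to each original row, a variant is to use groupby with transform("sum"), which broadcasts each group total back onto the rows. This is only a sketch of that alternative, not the approach above. It assumes the exact values from the question's table, the markets 1 to 4 as described there, and day/month/year dates. The "MarketShare" column name is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C", "D", "E", "F"] * 2,
    "Date": ["1/6/2020"] * 6 + ["1/9/2020"] * 6,
    "Value": [100, 200, 300, 400, 500, 600, 600, 500, 400, 300, 200, 100],
})
# Markets as described in the question: A, B, C in market 1; D, E, F in 2, 3, 4.
df["Market"] = df["Product"].map({"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 4})
df["Quarter"] = pd.to_datetime(df["Date"], format="%d/%m/%Y").dt.to_period("Q")

# Total value of the whole quarter, repeated on every row of that quarter.
quarter_total = df.groupby("Quarter")["Value"].transform("sum")
# Total value of the row's market within its quarter, repeated on every row.
market_total = df.groupby(["Quarter", "Market"])["Value"].transform("sum")

# Share of the row's market out of the whole quarter (hypothetical column name).
df["MarketShare"] = market_total / quarter_total

print(df[["Product", "Quarter", "Market", "MarketShare"]])
```

For product A in 2020Q2 this gives 600/2100, the 28.57% worked out by hand in the question.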

First, create a dataframe that maps your products to your markets.

Then use pd.crosstab() to get a pivot table. The argument normalize='index' gives you each value as a fraction of its row total.

import pandas as pd
from io import StringIO

text = """
  Product   Date       Value   
0   A        1/6/2020   100
1   B        1/6/2020   200
2   C        1/6/2020   300
3   D        1/6/2020   400
4   E        1/6/2020   500
5   F        1/6/2020   600
6   A        1/9/2020   600
7   B        1/9/2020   500
8   C        1/9/2020   400
9   D        1/9/2020   300
10  E        1/9/2020   200
11  F        1/9/2020   100
"""

# create sample dataframe
# (raw string for the regex separator; the unnamed first column becomes the index)
df = pd.read_csv(StringIO(text), header=0, sep=r'\s+')

# create translation of products to markets
market_df = pd.DataFrame([
    ['A', 1], ['B', 1], ['C', 1], 
    ['D', 2], ['E', 3], ['F', 4]], 
    columns=['Product', 'Market'],
)

# merge to get products mapped to markets
merged_df = pd.merge(
    df, 
    market_df, 
    how='left', 
    on='Product',
)

# crosstab calculates totals per market and date
# normalize='index' calculates percentages over rows
pd.crosstab(
    merged_df['Date'],
    merged_df['Market'], 
    merged_df['Value'], 
    aggfunc='sum', 
    normalize='index',
)

Resulting dataframe:

            Market  
Date        1           2           3           4           
1/6/2020    0.285714    0.190476    0.238095    0.285714
1/9/2020    0.714286    0.142857    0.095238    0.047619
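As a quick sanity check on the crosstab result, you can confirm that each row sums to 1 and convert the fractions to percentages. The sketch below rebuilds the merged data directly, instead of going through read_csv and merge, to stay self-contained. The names `shares`, `row_sums` and `percent` are only for illustration.

```python
import pandas as pd

# Same content as merged_df above, built directly for a self-contained example.
merged_df = pd.DataFrame({
    "Date": ["1/6/2020"] * 6 + ["1/9/2020"] * 6,
    "Market": [1, 1, 1, 2, 3, 4] * 2,
    "Value": [100, 200, 300, 400, 500, 600, 600, 500, 400, 300, 200, 100],
})

shares = pd.crosstab(
    merged_df["Date"],
    merged_df["Market"],
    values=merged_df["Value"],
    aggfunc="sum",
    normalize="index",
)

# With normalize="index", every row (one per date) should sum to 1.
row_sums = shares.sum(axis=1)

# The same shares expressed as percentages, rounded to two decimals.
percent = (shares * 100).round(2)
print(percent)
```

Market 1 on 1/6/2020 comes out at 28.57, matching the figure in the question.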
