Calculating Percent of Total using groupby
I am having trouble finding a simple way to get the market share of a group of products out of the total market. My dataframe is shown below.
Let's say products A, B and C belong to a market called 1, and D, E and F belong to markets 2, 3 and 4 respectively. For each unique quarter (e.g. 1/6/2020 falls in quarter 2 of 2020), I want to find the market share of A, B and C out of the total market.
For example, to get the market share of A, B and C (the USA market) in quarter 2 of 2020, we take 100+200+300 and divide by 100+200+300+400+500+600, which gives 600/2100 = 28.57%.
I am not sure what the right approach is. So far I have had to turn the whole dataframe into a 2D list and use for loops. I hope there is a neater and cleaner way to solve this.
Product Date Value
0 A 1/6/2020 100
1 B 1/6/2020 200
2 C 1/6/2020 300
3 D 1/6/2020 400
4 E 1/6/2020 500
5 F 1/6/2020 600
6 A 1/9/2020 600
7 B 1/9/2020 500
8 C 1/9/2020 400
9 D 1/9/2020 300
10 E 1/9/2020 200
11 F 1/9/2020 100
You're on the right track to consider groupby!
Your dataframe needs to have the dimensions you mentioned, though: the market and the quarter. In addition, you probably want your Date column to be a datetime64.
Here is a code block that constructs a dataframe similar to what you have currently:
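import pandas as pd

df = pd.DataFrame()
df["Product"] = ["A", "B", "C", "D", "E", "F"] * 2
df["Date"] = ["1/6/2020"] * 6 + ["1/9/2020"] * 6
# Note: this conversion reads "1/6/2020" month-first (6 January 2020), so both
# dates land in 2020Q1. For day-first dates, use
# pd.to_datetime(df["Date"], format="%d/%m/%Y") instead.
df["Date"] = df["Date"].astype("datetime64[ns]")
df["Value"] = [100, 200, 300, 400, 500, 600] * 2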
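As a hedged aside on the datetime conversion: the question treats 1/6/2020 as 1 June 2020 (quarter 2), while month-first parsing reads it as 6 January. The sketch below only illustrates that difference, assuming the strings really are day/month/year; the variable names are illustrative and not part of the original code.

```python
import pandas as pd

dates = pd.Series(["1/6/2020", "1/9/2020"])

# Explicit day/month/year format: 1 June and 1 September 2020.
day_first = pd.to_datetime(dates, format="%d/%m/%Y")

# Explicit month/day/year format: 6 January and 9 January 2020.
month_first = pd.to_datetime(dates, format="%m/%d/%Y")

# Convert each to quarterly periods and render them as plain strings.
day_first_quarters = day_first.dt.to_period("Q").astype(str).tolist()
month_first_quarters = month_first.dt.to_period("Q").astype(str).tolist()

print(day_first_quarters)    # ['2020Q2', '2020Q3']
print(month_first_quarters)  # ['2020Q1', '2020Q1']
```

Passing an explicit format avoids relying on the parser's default guess about which field is the day.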
You might want to add a "Market" column, perhaps by defining a mapping from product to market and adding it to your dataframe. Similarly, you could compute the quarter for each entry (although in your example, you seem to be saying that you want to treat the date object as the quarter).
products_to_markets = {
"A": "USA", "B": "USA", "C": "USA",
"D": "Canada", "E": "Canada", "F": "Canada"
}
df["Market"] = df["Product"].map(products_to_markets)
df["Quarter"] = df["Date"].dt.to_period("Q")
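One caveat with a dict-based mapping, sketched here under the assumption that some product might be missing from the dict: Series.map returns NaN for keys it cannot find, and groupby drops NaN group keys by default, so such rows would silently disappear from the market totals. The product "G" below is made up for illustration.

```python
import pandas as pd

# "G" is a hypothetical product that has no entry in the mapping.
products = pd.Series(["A", "B", "G"], name="Product")
products_to_markets = {"A": "USA", "B": "USA"}

markets = products.map(products_to_markets)

# Products missing from the dict map to NaN; list them so they can be fixed.
unmapped = products[markets.isna()].unique().tolist()
print(unmapped)  # ['G']
```

Checking for an empty list before grouping is a cheap way to confirm every product has a market.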
Now you can begin to perform some of the other calculations you're interested in. For instance, you can see the total value per market per quarter:
# Select "Value" explicitly so the non-numeric columns (Product, Date) are not summed.
df.groupby(["Quarter", "Market"])[["Value"]].sum()
I think what you're looking for is something like this:
value_per_quarter = df.groupby("Quarter")[["Value"]].sum()
df.groupby(["Quarter", "Market"])[["Value"]].sum() / value_per_quarter
Which yields:
                   Value
Quarter Market
2020Q1  Canada  0.714286
        USA     0.285714
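For comparison, here is a minimal sketch of the same idea applied to the values from the question, assuming the dates are day/month/year and reusing the USA/Canada mapping from this answer (the question itself numbers the markets 1 to 4). It divides the per-market totals by the per-quarter totals, matching on the Quarter level of the index.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C", "D", "E", "F"] * 2,
    "Date": ["1/6/2020"] * 6 + ["1/9/2020"] * 6,
    "Value": [100, 200, 300, 400, 500, 600, 600, 500, 400, 300, 200, 100],
})

# Assumption: dates are day/month/year, so 1/6/2020 is 1 June 2020.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Quarter"] = df["Date"].dt.to_period("Q")

# Mapping borrowed from the answer above; adjust to your real markets.
markets = {"A": "USA", "B": "USA", "C": "USA",
           "D": "Canada", "E": "Canada", "F": "Canada"}
df["Market"] = df["Product"].map(markets)

# Total value per (quarter, market), then total value per quarter.
market_totals = df.groupby(["Quarter", "Market"])["Value"].sum()
quarter_totals = market_totals.groupby(level="Quarter").sum()

# Divide each market total by its quarter total, aligned on the Quarter level.
share = market_totals.div(quarter_totals, level="Quarter")

# Plain dict keyed by (quarter string, market) for easy inspection.
result = {(str(q), m): round(float(v), 4) for (q, m), v in share.items()}
print(result)
```

With these inputs the USA share for 2020Q2 comes out at 0.2857, which agrees with the 600/2100 worked out in the question.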
First you have to create a dataframe which maps your products to your markets.
Then use pd.crosstab() with the argument normalize='index' to get a nice pivot table giving you the percentages per row.
import pandas as pd
from io import StringIO
text = """
Product Date Value
0 A 1/6/2020 100
1 B 1/6/2020 200
2 C 1/6/2020 300
3 D 1/6/2020 400
4 E 1/6/2020 500
5 F 1/6/2020 600
6 A 1/9/2020 600
7 B 1/9/2020 500
8 C 1/9/2020 400
9 D 1/9/2020 300
10 E 1/9/2020 200
11 F 1/9/2020 100
"""
# create sample dataframe
# (the rows have one more field than the header, so the first column becomes the index)
df = pd.read_csv(StringIO(text), header=0, sep=r'\s+')
# create translation of products to markets
market_df = pd.DataFrame(
    [
        ['A', 1], ['B', 1], ['C', 1],
        ['D', 2], ['E', 3], ['F', 4],
    ],
    columns=['Product', 'Market'],
)
# merge to get products mapped to markets
merged_df = pd.merge(
    df,
    market_df,
    how='left',
    on='Product',
)
# crosstab calculates totals per market and date
# normalize='index' calculates percentages over rows
pd.crosstab(
    merged_df['Date'],
    merged_df['Market'],
    merged_df['Value'],
    aggfunc='sum',
    normalize='index',
)
Resulting dataframe:
Market           1         2         3         4
Date
1/6/2020  0.285714  0.190476  0.238095  0.285714
1/9/2020  0.714286  0.142857  0.095238  0.047619
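If the rows should be labelled by quarter rather than by the raw date string, one possible variation is sketched below. It assumes day/month/year dates and uses a plain dict in place of the merge; the names market_of, quarter and market are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "C", "D", "E", "F"] * 2,
    "Date": ["1/6/2020"] * 6 + ["1/9/2020"] * 6,
    "Value": [100, 200, 300, 400, 500, 600, 600, 500, 400, 300, 200, 100],
})

# Product-to-market mapping as described in the question (markets 1 to 4).
market_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 4}
market = df["Product"].map(market_of).rename("Market")

# Assumption: dates are day/month/year; quarters are kept as strings like "2020Q2".
quarter = (
    pd.to_datetime(df["Date"], format="%d/%m/%Y")
    .dt.to_period("Q")
    .astype(str)
    .rename("Quarter")
)

# Sum values per (quarter, market) and normalize each row to fractions of its total.
shares = pd.crosstab(
    quarter,
    market,
    values=df["Value"],
    aggfunc="sum",
    normalize="index",
)

# Column 1 holds the share of market 1 (products A, B and C) per quarter.
market_1_share = shares[1]
print(market_1_share)
```

Each row of shares sums to 1, so any single column can be read directly as that market's share per quarter.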