Pandas percentage of total with groupby

This is obviously simple, but as a numpy newbie I'm getting stuck.

I have a CSV file that contains 3 columns: the State, the Office ID, and the Sales for that office.

I want to calculate the percentage of sales per office in a given state (the percentages within each state total 100%).

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,  # list(...) is required on Python 3
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})

df.groupby(['state', 'office_id']).agg({'sales': 'sum'})

This returns:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

I can't seem to figure out how to "reach up" to the state level of the groupby to total up the sales for the entire state to calculate the fraction.

Update 2022-03

This answer by caner using transform looks much better than my original answer!

df['sales'] / df.groupby('state')['sales'].transform('sum')

Thanks to this comment by Paul Rougieux for surfacing it.

Original Answer (2014)

Paul H's answer is right that you will have to make a second groupby object, but you can calculate the percentage in a simpler way: just groupby the state_office and divide the sales column by its sum. Copying the beginning of Paul H's answer:

# From Paul H
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'office_id': list(range(1, 7)) * 2,
                   'sales': [np.random.randint(100000, 999999)
                             for _ in range(12)]})
state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
# Change: groupby state_office and divide by sum
state_pcts = state_office.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))

Returns:

                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

You need to make a second groupby object that groups by the states, and then use the div method:

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100


                     sales
state office_id           
AZ    2          16.981365
      4          19.250033
      6          63.768601
CA    1          19.331879
      3          33.858747
      5          46.809373
CO    1          36.851857
      3          19.874290
      5          43.273852
WA    2          34.707233
      4          35.511259
      6          29.781508

The level='state' kwarg in div tells pandas to broadcast/join the dataframes based on the values in the state level of the index.
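As a minimal sketch of that broadcast, here is the same div call on a small hand-made frame (the four rows and their values are illustrative assumptions, not the question's data), where the per-state totals are easy to check by eye:

```python
import pandas as pd

# Illustrative data: CA totals 400, WA totals 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

state_office = df.groupby(["state", "office_id"]).agg({"sales": "sum"})
state = df.groupby("state").agg({"sales": "sum"})

# level="state" matches every (state, office_id) row with its state's total.
pct = state_office.div(state, level="state") * 100
print(pct)
```

Each office's share is computed against its own state's total, so the values within a state add up to 100.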

(This solution is inspired by this article: https://pbpython.com/pandas_transform.html)

I find the following solution to be the simplest (and probably the fastest), using transformation:

Transformation: While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input.

So using transformation, the solution is a one-liner:

df['%'] = 100 * df['sales'] / df.groupby('state')['sales'].transform('sum')

And if you print:

print(df.sort_values(['state', 'office_id']).reset_index(drop=True))

   state  office_id   sales          %
0     AZ          2  195197   9.844309
1     AZ          4  877890  44.274352
2     AZ          6  909754  45.881339
3     CA          1  614752  50.415708
4     CA          3  395340  32.421767
5     CA          5  209274  17.162525
6     CO          1  549430  42.659629
7     CO          3  457514  35.522956
8     CO          5  280995  21.817415
9     WA          2  828238  35.696929
10    WA          4  719366  31.004563
11    WA          6  772590  33.298509
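To make the aggregation/transformation distinction concrete, here is a small sketch on a hand-made frame (the values are illustrative assumptions): the aggregate has one row per state, while the transform has one row per original row, which is why it can be divided into the sales column directly.

```python
import pandas as pd

# Illustrative data: CA totals 400, WA totals 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# Aggregation: reduced to one value per state.
state_totals = df.groupby("state")["sales"].sum()

# Transformation: same length as df, each row holding its state's total.
row_totals = df.groupby("state")["sales"].transform("sum")

pct = 100 * df["sales"] / row_totals
print(state_totals.shape, row_totals.shape)
print(pct.tolist())
```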

For conciseness I'd use the SeriesGroupBy:

In [11]: c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")

In [12]: c
Out[12]:
state  office_id
AZ     2            925105
       4            592852
       6            362198
CA     1            819164
       3            743055
       5            292885
CO     1            525994
       3            338378
       5            490335
WA     2            623380
       4            441560
       6            451428
Name: count, dtype: int64

In [13]: c / c.groupby(level=0).sum()
Out[13]:
state  office_id
AZ     2            0.492037
       4            0.315321
       6            0.192643
CA     1            0.441573
       3            0.400546
       5            0.157881
CO     1            0.388271
       3            0.249779
       5            0.361949
WA     2            0.411101
       4            0.291196
       6            0.297703
Name: count, dtype: float64

For multiple groups you have to use transform (using Radical's df):

In [21]: c =  df.groupby(["Group 1","Group 2","Final Group"])["Numbers I want as percents"].sum().rename("count")

In [22]: c / c.groupby(level=[0, 1]).transform("sum")
Out[22]:
Group 1  Group 2  Final Group
AAHQ     BOSC     OWON           0.331006
                  TLAM           0.668994
         MQVF     BWSI           0.288961
                  FXZM           0.711039
         ODWV     NFCH           0.262395
...
Name: count, dtype: float64

This seems to be slightly more performant than the other answers (just less than twice the speed of Radical's answer, for me ~0.08s).

I think this needs benchmarking. Using OP's original DataFrame,

df = pd.DataFrame({
    'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
    'office_id': list(range(1, 7)) * 2,  # list(...) is required on Python 3
    'sales': [np.random.randint(100000, 999999) for _ in range(12)]
})

1st Andy Hayden

As commented on his answer, Andy takes full advantage of vectorisation and pandas indexing.

c = df.groupby(['state', 'office_id'])['sales'].sum().rename("count")
c / c.groupby(level=0).sum()

3.42 ms ± 16.7 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)

2nd Paul H

state_office = df.groupby(['state', 'office_id']).agg({'sales': 'sum'})
state = df.groupby(['state']).agg({'sales': 'sum'})
state_office.div(state, level='state') * 100

4.66 ms ± 24.4 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)

3rd exp1orer

This is the slowest answer as it calculates x.sum() for each x in level 0.

For me, this is still a useful answer, though not in its current form. For quick EDA on smaller datasets, apply allows you to use method chaining to write this in a single line. We therefore remove the need to decide on a variable's name, which is actually very computationally expensive for your most valuable resource (your brain!!).

Here is the modification,

(
    df.groupby(['state', 'office_id'])
    .agg({'sales': 'sum'})
    .groupby(level=0)
    .apply(lambda x: 100 * x / float(x.sum()))
)

10.6 ms ± 81.5 µs per loop
(mean ± std. dev. of 7 runs, 100 loops each)

So no one is going to care about 6 ms on a small dataset. However, this is a 3x speed-up and, on a larger dataset with high-cardinality groupbys, this is going to make a massive difference.

Adding to the above code, we make a DataFrame with shape (12,000,000, 3) with 14412 state categories and 600 office_ids,

import string

import numpy as np
import pandas as pd
np.random.seed(0)

groups = [
    ''.join(i) for i in zip(
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
    np.random.choice(np.array([i for i in string.ascii_lowercase]), 30000),
                       )
]

df = pd.DataFrame({'state': groups * 400,
               'office_id': list(range(1, 601)) * 20000,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)] * 1000000
})

Using Andy's,

2 s ± 10.4 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

and exp1orer's,

19 s ± 77.1 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

So now we see a 10x speed-up on large, high-cardinality datasets.
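For anyone who wants to repeat the comparison outside IPython, a rough sketch with the standard-library timeit module follows. The frame size, the group labels and the helper names vectorised and with_apply are assumptions chosen so the script finishes quickly; absolute timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000  # assumption: kept small so the sketch runs in well under a second
df = pd.DataFrame({
    "state": rng.choice(list("ABCDEFGH"), n),
    "office_id": rng.integers(1, 50, n),
    "sales": rng.integers(100_000, 999_999, n),
})


def vectorised(frame):
    # Divide the per-office sums by the per-state sums; pandas aligns on "state".
    c = frame.groupby(["state", "office_id"])["sales"].sum()
    return 100 * c / c.groupby(level=0).sum()


def with_apply(frame):
    # Same result, but a Python lambda is called once per state.
    c = frame.groupby(["state", "office_id"])["sales"].sum()
    return c.groupby(level=0, group_keys=False).apply(lambda x: 100 * x / x.sum())


for fn in (vectorised, with_apply):
    seconds = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

The gap between the two should widen as the number of distinct groups grows, since the lambda version pays a fixed Python-call cost per group.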


Be sure to upvote these three answers if you upvote this one!!

I realize there are already good answers here.

I nevertheless would like to contribute my own, because I feel that for an elementary, simple question like this, there should be a short solution that is understandable at a glance.

It should also work in a way that I can add the percentages as a new column, leaving the rest of the dataframe untouched. Last but not least, it should generalize in an obvious way to the case in which there is more than one grouping level (e.g., state and country instead of only state).

The following snippet fulfills these criteria:

df['sales_ratio'] = df.groupby(['state'])['sales'].transform(lambda x: x/x.sum())

Note that if you're still using Python 2, you'll have to convert the denominator of the lambda term to float, e.g. float(x.sum()), to avoid integer division.
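The generalization to more than one grouping level can be sketched as follows. The country column and all values here are illustrative assumptions, not part of the question's data; only the list passed to groupby changes.

```python
import pandas as pd

# Illustrative data with two grouping levels.
df = pd.DataFrame({
    "country": ["US", "US", "US", "CA", "CA"],
    "state": ["WA", "WA", "OR", "BC", "BC"],
    "sales": [10, 30, 20, 5, 15],
})

# Share of each row within its (country, state) group; other columns are untouched.
df["sales_ratio"] = (
    df.groupby(["country", "state"])["sales"].transform(lambda x: x / x.sum())
)
print(df)
```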

I know that this is an old question, but exp1orer's answer is very slow for datasets with a large number of unique groups (probably because of the lambda). I built off of their answer to turn it into an array calculation so now it's super fast! Below is the example code:

Create the test dataframe with 50,000 unique groups

import random
import string
import pandas as pd
import numpy as np
np.random.seed(0)

# This is the total number of groups to be created
NumberOfGroups = 50000

# Create a lot of groups (random strings of 4 letters)
Group1     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//10)]*10  # // keeps the count an int on Python 3
Group2     = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups//2)]*2
FinalGroup = [''.join(random.choice(string.ascii_uppercase) for _ in range(4)) for x in range(NumberOfGroups)]

# Make the numbers
NumbersForPercents = [np.random.randint(100, 999) for _ in range(NumberOfGroups)]

# Make the dataframe
df = pd.DataFrame({'Group 1': Group1,
                   'Group 2': Group2,
                   'Final Group': FinalGroup,
                   'Numbers I want as percents': NumbersForPercents})

When grouped it looks like:

                             Numbers I want as percents
Group 1 Group 2 Final Group                            
AAAH    AQYR    RMCH                                847
                XDCL                                182
        DQGO    ALVF                                132
                AVPH                                894
        OVGH    NVOO                                650
                VKQP                                857
        VNLY    HYFW                                884
                MOYH                                469
        XOOC    GIDS                                168
                HTOY                                544
AACE    HNXU    RAXK                                243
                YZNK                                750
        NOYI    NYGC                                399
                ZYCI                                614
        QKGK    CRLF                                520
                UXNA                                970
        TXAR    MLNB                                356
                NMFJ                                904
        VQYG    NPON                                504
                QPKQ                                948
...
[50000 rows x 1 columns]

Array method of finding percentage:

# Initial grouping (basically a sorted version of df)
PreGroupby_df = df.groupby(["Group 1","Group 2","Final Group"]).agg({'Numbers I want as percents': 'sum'}).reset_index()
# Get the sum of values for the "final group", append "_Sum" to its column name, and change it into a dataframe (.reset_index)
SumGroup_df = df.groupby(["Group 1","Group 2"]).agg({'Numbers I want as percents': 'sum'}).add_suffix('_Sum').reset_index()
# Merge the two dataframes
Percents_df = pd.merge(PreGroupby_df, SumGroup_df)
# Divide the two columns
Percents_df["Percent of Final Group"] = Percents_df["Numbers I want as percents"] / Percents_df["Numbers I want as percents_Sum"] * 100
# Drop the extra _Sum column
Percents_df.drop(["Numbers I want as percents_Sum"], inplace=True, axis=1)

This method takes about 0.15 seconds.

Top answer method (using lambda function):

state_office = df.groupby(['Group 1','Group 2','Final Group']).agg({'Numbers I want as percents': 'sum'})
state_pcts = state_office.groupby(level=['Group 1','Group 2']).apply(lambda x: 100 * x / float(x.sum()))

This method takes about 21 seconds to produce the same result.

The result:

      Group 1 Group 2 Final Group  Numbers I want as percents  Percent of Final Group
0        AAAH    AQYR        RMCH                         847               82.312925
1        AAAH    AQYR        XDCL                         182               17.687075
2        AAAH    DQGO        ALVF                         132               12.865497
3        AAAH    DQGO        AVPH                         894               87.134503
4        AAAH    OVGH        NVOO                         650               43.132050
5        AAAH    OVGH        VKQP                         857               56.867950
6        AAAH    VNLY        HYFW                         884               65.336290
7        AAAH    VNLY        MOYH                         469               34.663710
8        AAAH    XOOC        GIDS                         168               23.595506
9        AAAH    XOOC        HTOY                         544               76.404494

The most elegant way to find percentages across columns or index is to use pd.crosstab.

Sample Data

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

The output dataframe looks like this:

print(df)

   state  office_id   sales
0     CA          1  764505
1     WA          2  313980
2     CO          3  558645
3     AZ          4  883433
4     CA          5  301244
5     WA          6  752009
6     CO          1  457208
7     AZ          2  259657
8     CA          3  584471
9     WA          4  122358
10    CO          5  721845
11    AZ          6  136928

Just specify the index, columns and the values to aggregate. The normalize keyword computes fractions across the index (rows), the columns, or the whole table, depending on the value you pass.

# normalize returns fractions (0.46, not 46), so format them with the % specifier
result = pd.crosstab(index=df['state'],
                     columns=df['office_id'],
                     values=df['sales'],
                     aggfunc='sum',
                     normalize='index').applymap('{:.0%}'.format)

print(result)

office_id    1    2    3    4    5    6
state
AZ          0%  20%   0%  69%   0%  11%
CA         46%   0%  35%   0%  18%   0%
CO         26%   0%  32%   0%  42%   0%
WA          0%  26%   0%  10%   0%  63%
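If the shares are needed for further arithmetic rather than for display, one option is to keep them numeric and scale by 100 instead of formatting them as strings. A minimal sketch on a small hand-made frame (the values are illustrative assumptions):

```python
import pandas as pd

# Illustrative data: CA totals 400, WA totals 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

# normalize="index" makes every row (state) sum to 1.
shares = pd.crosstab(index=df["state"], columns=df["office_id"],
                     values=df["sales"], aggfunc="sum", normalize="index")

percent = (100 * shares).round(2)  # still numeric, unlike the formatted strings
print(percent)
```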

You can sum the whole DataFrame and divide by the state total:

# Copying setup from Paul H answer
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
# Add a column with the sales divided by state total sales.
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

df

Returns:

    office_id   sales state  sales_ratio
0           1  405711    CA     0.193319
1           2  535829    WA     0.347072
2           3  217952    CO     0.198743
3           4  252315    AZ     0.192500
4           5  982371    CA     0.468094
5           6  459783    WA     0.297815
6           1  404137    CO     0.368519
7           2  222579    AZ     0.169814
8           3  710581    CA     0.338587
9           4  548242    WA     0.355113
10          5  474564    CO     0.432739
11          6  835831    AZ     0.637686

But note that this only works because all columns other than state are numeric, enabling summation of the entire DataFrame. For example, if office_id is a string column instead, you get an error:

df.office_id = df.office_id.astype(str)
df['sales_ratio'] = (df / df.groupby(['state']).transform(sum))['sales']

TypeError: unsupported operand type(s) for /: 'str' and 'str'
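One way to sidestep that error is to select the sales column before transforming, so the non-numeric columns never take part in the division. A minimal sketch, with a hand-made frame whose office_id is deliberately stored as strings (the values are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": ["1", "2", "1", "2"],  # strings on purpose
    "sales": [100, 300, 50, 150],
})

# Only the numeric column is summed and divided.
df["sales_ratio"] = df["sales"] / df.groupby("state")["sales"].transform("sum")
print(df)
```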

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

# Keep the result; the original df is not modified by the chain
weightage = (df.groupby(['state', 'office_id'])['sales'].sum()
               .rename("weightage")
               .groupby(level=0).transform(lambda x: x / x.sum()))
weightage.reset_index()

Output:

   state  office_id  weightage
0     AZ          2   0.169814
1     AZ          4   0.192500
2     AZ          6   0.637686
3     CA          1   0.193319
4     CA          3   0.338587
5     CA          5   0.468094
6     CO          1   0.368519
7     CO          3   0.198743
8     CO          5   0.432739
9     WA          2   0.347072
10    WA          4   0.355113
11    WA          6   0.297815

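I think this would do the trick in one line, as long as the division happens per state (calling transform directly on the summed frame would divide by the grand total instead):

df.groupby(['state', 'office_id']).sum().groupby(level='state').transform(lambda x: x / x.sum() * 100)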

A simple way I have used is a merge after the two groupbys, followed by a simple division.

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999) for _ in range(12)]})

state_office = df.groupby(['state', 'office_id'])['sales'].sum().reset_index()
state = df.groupby(['state'])['sales'].sum().reset_index()
state_office = state_office.merge(state, left_on='state', right_on ='state', how = 'left')
state_office['sales_ratio'] = 100*(state_office['sales_x']/state_office['sales_y'])

   state  office_id  sales_x  sales_y  sales_ratio
0     AZ          2   222579  1310725    16.981365
1     AZ          4   252315  1310725    19.250033
2     AZ          6   835831  1310725    63.768601
3     CA          1   405711  2098663    19.331879
4     CA          3   710581  2098663    33.858747
5     CA          5   982371  2098663    46.809373
6     CO          1   404137  1096653    36.851857
7     CO          3   217952  1096653    19.874290
8     CO          5   474564  1096653    43.273852
9     WA          2   535829  1543854    34.707233
10    WA          4   548242  1543854    35.511259
11    WA          6   459783  1543854    29.781508

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
               'office_id': list(range(1, 7)) * 2,
               'sales': [np.random.randint(100000, 999999)
                         for _ in range(12)]})

grouped = df.groupby(['state', 'office_id'])
100*grouped.sum()/df[["state","sales"]].groupby('state').sum()

Returns:

                     sales
state office_id
AZ    2          54.587910
      4          33.009225
      6          12.402865
CA    1          32.046582
      3          44.937684
      5          23.015735
CO    1          21.099989
      3          31.848658
      5          47.051353
WA    2          43.882790
      4          10.265275
      6          45.851935

As someone who is also learning pandas, I found the other answers a bit implicit, as pandas hides most of the work behind the scenes, namely in how the operation works by automatically matching up column and index names. This code should be equivalent to a step-by-step version of @exp1orer's accepted answer.

With the df, I'll call it by the alias state_office_sales:

                  sales
state office_id        
AZ    2          839507
      4          373917
      6          347225
CA    1          798585
      3          890850
      5          454423
CO    1          819975
      3          202969
      5          614011
WA    2          163942
      4          369858
      6          959285

state_total_sales is state_office_sales grouped and summed on index level 0 (the leftmost level).

In:   state_total_sales = state_office_sales.groupby(level=0).sum()
      state_total_sales

Out: 
       sales
state   
AZ     2448009
CA     2832270
CO     1495486
WA     595859

Because the two dataframes share an index name and a column name, pandas will find the appropriate locations through the shared index, like:

In:   state_office_sales / state_total_sales

Out:  

                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          0.288022
        3          0.322169
        5          0.389809
CO      1          0.206684
        3          0.357891
        5          0.435425
WA      2          0.321689
        4          0.346325
        6          0.331986

To illustrate this even better, here is a partial total with an XX that has no equivalent. Pandas will match the location based on index and column names; where there is no overlap, the result is NaN (and the unmatched XX row does not appear):

In:   partial_total = pd.DataFrame(
                      data   =  {'sales' : [2448009, 595859, 99999]},
                      index  =             ['AZ',    'WA',   'XX' ]
                      )
      partial_total.index.name = 'state'


Out:  
         sales
state
AZ       2448009
WA       595859
XX       99999

In:   state_office_sales / partial_total

Out: 
                   sales
state   office_id   
AZ      2          0.448640
        4          0.125865
        6          0.425496
CA      1          NaN
        3          NaN
        5          NaN
CO      1          NaN
        3          NaN
        5          NaN
WA      2          0.321689
        4          0.346325
        6          0.331986

This becomes very clear when there are no shared indexes or columns. Here missing_index_totals is equal to state_total_sales except that it has no index name.

In:   missing_index_totals = state_total_sales.rename_axis("")
      missing_index_totals

Out:  
       sales
AZ     2448009
CA     2832270
CO     1495486
WA     595859

In:   state_office_sales / missing_index_totals

Out:  ValueError: cannot join with no overlapping index names
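The name-based alignment described above can be reproduced on a tiny hand-made pair of frames. The values and the variable names totals and unnamed are illustrative assumptions; the sketch clears the index name with rename_axis(None) rather than the empty string.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("AZ", 2), ("AZ", 4), ("WA", 2)], names=["state", "office_id"]
)
state_office_sales = pd.DataFrame({"sales": [100, 300, 50]}, index=idx)

totals = pd.DataFrame(
    {"sales": [400, 50]}, index=pd.Index(["AZ", "WA"], name="state")
)
unnamed = totals.rename_axis(None)  # same values, but the index has no name

# Works: both frames share the index name "state" and the column "sales".
ratio = state_office_sales / totals
print(ratio)

# Fails: nothing tells pandas which level the unnamed index corresponds to.
try:
    state_office_sales / unnamed
    failed = False
except ValueError as exc:
    failed = True
    print(exc)
```

Restoring the name, for example with unnamed.rename_axis("state"), should make the division align again.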

One-line solution:

df.join(
    df.groupby('state').agg(state_total=('sales', 'sum')),
    on='state'
).eval('sales / state_total')

This returns a Series of per-office ratios, which can be used on its own or assigned to the original DataFrame.
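Assigning the result back might look like the following sketch. The sample values and the column name sales_pct are assumptions for illustration; because join keeps the original row index, the Series lines up with df row by row.

```python
import pandas as pd

# Illustrative data: CA totals 400, WA totals 200.
df = pd.DataFrame({
    "state": ["CA", "CA", "WA", "WA"],
    "office_id": [1, 2, 1, 2],
    "sales": [100, 300, 50, 150],
})

ratio = df.join(
    df.groupby("state").agg(state_total=("sales", "sum")),
    on="state",
).eval("sales / state_total")

df["sales_pct"] = 100 * ratio  # hypothetical column name
print(df)
```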

df.groupby('state').office_id.value_counts(normalize = True)

I used the value_counts method, but it returns the percentages as fractions like 0.70 and 0.30, not like 70 and 30.
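Scaling the result by 100 gives figures in the 0 to 100 range. Note that value_counts counts rows per office rather than summing sales, so this sketch only matches the question when each row should carry equal weight. The data below is an illustrative assumption:

```python
import pandas as pd

# Illustrative data: office 1 appears in 3 of CA's 4 rows.
df = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "WA", "WA"],
    "office_id": [1, 1, 1, 2, 1, 2],
})

pct = df.groupby("state")["office_id"].value_counts(normalize=True).mul(100)
print(pct)
```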
