Pandas pivot table using custom conditions on the dataframe

I want to build a pivot table based on custom conditions in the dataframe.

The dataframe looks like this:

>>> import pandas as pd
>>> df = pd.DataFrame({"Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
                       "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
                       "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
                       "Population": [100, 150, 50, 200, 170, 390, 80, 100],
                       "Pincode": ["X1", "Y1", "Z1", "P1", "Q1", "R1", "S1", "X2"]})
>>> df
  Area City Condition  Population Pincode
0    A    X      Good         100      X1
1    A    Y       Bad         150      Y1
2    B    Z      Good          50      Z1
3    A    P      Good         200      P1
4    C    Q      Good         170      Q1
5    A    R       Bad         390      R1
6    D    S      Good          80      S1
7    A    X      Good         100      X2

Now I want to pivot the dataframe df so that, for each area, I can see the count of unique cities, the corresponding count of "Good" cities, and the total population of the area.

I expect an output like this:

Area  city_count  good_city_count  Population
A              4                2         940
B              1                1          50
C              1                1         170
D              1                1          80
All            7                5        1240

I can pass a dictionary to the aggfunc parameter, but this doesn't give me the city count broken out for the good cities.

>>> city_count = df.pivot_table(index=["Area"],
                                values=["City", "Population"],
                                aggfunc={"City": lambda x: len(x.unique()),
                                         "Population": "sum"},
                                margins=True)

    Area    City    Population
0   A       4       940
1   B       1       50
2   C       1       170
3   D       1       80
4   All     7       1240

I can merge two different pivot tables, one with the count of cities and the other with the population, but this does not scale to a large dataset with a big aggfunc dictionary.
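For reference, the merge workaround described in the question could look roughly like the sketch below. It is only an illustration under assumptions: the second pivot is taken over the rows filtered to "Good", and the names city_pivot, good_pivot and merged are made up here, not taken from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
})

# First pivot: unique cities and total population per area.
city_pivot = df.pivot_table(index=["Area"],
                            values=["City", "Population"],
                            aggfunc={"City": lambda x: x.nunique(),
                                     "Population": "sum"},
                            margins=True)

# Second pivot: unique cities again, restricted to the "Good" rows.
good_pivot = df[df["Condition"] == "Good"].pivot_table(
    index=["Area"],
    values=["City"],
    aggfunc={"City": lambda x: x.nunique()},
    margins=True)

# Join the two results on the shared Area index (hypothetical column names).
merged = (city_pivot.rename(columns={"City": "city_count"})
                    .join(good_pivot.rename(columns={"City": "good_city_count"})))
print(merged)
```

Every extra condition needs another filtered pivot and another join, which is why this gets unwieldy as the aggfunc dictionary grows.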

Add the columns parameter together with fill_value; you can also use nunique as the aggregation function:

city_count = df.pivot_table(index = "Area", 
                            values = "City", 
                            columns='Condition', 
                            aggfunc = lambda x : x.nunique(), 
                            margins = True,
                            fill_value=0)
print (city_count)
Condition  Bad  Good  All
Area                     
A            2     2    4
B            0     1    1
C            0     1    1
D            0     1    1
All          2     5    7

Finally, if you need to convert the index to a column and change the column names:

city_count = city_count.add_suffix('_count').reset_index().rename_axis(None, axis=1)
print (city_count)
  Area  Bad_count  Good_count  All_count
0    A          2           2          4
1    B          0           1          1
2    C          0           1          1
3    D          0           1          1
4  All          2           5          7

EDIT:

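import numpy as np

d = {'City': 'nunique', 'Population': 'sum', 'good_city_count': 'nunique'}
d1 = {'City': 'city_count'}

mask = df["Condition"] == 'Good'
# good_city_count holds the city name on "Good" rows and NaN elsewhere,
# so nunique counts only the good cities
df = (df.assign(good_city_count = lambda x: np.where(mask, x['City'], np.nan))
       .groupby('Area')
       .agg(d)
       .rename(columns=d1))

# DataFrame.append was removed in pandas 2.0; add the totals row with pd.concat
total = df.sum().rename('All').to_frame().T
df = pd.concat([df, total]).rename_axis('Area').reset_index()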

print (df)
  Area  city_count  Population  good_city_count
0    A           4         940                2
1    B           1          50                1
2    C           1         170                1
3    D           1          80                1
4  All           7        1240                5

Another method, without using pivot_table: use np.where with groupby + agg:

import numpy as np

# Starting from the original df: keep the city name on "Good" rows, NaN elsewhere
df['Condition'] = np.where(df['Condition']=='Good', df['City'], np.nan)
df = (df.groupby('Area')
        .agg({'City':'nunique', 'Condition':'nunique', 'Population':'sum'})
        .rename(columns={'City':'city_count', 'Condition':'good_city_count'}))
df.loc['All',:] = df.sum()  # totals row
df = df.astype(int).reset_index()

print(df)
  Area  city_count  good_city_count  Population
0    A           4                2         940
1    B           1                1          50
2    C           1                1         170
3    D           1                1          80
4  All           7                5        1240
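The same idea can also be written with named aggregation, which leaves the original Condition column untouched and sets the output column names in one step. This is a sketch, not part of the original answers: the helper column good_city and the variables per_area, total and result are hypothetical names. It also computes the "All" row from the full frame, since summing per-area unique counts only matches the overall unique count when no city appears in more than one area (true for this sample data).

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
})

good = df["Condition"] == "Good"

# Keep the city name only on "Good" rows; the other rows become missing
# values, which nunique ignores by default.
per_area = (df.assign(good_city=df["City"].where(good))
              .groupby("Area")
              .agg(city_count=("City", "nunique"),
                   good_city_count=("good_city", "nunique"),
                   Population=("Population", "sum")))

# Totals computed over the whole frame instead of summing the per-area counts.
total = pd.DataFrame({"city_count": [df["City"].nunique()],
                      "good_city_count": [df.loc[good, "City"].nunique()],
                      "Population": [df["Population"].sum()]},
                     index=["All"])

result = pd.concat([per_area, total]).rename_axis("Area").reset_index()
print(result)
```

Adding another conditional count then takes one more assign column and one more keyword in agg, rather than a separate pivot and merge.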
