Pandas pivot table using custom conditions on the dataframe
I want to make a pivot table based on custom conditions in the dataframe.
The dataframe looks like this:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
...                    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
...                    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
...                    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
...                    "Pincode": ["X1", "Y1", "Z1", "P1", "Q1", "R1", "S1", "X2"]})
>>> df
  Area City Condition  Population Pincode
0    A    X      Good         100      X1
1    A    Y       Bad         150      Y1
2    B    Z      Good          50      Z1
3    A    P      Good         200      P1
4    C    Q      Good         170      Q1
5    A    R       Bad         390      R1
6    D    S      Good          80      S1
7    A    X      Good         100      X2
Now I want to pivot the dataframe df so that, for each area, I can see the unique count of cities, the corresponding count of "Good" cities, and the population of the area.
I expect an output like this:
Area  city_count  good_city_count  Population
A              4                2         940
B              1                1          50
C              1                1         170
D              1                1          80
All            7                5        1240
I can pass a dictionary to the aggfunc parameter, but this does not split the city count into good cities and the rest.
>>> city_count = df.pivot_table(index=["Area"],
values=["City", "Population"],
aggfunc={"City": lambda x: len(x.unique()),
"Population": "sum"},
margins=True)
  Area  City  Population
0    A     4         940
1    B     1          50
2    C     1         170
3    D     1          80
4  All     7        1240
I can merge two different pivot tables, one with the count of cities and the other with the population, but this is not scalable for a large dataset with a big aggfunc dictionary.
Add the columns parameter together with fill_value. You can also use nunique as the aggregation function:
city_count = df.pivot_table(index = "Area",
values = "City",
columns='Condition',
aggfunc = lambda x : x.nunique(),
margins = True,
fill_value=0)
print (city_count)
Condition  Bad  Good  All
Area
A            2     2    4
B            0     1    1
C            0     1    1
D            0     1    1
All          2     5    7
Finally, if you need to convert the index to a column and change the column names:
city_count = city_count.add_suffix('_count').reset_index().rename_axis(None, axis=1)
print (city_count)
  Area  Bad_count  Good_count  All_count
0    A          2           2          4
1    B          0           1          1
2    C          0           1          1
3    D          0           1          1
4  All          2           5          7
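The pivot above gives the city counts but not the population from the expected output. One way to get there is to join a second pivot table holding the population totals. This is the two-table approach the question already describes, so treat it as a minimal sketch for small aggregation dictionaries rather than a scalable recipe. The names counts, pop and result are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
})

# Unique city counts per area, split by Condition, with an "All" margin
counts = df.pivot_table(index="Area", columns="Condition", values="City",
                        aggfunc=lambda x: x.nunique(),
                        margins=True, fill_value=0)

# Population totals per area, with a matching "All" margin row
pop = df.pivot_table(index="Area", values="Population",
                     aggfunc="sum", margins=True)

# Keep only the total and "Good" counts, rename them, then join on the index
result = (counts[["All", "Good"]]
          .rename(columns={"All": "city_count", "Good": "good_city_count"})
          .join(pop)
          .rename_axis(columns=None)
          .reset_index())
print(result)
```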
EDIT:
# Requires: import numpy as np
d = {'City':'nunique','Population':'sum', 'good_city_count':'nunique'}
d1 = {'City':'city_count'}
mask = df["Condition"] == 'Good'
# Keep the city name only for "Good" rows; nunique ignores the NaN placeholders
df = (df.assign(good_city_count = lambda x: np.where(mask, x['City'], np.nan))
        .groupby('Area')
        .agg(d)
        .rename(columns=d1))
# DataFrame.append was removed in pandas 2.0, so add the totals row with .loc
df.loc['All'] = df.sum()
df = df.astype(int).reset_index()
print (df)
  Area  city_count  Population  good_city_count
0    A           4         940                2
1    B           1          50                1
2    C           1         170                1
3    D           1          80                1
4  All           7        1240                5
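The masking step works because Series.nunique skips missing values by default, so cities replaced by NaN in the "Bad" rows never reach the count. A small illustration on a made-up series (the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# City names for "Good" rows, NaN where the condition was "Bad"
masked = pd.Series(["X", np.nan, "P", "X", np.nan], dtype=object)

with_default = masked.nunique()             # NaN is ignored
with_nan = masked.nunique(dropna=False)     # NaN counted as one more value
print(with_default, with_nan)
```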
Another method, without using pivot_table: use np.where with groupby + agg:
df['Condition'] = np.where(df['Condition']=='Good', df['City'], np.nan)
df = df.groupby('Area').agg({'City':'nunique', 'Condition':'nunique', 'Population':'sum'})\
.rename(columns={'City':'city_count', 'Condition':'good_city_count'})
df.loc['All',:] = df.sum()
df = df.astype(int).reset_index()
print(df)
  Area  city_count  good_city_count  Population
0    A           4                2         940
1    B           1                1          50
2    C           1                1         170
3    D           1                1          80
4  All           7                5        1240
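On pandas 0.25 or later, named aggregation may be a tidier way to write the same idea, since the output column names are set inside agg and no separate rename is needed. The following is a minimal sketch on the sample data from the question. The helper column good_city and the variable out are names made up for illustration, and the original Condition column is left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["A", "A", "B", "A", "C", "A", "D", "A"],
    "City": ["X", "Y", "Z", "P", "Q", "R", "S", "X"],
    "Condition": ["Good", "Bad", "Good", "Good", "Good", "Bad", "Good", "Good"],
    "Population": [100, 150, 50, 200, 170, 390, 80, 100],
})

# City name where the condition is "Good", missing otherwise
good_city = df["City"].where(df["Condition"] == "Good")

out = (df.assign(good_city=good_city)
         .groupby("Area")
         .agg(city_count=("City", "nunique"),
              good_city_count=("good_city", "nunique"),
              Population=("Population", "sum")))

# Append the totals row, then restore integer dtypes and the Area column
out.loc["All"] = out.sum()
out = out.astype(int).reset_index()
print(out)
```

Each additional metric then needs only one more keyword argument in agg, which keeps the approach manageable as the list of aggregations grows.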