How can I create new columns with groupby in pandas?
I have a pandas DataFrame like this:
>>> import pandas as pd
>>> data = {
'hotel_code': [1, 1, 1, 1, 1],
'feed': [1, 1, 1, 1, 2],
'price_euro': [100, 200, 250, 120, 130],
'client_nationality': ['fr', 'us', 'ru,de', 'gb', 'cn,us,br,il,fr,gb,de,ie,pk,pl']
}
>>> df = pd.DataFrame(data)
>>> df
hotel_code feed price_euro client_nationality
0 1 1 100 fr
1 1 1 200 us
2 1 1 250 ru,de
3 1 1 120 gb
4 1 2 130 cn,us,br,il,fr,gb,de,ie,pk,pl
And here is the expected output:
>>> import numpy as np
>>> data = {
'hotel_code': [1, 1],
'feed': [1, 2],
'cluster1': ['fr', 'cn,us,br,il,fr,gb,de,ie,pk,pl'],
'cluster2': ['us', np.nan],
'cluster3': ['ru,de', np.nan],
'cluster4': ['gb', np.nan],
}
>>> df = pd.DataFrame(data)
>>> df
hotel_code feed cluster1 cluster2 cluster3 cluster4
0 1 1 fr us ru,de gb
1 1 2 cn,us,br,il,fr,gb,de,ie,pk,pl NaN NaN NaN
I want to create cluster columns for each unique combination of hotel_code and feed, but I have no idea how. The number of clusters can vary. Any ideas? Thanks in advance.
Use GroupBy.cumcount to get a counter per group, create a MultiIndex from hotel_code, feed and the counter Series, and reshape with Series.unstack. Finally, rename the columns and use DataFrame.reset_index to convert the MultiIndex to columns:
g = df.groupby(["hotel_code", "feed"]).cumcount()
df1 = (df.set_index(["hotel_code", "feed", g])['client_nationality']
.unstack()
.rename(columns = lambda x: f'cluster_{x+1}')
.reset_index())
print (df1)
hotel_code feed cluster_1 cluster_2 cluster_3 \
0 1 1 fr us ru,de
1 1 2 cn,us,br,il,fr,gb,de,ie,pk,pl NaN NaN
cluster_4
0 gb
1 NaN
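To make the role of the counter clearer, here is a self-contained sketch of the same steps with the intermediate results kept in separate variables. The names indexed and wide are only illustrative and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "price_euro": [100, 200, 250, 120, 130],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# Position of each row inside its (hotel_code, feed) group.
g = df.groupby(["hotel_code", "feed"]).cumcount()

# The counter becomes the innermost index level ...
indexed = df.set_index(["hotel_code", "feed", g])["client_nationality"]

# ... and unstack() pivots that innermost level into columns 0, 1, 2, ...
wide = indexed.unstack()

# Rename 0-based column labels to cluster_1, cluster_2, ... and
# move hotel_code and feed back into ordinary columns.
df1 = wide.rename(columns=lambda x: f"cluster_{x + 1}").reset_index()
print(df1)
```

Groups with fewer rows than the largest group are padded with missing values, which is why the number of cluster columns follows the data rather than being fixed in advance.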
You could create a new DataFrame with your clusters:
clusters = pd.DataFrame(
df.groupby(["hotel_code", "feed"])
.agg(list)
.reset_index()
.client_nationality.tolist()
)
clusters.columns = [f"cluster_{i}" for i in range(1, clusters.shape[1] + 1)]
Then,
pd.concat(
[
df.drop(["price_euro", "client_nationality"], axis=1)
.drop_duplicates(["hotel_code", "feed"])
.reset_index(drop=True),
clusters,
],
axis=1,
)
would return
hotel_code feed cluster_1 cluster_2 cluster_3 cluster_4
0 1 1 fr us ru,de gb
1 1 2 cn,us,br,il,fr,gb,de,ie,pk,pl None None None
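One thing to watch with this approach: groupby sorts the group keys by default, while drop_duplicates keeps rows in order of first appearance, so the two halves of the concat line up only if the data is already ordered by hotel_code and feed. A possible variation, sketched below under that concern, keeps the group keys attached to the lists so no positional alignment is needed. The variable names lists and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "price_euro": [100, 200, 250, 120, 130],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# One list of nationalities per (hotel_code, feed) group.
lists = df.groupby(["hotel_code", "feed"])["client_nationality"].agg(list)

# Expand the lists into columns, reusing the group keys as the index
# so each row stays attached to its own hotel_code and feed.
clusters = pd.DataFrame(lists.tolist(), index=lists.index)
clusters.columns = [f"cluster_{i}" for i in range(1, clusters.shape[1] + 1)]

# Turn the group keys back into ordinary columns.
result = clusters.reset_index()
print(result)
```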
Group by hotel_code and feed, then aggregate client_nationality by joining the values, and finally split and expand. Rename the columns with the required prefix.
# Parentheses are needed so the chained calls can span several lines.
(df.groupby(['hotel_code', 'feed'])['client_nationality']
   .agg(' '.join)
   .str.split(' ', expand=True)
   .rename(columns = lambda x: f'cluster_{x+1}'))
Output
cluster_1 cluster_2 cluster_3 cluster_4
hotel_code feed
1 1 fr us ru,de gb
2 cn,us,br,il,fr,gb,de,ie,pk,pl None None None
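Note that joining and splitting on a space only round-trips if no client_nationality value contains a space itself. As a hedged sketch, the same idea can use a separator assumed not to occur in the data (";" here is an assumption about the values, not something the question states), followed by reset_index to get hotel_code and feed back as columns:

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "price_euro": [100, 200, 250, 120, 130],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# Assumption: ";" never appears inside a client_nationality value.
sep = ";"

# Join each group's values into one string per (hotel_code, feed).
joined = df.groupby(["hotel_code", "feed"])["client_nationality"].agg(sep.join)

# Split the string back into one column per original row of the group.
out = (joined.str.split(sep, expand=True)
             .rename(columns=lambda x: f"cluster_{x + 1}")
             .reset_index())
print(out)
```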