How to create new columns using groupby in Pandas?

Question

I have a pandas DataFrame like this:

>>> import pandas as pd
>>> data = {
...     'hotel_code': [1, 1, 1, 1, 1],
...     'feed': [1, 1, 1, 1, 2],
...     'price_euro': [100, 200, 250, 120, 130],
...     'client_nationality': ['fr', 'us', 'ru,de', 'gb', 'cn,us,br,il,fr,gb,de,ie,pk,pl']
... }
>>> df = pd.DataFrame(data)
>>> df
   hotel_code  feed  price_euro             client_nationality
0           1     1         100                             fr
1           1     1         200                             us
2           1     1         250                          ru,de
3           1     1         120                             gb
4           1     2         130  cn,us,br,il,fr,gb,de,ie,pk,pl

And here is the expected output:

>>> import numpy as np
>>> data = {
...     'hotel_code': [1, 1],
...     'feed': [1, 2],
...     'cluster1': ['fr', 'cn,us,br,il,fr,gb,de,ie,pk,pl'],
...     'cluster2': ['us', np.nan],
...     'cluster3': ['ru,de', np.nan],
...     'cluster4': ['gb', np.nan],
... }
>>> df = pd.DataFrame(data)
>>> df
   hotel_code  feed                       cluster1 cluster2 cluster3 cluster4
0           1     1                             fr       us    ru,de       gb
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl      NaN      NaN      NaN

I want to create cluster columns for each unique combination of hotel_code and feed, but I have no idea how. The number of clusters can vary. Any ideas? Thanks in advance.

Answer 1

Use GroupBy.cumcount to build a per-group counter, create a MultiIndex from hotel_code, feed and the counter Series, and reshape with Series.unstack. Finally, rename the columns and call DataFrame.reset_index to turn the MultiIndex back into columns:

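# position of each row within its (hotel_code, feed) group: 0, 1, 2, ...
g = df.groupby(["hotel_code", "feed"]).cumcount()

# the counter becomes the innermost index level, which unstack() turns into columns
df1 = (df.set_index(["hotel_code", "feed", g])['client_nationality']
         .unstack()
         .rename(columns=lambda x: f'cluster_{x+1}')
         .reset_index())
print(df1)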
   hotel_code  feed                      cluster_1 cluster_2 cluster_3  \
0           1     1                             fr        us     ru,de   
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl       NaN       NaN   

  cluster_4  
0        gb  
1       NaN
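
To make the counter step easier to follow, here is a minimal sketch that only builds the question's frame (without price_euro, which is not needed here) and prints what cumcount returns. It is an illustration of the intermediate value, not an additional step in the solution.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# cumcount numbers the rows inside each (hotel_code, feed) group, starting at 0
g = df.groupby(["hotel_code", "feed"]).cumcount()
print(g.tolist())  # [0, 1, 2, 3, 0]
```

The four rows with feed 1 get 0 to 3 and the single row with feed 2 gets 0, so after unstack the counter values become the four cluster columns.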

Answer 2

You could create a new DataFrame with your clusters:

clusters = pd.DataFrame(
    df.groupby(["hotel_code", "feed"])
    .agg(list)
    .reset_index()
    .client_nationality.tolist()
)
clusters.columns = [f"cluster_{i}" for i in range(1, clusters.shape[1] + 1)]

Then,

pd.concat(
    [
        df.drop(["price_euro", "client_nationality"], axis=1)
        .drop_duplicates(["hotel_code", "feed"])
        .reset_index(drop=True),
        clusters,
    ],
    axis=1,
)

would return

   hotel_code  feed                      cluster_1 cluster_2 cluster_3 cluster_4
0           1     1                             fr        us     ru,de        gb
1           1     2  cn,us,br,il,fr,gb,de,ie,pk,pl      None      None      None

Answer 3

Group by hotel_code and feed, then aggregate client_nationality by joining the values into one string, and finally split and expand that string into columns.

Rename the resulting columns to the required cluster_N names.

# parentheses are needed so the method chain can span several lines
(df.groupby(['hotel_code', 'feed'])['client_nationality']
   .agg(' '.join)
   .str.split(' ', expand=True)
   .rename(columns=lambda x: f'cluster_{x+1}'))

Output

                                     cluster_1 cluster_2 cluster_3 cluster_4
hotel_code feed                                                             
1          1                                fr        us     ru,de        gb
           2     cn,us,br,il,fr,gb,de,ie,pk,pl      None      None      None
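
If hotel_code and feed should end up as ordinary columns, as in the expected output in the question, a reset_index at the end of the chain would do that. The following is a minimal sketch under assumptions, not part of the original answer: the name SEP and the "|" separator are introduced here for illustration, chosen so that the split is not confused by values that happen to contain a space.

```python
import pandas as pd

df = pd.DataFrame({
    "hotel_code": [1, 1, 1, 1, 1],
    "feed": [1, 1, 1, 1, 2],
    "client_nationality": ["fr", "us", "ru,de", "gb",
                           "cn,us,br,il,fr,gb,de,ie,pk,pl"],
})

# Assumed separator: it must not occur in any client_nationality value
SEP = "|"

out = (
    df.groupby(["hotel_code", "feed"])["client_nationality"]
      .agg(SEP.join)                    # one joined string per group
      .str.split(SEP, expand=True)      # one column per original row
      .rename(columns=lambda x: f"cluster_{x + 1}")
      .reset_index()                    # group keys back to columns
)
print(out)
```

Groups with fewer rows than the largest group are padded with missing values in the trailing cluster columns.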

Answer 1: score 5, accepted, 2020-01-08 13:42:34

Answer 2: score 3, 2020-01-08 13:34:41

Answer 3: score 3, 2020-01-08 13:52:47