
Pandas: shifting columns in grouped dataframe if NaN

I have a grouped dataframe like so:

          │          product          │          category           
          │   spot1   spot2    spot3  │   spot1    spot2    spot3 
──────────┼───────────────────────────┼─────────────────────────────
 basket 1 │   NaN     apple    banana │    NaN     fruits   fruits
 basket 2 │   almond  carrot   NaN    │    nuts    veggies  NaN

One row represents a "basket" containing different food products (vegetables, fruits, nuts).

Each basket has 3 spots, each of which either contains a food product or is empty (= NaN).

I would like the first column of the product group to be as populated as possible. That means if there is a NaN value in the first column of the product group and some value in the 2nd or n-th column, it should shift to the left for each group.

The groups are related: in the example above, a basket's spot1 in the product group and spot1 in the category group belong together. Every data combination must have a value for product. If product is NaN, then all the related items will be NaN as well.

The output should look something like:

          │          product          │          category           
          │   spot1   spot2    spot3  │   spot1    spot2    spot3 
──────────┼───────────────────────────┼─────────────────────────────
 basket 1 │   apple   banana   NaN    │    fruits   fruits   NaN  <-- this row shifted to left to "fill" first spot of product group
 basket 2 │   almond  carrot   NaN    │    nuts    veggies  NaN

jezrael's answer here was a good starting point for me:

# for each row, drop the NaNs and build a new Series; these become the rows of the final df
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
# the result may have fewer columns than the original df, so reindex to the original count
df1 = df1.reindex(columns=range(len(df.columns)))
# assign the original column names
df1.columns = df.columns
print(df1)

However, this solution ignores grouping. I only want values to shift left based on the specific group product.


edit / minimal reproducible example

Please use this code to get to the "starting point" of the problem. The way I get to this point in my production code is more complex, but this should do fine.

# Import pandas and NumPy
import pandas as pd
import numpy as np
  
# initialize list of lists; np.nan (not the string 'NaN') marks an empty spot
data = [[1, np.nan, np.nan], [1, 'apple', 'fruits'], [1, 'banana', 'fruits'], [2, 'almond', 'nuts'], [2, 'carrot', 'veggies']]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['basket','product', 'category'])
  
# print dataframe.
df

dfg = df.groupby(['basket', df.groupby(['basket']).cumcount() + 1]).first().unstack().reset_index()

print(dfg)
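
If the long-format frame is the real starting point, one possible shortcut is to drop the rows without a product before numbering the spots, so the left shift never has to happen. This is only a sketch and not part of the original post: the names long_df, filled, spot and wide are illustrative, the rows are listed almond-first so that basket 2 matches the table above, and three spots per basket are assumed.

```python
import numpy as np
import pandas as pd

# Illustrative long-format data: one row per (basket, spot), NaN for an empty spot.
long_df = pd.DataFrame(
    [[1, np.nan, np.nan], [1, "apple", "fruits"], [1, "banana", "fruits"],
     [2, "almond", "nuts"], [2, "carrot", "veggies"]],
    columns=["basket", "product", "category"],
)

# Drop empty spots first, so cumcount only numbers the filled ones.
filled = long_df.dropna(subset=["product"])
spot = "spot" + (filled.groupby("basket").cumcount() + 1).astype(str)

# Pivot to wide format: one row per basket, (group, spot) column pairs.
wide = filled.set_index(["basket", spot]).unstack()

# Restore the full set of spots (assumed to be 3); missing ones become NaN.
spots = [f"spot{i}" for i in range(1, 4)]
wide = wide.reindex(
    columns=pd.MultiIndex.from_product([["product", "category"], spots])
)
print(wide)
```

Because the product and category values travel together in each long-format row, they stay aligned after the pivot.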

I trust there is an easier way to accomplish this, but the following should work.

import pandas as pd
import numpy as np

data = {('product', 'spot1'): {'basket 1': np.nan, 'basket 2': 'almond'}, 
        ('product', 'spot2'): {'basket 1': 'apple', 'basket 2': 'carrot'}, 
        ('product', 'spot3'): {'basket 1': 'banana', 'basket 2': np.nan}, 
        ('category', 'spot1'): {'basket 1': np.nan, 'basket 2': 'nuts'}, 
        ('category', 'spot2'): {'basket 1': 'fruits', 'basket 2': 'veggies'}, 
        ('category', 'spot3'): {'basket 1': 'fruits', 'basket 2': np.nan}}

df = pd.DataFrame(data)

out = df.unstack().dropna()

out.index = pd.MultiIndex.from_arrays([
    out.index.get_level_values(0),
    ('spot' + out.groupby(level=[2,0]).cumcount().add(1).astype(str).to_numpy()),
    out.index.get_level_values(2)])

out = out.reset_index(drop=False).pivot(index='level_2', 
                                        columns=['level_0','level_1'],
                                        values=0)\
    .reindex(df.columns, axis='columns').rename_axis(None, axis=0)

print(out)

         product               category               
           spot1   spot2 spot3    spot1    spot2 spot3
basket 1   apple  banana   NaN   fruits   fruits   NaN
basket 2  almond  carrot   NaN     nuts  veggies   NaN

Explanation

  • First, we use df.unstack() with Series.dropna to get a Series with a MultiIndex that consists of col level 0, col level 1, index. I.e.:
out = df.unstack().dropna()
print(out.head(4))

product   spot1  basket 2    almond
          spot2  basket 1     apple
                 basket 2    carrot
          spot3  basket 1    banana
  • Next, we use Series.groupby on levels 2 and 0 (i.e. the original index and col level 0), and we use cumcount to get consecutive numbers for the items in each group. We add one, turn the result into a string with add(1).astype(str), and prefix it with "spot". I.e. we are doing:
print(('spot' + df.unstack().dropna().groupby(level=[2,0])\
       .cumcount().add(1).astype(str).to_numpy()))

['spot1' 'spot1' 'spot2' 'spot2' 'spot1' 'spot1' 'spot2' 'spot2']
  • We use this result inside pd.MultiIndex.from_arrays to overwrite the MultiIndex (specifically level 1) with a new index. I.e. we now have:
print(out.head(4))

product  spot1  basket 2    almond
                basket 1     apple
         spot2  basket 2    carrot
                basket 1    banana
  • Now, finally, we can reset the index and use df.pivot to change the shape of out so that it matches the shape of the original df. Chaining df.reindex on the columns both restores the original column order and adds any missing columns (e.g. spot3 for both values in col level 0), which are automatically filled with NaN.
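
For comparison, here is a rough NumPy-based sketch of the same left shift. It is not the method used above: it sorts each row's spots by whether product is NaN and applies that same ordering to every group, which keeps product and category aligned. The helper name shift_left_by_product is hypothetical, and the sketch assumes that every top-level group has the same spots in the same order.

```python
import numpy as np
import pandas as pd

def shift_left_by_product(df, key="product"):
    """Hypothetical helper: move non-NaN `key` spots to the left, per row."""
    # True where the product spot is empty; shape (n_rows, n_spots).
    empty = df[key].isna().to_numpy()
    # A stable sort puts filled spots first and keeps their original order.
    order = np.argsort(empty, axis=1, kind="stable")

    groups = list(df.columns.get_level_values(0).unique())
    pieces = []
    for group in groups:
        block = df[group]  # assumed to have the same spot columns as df[key]
        shifted = np.take_along_axis(block.to_numpy(), order, axis=1)
        pieces.append(
            pd.DataFrame(shifted, index=block.index, columns=block.columns)
        )
    # Reassemble the two-level columns in their original group order.
    return pd.concat(pieces, axis=1, keys=groups)

data = {('product', 'spot1'): {'basket 1': np.nan, 'basket 2': 'almond'},
        ('product', 'spot2'): {'basket 1': 'apple', 'basket 2': 'carrot'},
        ('product', 'spot3'): {'basket 1': 'banana', 'basket 2': np.nan},
        ('category', 'spot1'): {'basket 1': np.nan, 'basket 2': 'nuts'},
        ('category', 'spot2'): {'basket 1': 'fruits', 'basket 2': 'veggies'},
        ('category', 'spot3'): {'basket 1': 'fruits', 'basket 2': np.nan}}
df = pd.DataFrame(data)

result = shift_left_by_product(df)
print(result)
```

Since the ordering comes only from the product group, a category value is never moved independently of its product, which is the constraint described in the question.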
