Pandas: shifting columns in grouped dataframe if NaN
I have a grouped dataframe like so:
          │         product           │         category
          │ spot1    spot2     spot3  │ spot1    spot2      spot3
──────────┼───────────────────────────┼─────────────────────────────
basket 1  │ NaN      apple     banana │ NaN      fruits     fruits
basket 2  │ almond   carrot    NaN    │ nuts     veggies    NaN
One row represents a "basket" containing different food products (vegetables, fruits, nuts).
Each basket has 3 spots, each of which either contains a food product or is empty (=NaN).
I would like the first column of group product to be as populated as possible. That means if there is a NaN value in the first column of the product group and some value in the 2nd or n-th column, that value should shift to the left, and the same shift should apply to each group.
Categories are related: in the example above, a basket's spot1 of group product and spot1 of group category belong together. Every data combination must have a value for product. If product is NaN then all the related items will be NaN as well.
The output should look something like:
          │         product           │         category
          │ spot1    spot2     spot3  │ spot1    spot2      spot3
──────────┼───────────────────────────┼─────────────────────────────
basket 1  │ apple    banana    NaN    │ fruits   fruits     NaN    <-- this row shifted to the left to "fill" the first spot of the product group
basket 2  │ almond   carrot    NaN    │ nuts     veggies    NaN
jezrael's answer here was a good starting point for me:
# for each row, remove NaNs and create a new Series (these become the rows of the final df)
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
# reindex in case dropping NaNs left fewer columns than the original df
df1 = df1.reindex(columns=range(len(df.columns)))
# assign the original column names
df1.columns = df.columns
print(df1)
However, this solution ignores grouping. I only want values to shift left based on the specific group product.
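To make the problem concrete, here is a minimal sketch that applies the row-wise dropna approach to the grouped frame from the example. The frame is rebuilt by hand with "basket 1" / "basket 2" as index labels, which is an assumption about the layout; the point is only that values from the category group spill over into the product group.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example frame (index labels are an assumption)
df = pd.DataFrame({
    ('product', 'spot1'): {'basket 1': np.nan, 'basket 2': 'almond'},
    ('product', 'spot2'): {'basket 1': 'apple', 'basket 2': 'carrot'},
    ('product', 'spot3'): {'basket 1': 'banana', 'basket 2': np.nan},
    ('category', 'spot1'): {'basket 1': np.nan, 'basket 2': 'nuts'},
    ('category', 'spot2'): {'basket 1': 'fruits', 'basket 2': 'veggies'},
    ('category', 'spot3'): {'basket 1': 'fruits', 'basket 2': np.nan},
})

# Row-wise dropna across ALL columns, ignoring the product/category grouping
df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df1 = df1.reindex(columns=range(len(df.columns)))
df1.columns = df.columns

# Category values have moved into the product group, which is not wanted
print(df1['product'])
```

For basket 1, the third product spot now holds a category value ('fruits') instead of NaN, which is exactly the cross-group shifting I want to avoid.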
Please use this code to get to the "starting point" of the problem. The way I get to this point in my production code is more complex, but this should do fine.
# Import pandas and numpy
import numpy as np
import pandas as pd
# Initialize list of lists; np.nan (not the string 'NaN') marks an empty spot
data = [[1, np.nan, np.nan], [1, 'apple', 'fruits'], [1, 'banana', 'fruits'], [2, 'carrot', 'veggies'], [2, 'almond', 'nuts']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['basket', 'product', 'category'])
# Print the long-format dataframe
print(df)
# Number the rows within each basket and spread them across columns
dfg = df.groupby(['basket', df.groupby(['basket']).cumcount() + 1]).first().unstack().reset_index()
print(dfg)
I trust there is an easier way to accomplish this, but the following should work.
import pandas as pd
import numpy as np
data = {('product', 'spot1'): {'basket 1': np.nan, 'basket 2': 'almond'},
        ('product', 'spot2'): {'basket 1': 'apple', 'basket 2': 'carrot'},
        ('product', 'spot3'): {'basket 1': 'banana', 'basket 2': np.nan},
        ('category', 'spot1'): {'basket 1': np.nan, 'basket 2': 'nuts'},
        ('category', 'spot2'): {'basket 1': 'fruits', 'basket 2': 'veggies'},
        ('category', 'spot3'): {'basket 1': 'fruits', 'basket 2': np.nan}}
df = pd.DataFrame(data)
out = df.unstack().dropna()
out.index = pd.MultiIndex.from_arrays([
    out.index.get_level_values(0),
    ('spot' + out.groupby(level=[2,0]).cumcount().add(1).astype(str).to_numpy()),
    out.index.get_level_values(2)])
out = out.reset_index(drop=False).pivot(index='level_2',
                                        columns=['level_0','level_1'],
                                        values=0)\
    .reindex(df.columns, axis='columns').rename_axis(None, axis=0)
print(out)
         product                category
           spot1   spot2 spot3     spot1    spot2 spot3
basket 1   apple  banana   NaN    fruits   fruits   NaN
basket 2  almond  carrot   NaN      nuts  veggies   NaN
Explanation
First, we use df.unstack() with Series.dropna to get a Series with a MultiIndex that consists of col level 0, col level 1, index. I.e.:
out = df.unstack().dropna()
print(out.head(4))
product  spot1  basket 2    almond
         spot2  basket 1     apple
                basket 2    carrot
         spot3  basket 1    banana
Next, we use Series.groupby on levels 0,2 (i.e. the original col level 0 and index), and we use cumcount to get consecutive numbers for the items in each group. We add one and turn the result into a string with add(1).astype(str) and prefix "spot". I.e. we are doing:
print(('spot' + df.unstack().dropna().groupby(level=[2,0])\
       .cumcount().add(1).astype(str).to_numpy()))
['spot1' 'spot1' 'spot2' 'spot2' 'spot1' 'spot1' 'spot2' 'spot2']
We use this result inside pd.MultiIndex.from_arrays to overwrite the MultiIndex (specifically level 1) with a new index. I.e. we now have:
print(out.head(4))
product  spot1  basket 2    almond
                basket 1     apple
         spot2  basket 2    carrot
                basket 1    banana
Now, finally, we can reset the index and use df.pivot to change the shape of out so that it matches the shape of the original df. Chaining df.reindex applied to the columns will both reset the order of the columns and add all the missing columns (e.g. spot3 for both values in col level 0), and they will be automatically filled with NaNs.
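As a possible alternative (a sketch, not the method used above), the same result can be reached by computing one column order per row from the product group and applying it to every group. It relies on the rule stated in the question that a NaN product means all related items are NaN too. The names order, pieces and shifted are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    ('product', 'spot1'): {'basket 1': np.nan, 'basket 2': 'almond'},
    ('product', 'spot2'): {'basket 1': 'apple', 'basket 2': 'carrot'},
    ('product', 'spot3'): {'basket 1': 'banana', 'basket 2': np.nan},
    ('category', 'spot1'): {'basket 1': np.nan, 'basket 2': 'nuts'},
    ('category', 'spot2'): {'basket 1': 'fruits', 'basket 2': 'veggies'},
    ('category', 'spot3'): {'basket 1': 'fruits', 'basket 2': np.nan},
})

# Stable argsort of the NaN mask: filled product spots come first,
# keeping their original left-to-right order; NaN spots go last.
order = np.argsort(df['product'].isna().to_numpy(), axis=1, kind='stable')

# Apply the same per-row order to every top-level group so that
# product and category stay aligned spot by spot.
pieces = {
    group: pd.DataFrame(
        np.take_along_axis(df[group].to_numpy(), order, axis=1),
        index=df.index,
        columns=df[group].columns,
    )
    for group in df.columns.get_level_values(0).unique()
}
shifted = pd.concat(pieces, axis=1)
print(shifted)
```

Because the order is derived only from product, this avoids reshaping the frame, but it would misalign the data if a category value could exist without a matching product.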