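How to sum each row's column values into the preceding main row and create a new dataframe with the new and old rows
I have a large DataFrame with different product ids, their corresponding prod_descriptions, and quantities. Some product ids have sub-products (prod_desc2, prod_desc3, and so on) that either have no product id or are not mapped to the main product id (1111, 333); their prod_id value is empty, as shown in the sample DF.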
Sample DF
prod_id prod_description col1 col2 col3 col4 col5
1111 prod_desc1 10 20 30 45 25
prod_desc2 15 17 16 28 nan
prod_desc3 15 17 5 nan nan
2222 prod_desc1 5 10 15 7 10
2223 prod_desc1 15 10 25 10 10
333 prod_desc1 10 15 20 23 25
prod_desc2 25 5 25 10 nan
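I want to roll up the quantities of prod_desc2 and prod_desc3 to the prod_desc1 level and create a new DF that also contains the other prod_ids (2222, 2223), as shown in the desired output. Each product id will then have a single row holding the cumulative sum of its sub-products.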
Desired Output
prod_id prod_description col1 col2 col3 col4 col5
1111 prod_desc1 40 54 51 73 25
2222 prod_desc1 5 10 15 7 10
2223 prod_desc1 15 10 25 10 10
333 prod_desc1 35 20 45 33 25
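Below is the partial code I tried, but I am having trouble summing the column values of the prod id row and the no_prod_id rows and saving them, together with the other prod_ids, in a new dataframe. Please help.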
Empty rows were filled with no_prod_id
prod_id prod_description col1 col2 col3 col4 col5
1111 prod_desc1 10 20 30 45 25
no_prod_id prod_desc2 15 17 16 28 nan
no_prod_id prod_desc3 15 17 5 nan nan
2222 prod_desc1 5 10 15 7 10
2223 prod_desc1 15 10 25 10 10
333 prod_desc1 10 15 20 23 25
no_prod_id prod_desc2 25 5 25 10 nan
null_value_count = []
rolled_up_values = []
for i in df.index:
    if df.iloc[i, 0] == "no_prod_id":  # pick no_prod_id row
        x = df.iloc[i, :]  # save no_prod_id row
        if x.iloc[2:].isnull().all():  # check if every quantity column of the no_prod_id row is null
            null_value_count.append(i)  # save index for later deleting it from DF
        else:
            if df.iloc[i - 1, 0] != "no_prod_id":  # check previous row has main prod id
                y = df.iloc[i - 1, :]  # save main prod id row
                for val in range(2, len(y)):  # get each quantity of main prod id (skip prod_id and prod_description)
                    # sum with the no_prod_id value and save the output in a
                    # list for updating in a new DF
                    rolled_up_values.append(x.iloc[val] + y.iloc[val])
First, forward-fill the prod_id column with ffill:
df['prod_id'] = df['prod_id'].ffill()
print(df)
prod_id prod_description col1 col2 col3 col4 col5
0 1111.0 prod_desc1 10 20 30 45.0 25.0
1 1111.0 prod_desc2 15 17 16 28.0 NaN
2 1111.0 prod_desc3 15 17 5 NaN NaN
3 2222.0 prod_desc1 5 10 15 7.0 10.0
4 2223.0 prod_desc1 15 10 25 10.0 10.0
5 333.0 prod_desc1 10 15 20 23.0 25.0
6 333.0 prod_desc2 25 5 25 10.0 NaN
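Note that ffill only fills missing values (NaN). If the blanks were already replaced with the no_prod_id placeholder, as in the question, they have to become NaN again first. The following is a minimal sketch of one way to do that, assuming the placeholder is the only non-numeric entry in prod_id; the frame is a cut-down reconstruction of the question's data, not the original.

```python
import pandas as pd

# Hypothetical, cut-down reconstruction of the placeholder-filled frame.
df = pd.DataFrame({
    "prod_id": [1111, "no_prod_id", "no_prod_id", 2222, 2223, 333, "no_prod_id"],
    "prod_description": ["prod_desc1", "prod_desc2", "prod_desc3",
                         "prod_desc1", "prod_desc1", "prod_desc1", "prod_desc2"],
    "col1": [10, 15, 15, 5, 15, 10, 25],
})

# Non-numeric entries such as "no_prod_id" are coerced to NaN,
# which is what ffill looks for.
ids = pd.to_numeric(df["prod_id"], errors="coerce")

# Copy each main product id down to its sub-product rows, then cast back
# to int to drop the float representation that NaN forced on the column.
df["prod_id"] = ids.ffill().astype(int)

print(df)
```

Casting back to int is optional; it only avoids ids being displayed as 1111.0.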
Then we drop prod_description, group by prod_id, and sum the remaining columns:
df_new = df.drop('prod_description',axis=1).groupby('prod_id').sum().reset_index()
df_new.insert(1,'prod_description','prod_desc1') # reinsert columns.
The result; note that I added a custom sort only to match your output.
idx = df_new['prod_id'].astype(str).str[1].astype(int).sort_values().index
print(df_new.loc[idx])
prod_id prod_description col1 col2 col3 col4 col5
1 1111.0 prod_desc1 40 54 51 73.0 25.0
2 2222.0 prod_desc1 5 10 15 7.0 10.0
3 2223.0 prod_desc1 15 10 25 10.0 10.0
0 333.0 prod_desc1 35 20 45 33.0 25.0
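The sort above keys on the second character of each id, which happens to work for this sample but is unlikely to generalize. As a hedged alternative, the summed rows can be reordered by the order in which the ids first appear in the source frame. This sketch uses a cut-down copy of the sample data, and the names first_seen and ordered are illustrative only.

```python
import numpy as np
import pandas as pd

# Cut-down copy of the sample data (two quantity columns only).
df = pd.DataFrame({
    "prod_id": [1111, np.nan, np.nan, 2222, 2223, 333, np.nan],
    "prod_description": ["prod_desc1", "prod_desc2", "prod_desc3",
                         "prod_desc1", "prod_desc1", "prod_desc1", "prod_desc2"],
    "col1": [10, 15, 15, 5, 15, 10, 25],
    "col2": [20, 17, 17, 10, 10, 15, 5],
})
df["prod_id"] = df["prod_id"].ffill()

# groupby sorts its keys by default, so 333 ends up in the first row.
df_new = df.drop("prod_description", axis=1).groupby("prod_id").sum().reset_index()

# Ids in the order they first appear in the source frame.
first_seen = df["prod_id"].unique()

# Reorder the summed rows to follow that order.
ordered = df_new.set_index("prod_id").loc[first_seen].reset_index()

print(ordered)
```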
Or, as anky_91 kindly pointed out, we can cut this down to two simple lines by using .assign and sort=False:
df['prod_id'] = df['prod_id'].ffill()
df.groupby("prod_id", sort=False, as_index=False).sum().assign(
prod_description="prod_desc1"
).reindex(df.columns, axis=1)
Result
prod_id prod_description col1 col2 col3 col4 col5
0 1111.0 prod_desc1 40 54 51 73.0 25.0
1 2222.0 prod_desc1 5 10 15 7.0 10.0
2 2223.0 prod_desc1 15 10 25 10.0 10.0
3 333.0 prod_desc1 35 20 45 33.0 25.0
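For reference, here is a self-contained sketch of the whole approach on the sample data. It differs from the code above in two ways, both of which are choices made for this sketch rather than part of the original answer: prod_description is taken from the first row of each group instead of being hard-coded, which assumes the main product row always comes before its sub-products, and prod_id is cast back to int.

```python
import numpy as np
import pandas as pd

# The sample DF from the question, with blank prod_id cells as NaN.
df = pd.DataFrame({
    "prod_id": [1111, np.nan, np.nan, 2222, 2223, 333, np.nan],
    "prod_description": ["prod_desc1", "prod_desc2", "prod_desc3",
                         "prod_desc1", "prod_desc1", "prod_desc1", "prod_desc2"],
    "col1": [10, 15, 15, 5, 15, 10, 25],
    "col2": [20, 17, 17, 10, 10, 15, 5],
    "col3": [30, 16, 5, 15, 25, 20, 25],
    "col4": [45, 28, np.nan, 7, 10, 23, 10],
    "col5": [25, np.nan, np.nan, 10, 10, 25, np.nan],
})

# Give every sub-product row the id of the main product above it.
df["prod_id"] = df["prod_id"].ffill().astype(int)

# Keep the first description per group and sum every quantity column.
value_cols = [c for c in df.columns if c.startswith("col")]
agg_spec = {"prod_description": "first"}
agg_spec.update({c: "sum" for c in value_cols})

# sort=False keeps the ids in their order of first appearance.
result = df.groupby("prod_id", sort=False, as_index=False).agg(agg_spec)

print(result)
```

The sums skip NaN, so a product whose sub-products have missing quantities still gets the total of the values that are present.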