How can I convert a groupby object to a list of lists and append a new column/value to the lists within the list?

I have the following sample df:

import pandas as pd

list_of_customers =[
[202206,'patrick','lemon','fruit','citrus',10,'tesco'],
[202206,'paul','lemon','fruit','citrus',20,'tesco'],
[202206,'frank','lemon','fruit','citrus',10,'tesco'],
[202206,'jim','lemon','fruit','citrus',20,'tesco'], 
[202206,'wendy','watermelon','fruit','',39,'tesco'],
[202206,'greg','watermelon','fruit','',32,'sainsburys'],
[202209,'wilson','carrot','vegetable','',34,'sainsburys'],    
[202209,'maree','carrot','vegetable','',22,'aldi'],
[202209,'greg','','','','','aldi'], 
[202209,'wilmer','sprite','drink','',22,'aldi'],
[202209,'jed','lime','fruit','citrus',40,'tesco'],    
[202209,'michael','lime','fruit','citrus',12,'aldi'],
[202209,'andrew','','','','33','aldi'], 
[202209,'ahmed','lime','fruit','fruit',33,'aldi'] 
]

df = pd.DataFrame(list_of_customers,columns = ['date','customer','item','item_type','fruit_type','cost','store'])

print(df)

I then define a variable for each category we need to aggregate:

fruit_variable = df['item_type'].isin(['fruit'])

vegetable_variable = df['item_type'].isin(['vegetable'])

citrus_variable = df['fruit_type'].isin(['citrus'])


I then want to aggregate each variable and merge the results into one dataframe. For each variable I want a separate field (variable_number) holding a number assigned to that variable, so we know which rule was used for the aggregation. So for fruit_variable the field will be '01', for vegetable_variable it will be '02', and so on. Note that we can't assign a new field for each variable and include it in the groupby fields, because the rules are not mutually exclusive (i.e. some rows need to be aggregated under both fruit_variable and citrus_variable).
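To illustrate the overlap, here is a minimal sketch on a hypothetical three-row frame (the name `toy` and its contents are made up for illustration, not the sample data above). It shows a row that is matched by both the fruit rule and the citrus rule, which is why a single label column per row would not be enough:

```python
import pandas as pd

# Hypothetical miniature of the sample data: only the category columns.
toy = pd.DataFrame({
    'item':       ['lemon', 'watermelon', 'carrot'],
    'item_type':  ['fruit', 'fruit', 'vegetable'],
    'fruit_type': ['citrus', '', ''],
})

fruit_mask = toy['item_type'].isin(['fruit'])
citrus_mask = toy['fruit_type'].isin(['citrus'])

# Rows matched by both rules; one label column could hold only one of them.
overlap = toy[fruit_mask & citrus_mask]
print(overlap['item'].tolist())
```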

# Parentheses keep all three aggregations in one tuple
list_agg = (
    df.where(fruit_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
    df.where(vegetable_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
    df.where(citrus_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
)

print(list_agg)
type(list_agg)

df_agg = pd.DataFrame(list_agg, columns = ['date','store','cost'])
print(df_agg)

I am having trouble converting the tuple to a dataframe.

I can convert the groupby objects to lists using .to_records().tolist(), but that still leaves me with the problem of how to add the new column with the variable number.

Note this is a much smaller subset of the actual problem. In this example I am hoping to get a dataframe that looks like the one below:

[image: expected output dataframe]

Please let me know if any further information is required.

EDIT1:

I'm adding an additional variable to the example so that the groupby includes a value which will return a null cost. Rather than having an incremental variable_number, how can we define it so that the variable numbers are predefined (fruit_variable is '01', vegetable_variable is '02', citrus_variable is '01a', meat_variable is '03')?

fruit_variable = df['item_type'].isin(['fruit'])

vegetable_variable = df['item_type'].isin(['vegetable'])

citrus_variable = df['fruit_type'].isin(['citrus'])

meat_variable = df['item_type'].isin(['poultry'])


list_agg = [df.where(fruit_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
            df.where(vegetable_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
            df.where(citrus_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
            df.where(meat_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list)]

out = (pd.concat(list_agg, keys=[f'{v+1:02}' for v in range(len(list_agg))])
         .rename_axis(['variable_number', None])
         .reset_index('variable_number').reset_index(drop=True))

We are then looking for output that does not fail for meat_variable, which returns no values in the sample dataset.

[image: expected output dataframe]

IIUC, you can use concat:

list_agg = [df.where(fruit_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
            df.where(vegetable_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
            df.where(citrus_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list)]

out = (pd.concat(list_agg, keys=[f'{v+1:02}' for v in range(len(list_agg))])
         .rename_axis(['variable_number', None])
         .reset_index('variable_number').reset_index(drop=True))

Output:

>>> out
  variable_number      date       store  cost
0              01  202206.0  sainsburys    32
1              01  202206.0       tesco    99
2              01  202209.0        aldi    45
3              01  202209.0       tesco    40
4              02  202209.0        aldi    22
5              02  202209.0  sainsburys    34
6              03  202206.0       tesco    60
7              03  202209.0        aldi    12
8              03  202209.0       tesco    40
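The date column above prints as 202206.0 because df.where keeps every row and fills the non-matching ones with NaN, which upcasts integer columns to float. A minimal sketch on a hypothetical toy frame (the name `toy` and its values are invented for illustration) comparing df.where with plain boolean indexing, which drops the non-matching rows instead and leaves the dtypes alone:

```python
import pandas as pd

# Hypothetical toy frame, not the sample data from the question.
toy = pd.DataFrame({
    'date':      [202206, 202206, 202209],
    'store':     ['tesco', 'aldi', 'aldi'],
    'item_type': ['fruit', 'vegetable', 'fruit'],
    'cost':      [10, 20, 30],
})
mask = toy['item_type'].isin(['fruit'])

masked = toy.where(mask)   # same shape; non-matching rows become NaN
filtered = toy[mask]       # non-matching rows are dropped

print(masked['date'].dtype)    # float, because NaN was introduced
print(filtered['date'].dtype)  # integer dtype is preserved

# groupby skips NaN keys by default, so both give the same sums.
sum_masked = masked.groupby(['date', 'store'])['cost'].sum()
sum_filtered = filtered.groupby(['date', 'store'])['cost'].sum()
```

If integer dates matter in the final frame, boolean indexing (df[mask]) avoids the round trip through float.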

The exact logic is unclear, but you might want to use concat with a dict comprehension of groupby sums:

variables = {'01': df['item_type'].isin(['fruit']),
             '02': df['item_type'].isin(['vegetable']),
             '03': df['fruit_type'].isin(['citrus']),
            }

out = (pd.concat({k: df[m].groupby(['date', 'store'], as_index=False)['cost'].sum()
                  for k, m in variables.items()}, names=['variable_number', None])
         .reset_index('variable_number')
      )

print(out)

Output:

  variable_number    date       store  cost
0              01  202206  sainsburys    32
1              01  202206       tesco    99
2              01  202209        aldi    45
3              01  202209       tesco    40
0              02  202209        aldi    22
1              02  202209  sainsburys    34
0              03  202206       tesco    60
1              03  202209        aldi    12
2              03  202209       tesco    40
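For the predefined labels asked about in EDIT1, one possible variation is to key the dictionary by the labels themselves and skip any rule that matches no rows. This is only a sketch under stated assumptions: the five-row frame is a hypothetical miniature of the sample data, blank or string costs are assumed to be safe to coerce with pd.to_numeric, and a rule with no matches (here the poultry rule) is assumed to be something you want omitted rather than shown as an empty row.

```python
import pandas as pd

# Hypothetical miniature of the sample data.
df = pd.DataFrame(
    [[202206, 'lemon', 'fruit', 'citrus', 10, 'tesco'],
     [202206, 'watermelon', 'fruit', '', 39, 'tesco'],
     [202209, 'carrot', 'vegetable', '', 34, 'sainsburys'],
     [202209, 'lime', 'fruit', 'citrus', 12, 'aldi'],
     [202209, '', '', '', '', 'aldi']],
    columns=['date', 'item', 'item_type', 'fruit_type', 'cost', 'store'])

# Assumption: blank or string costs should become numbers / NaN.
df['cost'] = pd.to_numeric(df['cost'], errors='coerce')

# Predefined labels mapped to their boolean masks.
variables = {
    '01':  df['item_type'].isin(['fruit']),
    '02':  df['item_type'].isin(['vegetable']),
    '01a': df['fruit_type'].isin(['citrus']),
    '03':  df['item_type'].isin(['poultry']),  # matches nothing here
}

frames = []
for label, mask in variables.items():
    agg = df[mask].groupby(['date', 'store'], as_index=False)['cost'].sum()
    if not agg.empty:  # skip rules with no matching rows
        frames.append(agg.assign(variable_number=label))

out = pd.concat(frames, ignore_index=True)
out = out[['variable_number', 'date', 'store', 'cost']]

rows = out.values.tolist()  # list of lists, one per aggregated row
print(out)
```

Tagging each aggregated frame with assign before concatenating keeps the label as an ordinary column, and `rows` gives the list-of-lists form from the question title with the label as the first element of each inner list.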
