How can I convert a groupby object to a list of lists and append a new column/value to the lists within the list?
I have the following sample df:
import pandas as pd
list_of_customers =[
[202206,'patrick','lemon','fruit','citrus',10,'tesco'],
[202206,'paul','lemon','fruit','citrus',20,'tesco'],
[202206,'frank','lemon','fruit','citrus',10,'tesco'],
[202206,'jim','lemon','fruit','citrus',20,'tesco'],
[202206,'wendy','watermelon','fruit','',39,'tesco'],
[202206,'greg','watermelon','fruit','',32,'sainsburys'],
[202209,'wilson','carrot','vegetable','',34,'sainsburys'],
[202209,'maree','carrot','vegetable','',22,'aldi'],
[202209,'greg','','','','','aldi'],
[202209,'wilmer','sprite','drink','',22,'aldi'],
[202209,'jed','lime','fruit','citrus',40,'tesco'],
[202209,'michael','lime','fruit','citrus',12,'aldi'],
[202209,'andrew','','','','33','aldi'],
[202209,'ahmed','lime','fruit','fruit',33,'aldi']
]
df = pd.DataFrame(list_of_customers,columns = ['date','customer','item','item_type','fruit_type','cost','store'])
print(df)
I then define a boolean mask variable for each category we need to aggregate:
fruit_variable = df['item_type'].isin(['fruit'])
vegetable_variable = df['item_type'].isin(['vegetable'])
citrus_variable = df['fruit_type'].isin(['citrus'])
I then want to aggregate each variable and merge the results into one dataframe. For each variable I want a separate field (variable_number) holding a number assigned to that variable, so we know which variable rule was used for the aggregation. So for fruit_variable the field will be '01', for vegetable_variable it will be '02', and so on.
Note we can't assign a new field for each variable and include it in the groupby fields, because the rows are not mutually exclusive (i.e. some rows need to be aggregated under both fruit_variable and citrus_variable).
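The overlap described above can be checked directly. This is a minimal sketch, assuming a cut-down stand-in for the sample df (the four rows below are illustrative, not the full dataset):

```python
import pandas as pd

# Illustrative stand-in for the sample df: only the columns the masks need.
df = pd.DataFrame({
    'item_type':  ['fruit', 'fruit', 'vegetable', 'drink'],
    'fruit_type': ['citrus', '', '', ''],
    'cost':       [10, 39, 34, 22],
})

fruit_variable = df['item_type'].isin(['fruit'])
citrus_variable = df['fruit_type'].isin(['citrus'])

# Rows matched by both masks. A single label column on df could only hold
# one label per row, so these rows could not be grouped under both rules.
overlap = fruit_variable & citrus_variable
n_overlap = int(overlap.sum())
print(n_overlap)
```

Any non-zero count means one label column per row cannot represent every rule, which is why each mask is aggregated separately.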
list_agg = df.where(fruit_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(vegetable_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(citrus_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list)
print(list_agg)
type(list_agg)
df_agg = pd.DataFrame(list_agg, columns = ['date','store','cost'])
print(df_agg)
I am having trouble converting the tuple to a dataframe.
I can convert the groupby results to lists using .to_records().tolist(), but that still leaves me with the problem of how to add the new column with the variable number.
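For the list-of-lists route mentioned above, one possible sketch follows. It assumes a three-row stand-in for df and boolean indexing instead of df.where; the label '01' and the column order are illustrative choices, not part of the original post:

```python
import pandas as pd

# Illustrative stand-in for the sample df.
df = pd.DataFrame({
    'date':      [202206, 202206, 202209],
    'store':     ['tesco', 'tesco', 'aldi'],
    'item_type': ['fruit', 'fruit', 'fruit'],
    'cost':      [10, 20, 12],
})
fruit_variable = df['item_type'].isin(['fruit'])

# Aggregate only the rows selected by the mask.
agg = df[fruit_variable].groupby(['date', 'store'], as_index=False)['cost'].sum()

# to_records(index=False).tolist() yields a list of tuples; turn each one
# into a list and append the variable number as an extra value.
rows = [list(rec) + ['01'] for rec in agg.to_records(index=False).tolist()]

df_agg = pd.DataFrame(rows, columns=['date', 'store', 'cost', 'variable_number'])
print(df_agg)
```

Repeating this per mask and extending one shared list of rows would give a single list of lists to pass to pd.DataFrame.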
Note this is a much smaller subset of the actual problem. I am hoping to get a dataframe looking like below in this example:
Please let me know if any further information is required.
EDIT1:
I'm adding an additional variable to the example so that the groupby includes a category that returns no cost values. Also, rather than having an incremental variable_number, how can we define the numbers so that they are predefined (fruit_variable is '01', vegetable_variable is '02', citrus_variable is '01a', meat_variable is '03')?
fruit_variable = df['item_type'].isin(['fruit'])
vegetable_variable = df['item_type'].isin(['vegetable'])
citrus_variable = df['fruit_type'].isin(['citrus'])
meat_variable = df['item_type'].isin(['poultry'])
list_agg = [df.where(fruit_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(vegetable_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(citrus_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(meat_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list)]
out = (pd.concat(list_agg, keys=[f'{v+1:02}' for v in range(len(list_agg))])
.rename_axis(['variable_number', None])
.reset_index('variable_number').reset_index(drop=True))
We are then looking for output that does not fail for meat_variable, which matches no rows in the sample dataset.
IIUC, you can use concat:
list_agg = [df.where(fruit_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(vegetable_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list),
df.where(citrus_variable).groupby(['date','store'])[['cost']].sum().reset_index().agg(list)]
out = (pd.concat(list_agg, keys=[f'{v+1:02}' for v in range(len(list_agg))])
.rename_axis(['variable_number', None])
.reset_index('variable_number').reset_index(drop=True))
Output:
>>> out
   variable_number      date       store  cost
0               01  202206.0  sainsburys    32
1               01  202206.0       tesco    99
2               01  202209.0        aldi    45
3               01  202209.0       tesco    40
4               02  202209.0        aldi    22
5               02  202209.0  sainsburys    34
6               03  202206.0       tesco    60
7               03  202209.0        aldi    12
8               03  202209.0       tesco    40
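The date column prints as 202206.0 here because df.where keeps every row and fills the non-matching ones with NaN, which forces integer columns to float. A small sketch of the difference, assuming a two-row stand-in for df:

```python
import pandas as pd

# Illustrative stand-in for the sample df.
df = pd.DataFrame({
    'date':      [202206, 202209],
    'item_type': ['fruit', 'vegetable'],
    'cost':      [10, 34],
})
mask = df['item_type'].isin(['fruit'])

# where() keeps both rows; the non-matching row becomes NaN,
# so the integer 'date' column is upcast to a float dtype.
via_where = df.where(mask)

# Boolean indexing drops the non-matching row and leaves dtypes untouched.
via_index = df[mask]

print(via_where['date'].dtype, via_index['date'].dtype)
```

If integer dates are wanted in the result, df[mask] may be the simpler choice; alternatively the column could be cast back with astype(int) after aggregation.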
The exact logic is unclear, but you might want to use concat with a dictionary comprehension of groupby aggregations, where each key is the variable number:
variables = {'01': df['item_type'].isin(['fruit']),
'02': df['item_type'].isin(['vegetable']),
'03': df['fruit_type'].isin(['citrus']),
}
out = (pd.concat({k: df[m].groupby(['date', 'store'], as_index=False)['cost'].sum()
for k, m in variables.items()}, names=['variable_number', None])
.reset_index('variable_number')
)
print(out)
Output:
   variable_number    date       store  cost
0               01  202206  sainsburys    32
1               01  202206       tesco    99
2               01  202209        aldi    45
3               01  202209       tesco    40
0               02  202209        aldi    22
1               02  202209  sainsburys    34
0               03  202206       tesco    60
1               03  202209        aldi    12
2               03  202209       tesco    40
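Because the dictionary keys are free-form strings, the same pattern might also cover the predefined labels from EDIT1. The sketch below assumes a three-row stand-in for df and assumes that a category matching no rows (meat, labelled '03') should simply contribute no rows to the result; if a placeholder row is wanted instead, this would need adjusting:

```python
import pandas as pd

# Illustrative stand-in for the sample df.
df = pd.DataFrame({
    'date':       [202206, 202206, 202209],
    'store':      ['tesco', 'tesco', 'aldi'],
    'item_type':  ['fruit', 'fruit', 'vegetable'],
    'fruit_type': ['citrus', '', ''],
    'cost':       [10, 39, 22],
})

# Predefined labels as dictionary keys; insertion order sets the output order.
variables = {
    '01':  df['item_type'].isin(['fruit']),
    '02':  df['item_type'].isin(['vegetable']),
    '01a': df['fruit_type'].isin(['citrus']),
    '03':  df['item_type'].isin(['poultry']),  # matches nothing in this data
}

frames = {k: df[m].groupby(['date', 'store'], as_index=False)['cost'].sum()
          for k, m in variables.items()}

# Assumption: empty results are dropped rather than kept as placeholders.
frames = {k: f for k, f in frames.items() if not f.empty}

out = (pd.concat(frames, names=['variable_number', None])
         .reset_index('variable_number')
         .reset_index(drop=True))
print(out)
```

Filtering out the empty frames before concat is a design choice to keep the result free of all-empty pieces; the label '03' then does not appear in out at all.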