Pandas - Create a new attribute and value by dividing within groups of EAV-format data

Question

Suppose I have a dataframe like this:

import pandas as pd
import numpy as np

data = [[5123, '2021-01-01 00:00:00', 'cash','sales$', 105],
        [5123, '2021-01-01 00:00:00', 'cash','items', 20],
        [5123, '2021-01-01 00:00:00', 'card','sales$', 355],
        [5123, '2021-01-01 00:00:00', 'card','items', 50],
        [5123, '2021-01-02 00:00:00', 'cash','sales$', np.nan],
        [5123, '2021-01-02 00:00:00', 'cash','items', np.nan],
        [5123, '2021-01-02 00:00:00', 'card','sales$', 170],
        [5123, '2021-01-02 00:00:00', 'card','items', 35]]

columns = ['Store', 'Date', 'Payment Method', 'Attribute', 'Value']

df = pd.DataFrame(data = data, columns = columns)

Store  Date                 Payment Method  Attribute  Value
5123   2021-01-01 00:00:00  cash            sales$     105
5123   2021-01-01 00:00:00  cash            items      20
5123   2021-01-01 00:00:00  card            sales$     355
5123   2021-01-01 00:00:00  card            items      50
5123   2021-01-02 00:00:00  cash            sales$     NaN
5123   2021-01-02 00:00:00  cash            items      NaN
5123   2021-01-02 00:00:00  card            sales$     170
5123   2021-01-02 00:00:00  card            items      35

I would like to create a new attribute called "average item price", computed for each Store/Date/Payment Method by dividing sales$ by items. For example, for store 5123 on 2021-01-01 with cash, I would like a new row with the attribute "average item price" and a value of 5.25.

I realize that I could pivot this data out so that there is one column for sales and one for items, divide the two columns, and then restack, but is there a better way to do this without having to pivot?

Store  Date                 Payment Method  Attribute           Value
5123   2021-01-01 00:00:00  cash            sales$              105
5123   2021-01-01 00:00:00  cash            items               20
5123   2021-01-01 00:00:00  cash            average item price  5.25
5123   2021-01-01 00:00:00  card            sales$              355
5123   2021-01-01 00:00:00  card            items               50
5123   2021-01-01 00:00:00  card            average item price  7.10
5123   2021-01-02 00:00:00  cash            sales$              NaN
5123   2021-01-02 00:00:00  cash            items               NaN
5123   2021-01-02 00:00:00  cash            average item price  NaN
5123   2021-01-02 00:00:00  card            sales$              170
5123   2021-01-02 00:00:00  card            items               35
5123   2021-01-02 00:00:00  card            average item price  4.86

Answer 1

You can use pivot_table to get the sum of sales/items per group, then compute the average value and merge it with the original data:

s = (df.pivot_table(index=['Store', 'Date', 'Payment Method'],
                    columns='Attribute', values='Value', aggfunc='sum')
       .assign(avg=lambda d: d['sales$']/d['items'])
       ['avg']
     )

df.merge(s, left_on=['Store', 'Date', 'Payment Method'], right_index=True)

Output:

   Store                 Date Payment Method Attribute  Value       avg
0   5123  2021-01-01 00:00:00           cash    sales$  105.0  5.250000
1   5123  2021-01-01 00:00:00           cash     items   20.0  5.250000
2   5123  2021-01-01 00:00:00           card    sales$  355.0  7.100000
3   5123  2021-01-01 00:00:00           card     items   50.0  7.100000
4   5123  2021-01-02 00:00:00           cash    sales$    NaN       NaN
5   5123  2021-01-02 00:00:00           cash     items    NaN       NaN
6   5123  2021-01-02 00:00:00           card    sales$  170.0  4.857143
7   5123  2021-01-02 00:00:00           card     items   35.0  4.857143
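
The merge above attaches avg as an extra column. If the value is wanted as additional rows, as in the question's expected output, a minimal sketch might look like the following. The names keys, wide, avg_rows and long_result are made up for this example.

```python
import numpy as np
import pandas as pd

keys = ["Store", "Date", "Payment Method"]
df = pd.DataFrame(
    [[5123, "2021-01-01 00:00:00", "cash", "sales$", 105],
     [5123, "2021-01-01 00:00:00", "cash", "items", 20],
     [5123, "2021-01-01 00:00:00", "card", "sales$", 355],
     [5123, "2021-01-01 00:00:00", "card", "items", 50],
     [5123, "2021-01-02 00:00:00", "cash", "sales$", np.nan],
     [5123, "2021-01-02 00:00:00", "cash", "items", np.nan],
     [5123, "2021-01-02 00:00:00", "card", "sales$", 170],
     [5123, "2021-01-02 00:00:00", "card", "items", 35]],
    columns=keys + ["Attribute", "Value"],
)

# Per-group totals, one column per attribute (same pivot_table call as above).
wide = df.pivot_table(index=keys, columns="Attribute", values="Value", aggfunc="sum")

# Shape the ratio like the original EAV records: key columns, Attribute, Value.
avg_rows = (
    (wide["sales$"] / wide["items"])
    .rename("Value")
    .reset_index()
    .assign(Attribute="average item price")[df.columns]
)

# Append the new records to the original ones.
long_result = pd.concat([df, avg_rows], ignore_index=True)
print(long_result)
```

The new rows land at the end of the frame, so a sort on the key columns is still needed if the rows of each group should sit together.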

Answer 2

One option is to set the index, do the computation, and use categoricals to get a sorted output that matches yours:

cols = df.columns[:-1].tolist()
temp = df.set_index(cols)
# computation
summary = temp.xs('sales$', level='Attribute').div(temp.xs('items', level='Attribute'))
# add attribute to index, with a name:
summary = summary.set_index([['average item price'] * len(summary)], 
                            append = True)
summary.index = summary.index.set_names('Attribute', level = -1)
output = pd.concat([temp, summary]).reset_index()
# create categoricals and sort:
dtype = pd.CategoricalDtype(['sales$', 'items', 'average item price'], ordered = True)
output.Attribute = output.Attribute.astype(dtype)
dtype = pd.CategoricalDtype(['cash', 'card'], ordered = True)
output['Payment Method'] = output['Payment Method'].astype(dtype)
output.sort_values(cols)

    Store                 Date Payment Method           Attribute       Value
0    5123  2021-01-01 00:00:00           cash              sales$  105.000000
1    5123  2021-01-01 00:00:00           cash               items   20.000000
8    5123  2021-01-01 00:00:00           cash  average item price    5.250000
2    5123  2021-01-01 00:00:00           card              sales$  355.000000
3    5123  2021-01-01 00:00:00           card               items   50.000000
9    5123  2021-01-01 00:00:00           card  average item price    7.100000
4    5123  2021-01-02 00:00:00           cash              sales$         NaN
5    5123  2021-01-02 00:00:00           cash               items         NaN
10   5123  2021-01-02 00:00:00           cash  average item price         NaN
6    5123  2021-01-02 00:00:00           card              sales$  170.000000
7    5123  2021-01-02 00:00:00           card               items   35.000000
11   5123  2021-01-02 00:00:00           card  average item price    4.857143
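
To see why the division above lines up without a pivot: xs drops the Attribute level, so both slices end up indexed by Store, Date and Payment Method only, and div matches rows by those labels rather than by position. A small sketch of that step on its own (the variable names are chosen for this example):

```python
import numpy as np
import pandas as pd

keys = ["Store", "Date", "Payment Method"]
df = pd.DataFrame(
    [[5123, "2021-01-01 00:00:00", "cash", "sales$", 105],
     [5123, "2021-01-01 00:00:00", "cash", "items", 20],
     [5123, "2021-01-01 00:00:00", "card", "sales$", 355],
     [5123, "2021-01-01 00:00:00", "card", "items", 50],
     [5123, "2021-01-02 00:00:00", "cash", "sales$", np.nan],
     [5123, "2021-01-02 00:00:00", "cash", "items", np.nan],
     [5123, "2021-01-02 00:00:00", "card", "sales$", 170],
     [5123, "2021-01-02 00:00:00", "card", "items", 35]],
    columns=keys + ["Attribute", "Value"],
)

temp = df.set_index(keys + ["Attribute"])

# Each slice keeps only the three key levels in its index.
sales = temp.xs("sales$", level="Attribute")
items = temp.xs("items", level="Attribute")
print(sales.index.names)

# div aligns on the index labels, so the row order of either slice does not matter.
ratio = sales.div(items)
ratio_reversed = sales.div(items.iloc[::-1])
print(ratio)
```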

Answer 3

pivot, then add the new rows with pd.concat after assigning the "Attribute" as needed:

# keyword arguments: pivot no longer accepts these positionally in pandas 2.0+
pivoted = df.pivot(index=["Store", "Date", "Payment Method"], columns="Attribute", values="Value")
avg = (pivoted["sales$"].div(pivoted["items"])
       .rename("Value").reset_index()
       .assign(Attribute="average item price"))
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
output = (pd.concat([df, avg], ignore_index=True)
          .sort_values(["Store", "Date", "Payment Method"])
          .reset_index(drop=True)
          )

>>> output
    Store                 Date Payment Method           Attribute       Value
0    5123  2021-01-01 00:00:00           card              sales$  355.000000
1    5123  2021-01-01 00:00:00           card               items   50.000000
2    5123  2021-01-01 00:00:00           card  average item price    7.100000
3    5123  2021-01-01 00:00:00           cash              sales$  105.000000
4    5123  2021-01-01 00:00:00           cash               items   20.000000
5    5123  2021-01-01 00:00:00           cash  average item price    5.250000
6    5123  2021-01-02 00:00:00           card              sales$  170.000000
7    5123  2021-01-02 00:00:00           card               items   35.000000
8    5123  2021-01-02 00:00:00           card  average item price    4.857143
9    5123  2021-01-02 00:00:00           cash              sales$         NaN
10   5123  2021-01-02 00:00:00           cash               items         NaN
11   5123  2021-01-02 00:00:00           cash  average item price         NaN
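
Sorting on the key columns puts card before cash, which differs from the order in the question. One possible way to keep the groups in their original order of appearance is to number them with groupby(..., sort=False).ngroup() and do a stable sort on that number. This is a sketch, not part of the answer above; combined, group_order and ordered are illustrative names.

```python
import numpy as np
import pandas as pd

keys = ["Store", "Date", "Payment Method"]
df = pd.DataFrame(
    [[5123, "2021-01-01 00:00:00", "cash", "sales$", 105],
     [5123, "2021-01-01 00:00:00", "cash", "items", 20],
     [5123, "2021-01-01 00:00:00", "card", "sales$", 355],
     [5123, "2021-01-01 00:00:00", "card", "items", 50],
     [5123, "2021-01-02 00:00:00", "cash", "sales$", np.nan],
     [5123, "2021-01-02 00:00:00", "cash", "items", np.nan],
     [5123, "2021-01-02 00:00:00", "card", "sales$", 170],
     [5123, "2021-01-02 00:00:00", "card", "items", 35]],
    columns=keys + ["Attribute", "Value"],
)

wide = df.pivot(index=keys, columns="Attribute", values="Value")
avg_rows = (
    wide["sales$"].div(wide["items"])
    .rename("Value")
    .reset_index()
    .assign(Attribute="average item price")
)
combined = pd.concat([df, avg_rows], ignore_index=True)

# Number each key group by first appearance instead of by sorted key.
group_order = combined.groupby(keys, sort=False).ngroup().to_numpy()

# A stable sort keeps sales$, items, average in that order inside each group.
ordered = combined.iloc[np.argsort(group_order, kind="stable")].reset_index(drop=True)
print(ordered)
```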

Answer 4

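Solution 1:

def function1(dd: pd.DataFrame):
    # Assumes the sales$ row comes directly before the items row in each group,
    # so Value / Value.shift(-1) on the first row is sales$ / items.
    dd1 = (dd.assign(Value=lambda d: d.Value / d.Value.shift(-1))
             .head(1)
             .assign(Attribute='average item price'))
    return pd.concat([dd, dd1])

df.groupby(['Store','Date','Payment Method'],sort=False).apply(function1).reset_index(drop=True).pipe(print)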

Output:

    Store                 Date Payment Method           Attribute       Value
0    5123  2021-01-01 00:00:00           cash              sales$  105.000000
1    5123  2021-01-01 00:00:00           cash               items   20.000000
2    5123  2021-01-01 00:00:00           cash  average item price    5.250000
3    5123  2021-01-01 00:00:00           card              sales$  355.000000
4    5123  2021-01-01 00:00:00           card               items   50.000000
5    5123  2021-01-01 00:00:00           card  average item price    7.100000
6    5123  2021-01-02 00:00:00           cash              sales$         NaN
7    5123  2021-01-02 00:00:00           cash               items         NaN
8    5123  2021-01-02 00:00:00           cash  average item price         NaN
9    5123  2021-01-02 00:00:00           card              sales$  170.000000
10   5123  2021-01-02 00:00:00           card               items   35.000000
11   5123  2021-01-02 00:00:00           card  average item price    4.857143
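
Solution 1 depends on the sales$ row sitting immediately before the items row in every group. A sketch that looks the two values up by label instead, assuming each group has at most one row per attribute, could look like this. The helper name add_average is made up for this example, and the groups are iterated directly rather than passed through apply.

```python
import numpy as np
import pandas as pd

keys = ["Store", "Date", "Payment Method"]
df = pd.DataFrame(
    [[5123, "2021-01-01 00:00:00", "cash", "sales$", 105],
     [5123, "2021-01-01 00:00:00", "cash", "items", 20],
     [5123, "2021-01-01 00:00:00", "card", "sales$", 355],
     [5123, "2021-01-01 00:00:00", "card", "items", 50],
     [5123, "2021-01-02 00:00:00", "cash", "sales$", np.nan],
     [5123, "2021-01-02 00:00:00", "cash", "items", np.nan],
     [5123, "2021-01-02 00:00:00", "card", "sales$", 170],
     [5123, "2021-01-02 00:00:00", "card", "items", 35]],
    columns=keys + ["Attribute", "Value"],
)


def add_average(group: pd.DataFrame) -> pd.DataFrame:
    """Return the group with one extra 'average item price' row appended."""
    # Look values up by attribute name, so row order inside the group is irrelevant.
    values = group.set_index("Attribute")["Value"]
    average = values.get("sales$", np.nan) / values.get("items", np.nan)
    # Reuse the first row for the key columns and overwrite Attribute and Value.
    new_row = group.iloc[[0]].assign(Attribute="average item price", Value=average)
    return pd.concat([group, new_row])


# sort=False keeps the groups in their order of first appearance.
result = pd.concat(
    [add_average(group) for _, group in df.groupby(keys, sort=False)],
    ignore_index=True,
)
print(result)
```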

Or you can use pandasql:

def function1(dd:pd.DataFrame):
    return dd.sql("""
        select * from self
        union all
        select tb1.Store, tb1.Date, tb1.[Payment Method],
               'average item price' Attribute,
               round(tb1.Value/tb2.value,2) as value
        from (select * from self where Attribute='sales$') tb1
        join (select * from self where Attribute='items') tb2
        """)

df.groupby(['Store','Date','Payment Method']).apply(function1).reset_index(drop=True)
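
The same SQL idea can be tried without a group-wise apply by joining the sales$ rows to the items rows on the three key columns. The sketch below swaps in the standard-library sqlite3 module together with DataFrame.to_sql and pd.read_sql_query instead of pandasql; the table name eav and the aliases s and i are assumptions made for this example.

```python
import sqlite3

import numpy as np
import pandas as pd

keys = ["Store", "Date", "Payment Method"]
df = pd.DataFrame(
    [[5123, "2021-01-01 00:00:00", "cash", "sales$", 105],
     [5123, "2021-01-01 00:00:00", "cash", "items", 20],
     [5123, "2021-01-01 00:00:00", "card", "sales$", 355],
     [5123, "2021-01-01 00:00:00", "card", "items", 50],
     [5123, "2021-01-02 00:00:00", "cash", "sales$", np.nan],
     [5123, "2021-01-02 00:00:00", "cash", "items", np.nan],
     [5123, "2021-01-02 00:00:00", "card", "sales$", 170],
     [5123, "2021-01-02 00:00:00", "card", "items", 35]],
    columns=keys + ["Attribute", "Value"],
)

# Self-join: pair each sales$ row with the items row that has the same keys.
query = """
SELECT s.Store, s.Date, s."Payment Method",
       'average item price' AS Attribute,
       s.Value / i.Value AS Value
FROM eav AS s
JOIN eav AS i
  ON  s.Store = i.Store
  AND s.Date = i.Date
  AND s."Payment Method" = i."Payment Method"
WHERE s.Attribute = 'sales$' AND i.Attribute = 'items'
"""

conn = sqlite3.connect(":memory:")
try:
    # NaN is stored as NULL, and NULL / NULL stays NULL in SQLite.
    df.to_sql("eav", conn, index=False)
    avg_rows = pd.read_sql_query(query, conn)
finally:
    conn.close()

result = pd.concat([df, avg_rows], ignore_index=True)
print(result)
```

The row order returned by the join is not guaranteed, so sort the result if the rows of each group need to sit together.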

Answer 1
1 2021-12-02 16:44:27

Answer 2
1 2021-12-02 20:45:41

Answer 3
0 2021-12-02 18:02:15

Answer 4
0 2023-01-16 06:19:11
