How to get the sum of products from columns in two DataFrames using PySpark
There are two DataFrames. The first contains the price information as follows (day 1 to day 100, as 100 rows):
StoreId,ItemID, Date,Price
HH-101,item1, d_1, €9
HH-101,item1, d_2, €7
……………………………
DH-101,item1, d_90, €4
……………………………
HH-101,item1, d_100, €3
The second DataFrame contains the sales information as shown (day 1 to day 100, as 100 columns, with one row per item):
Stored_ID, ItemID, d-1, d-2, d-3,……. d-90,d-100
HH-101 , item1 , 2 , 4 , 0,………..,12 ,22
HH-101 , item2 , 1 , 0 , 3 ……………,3 ,3
What is the optimal PySpark script to produce another DataFrame with a new column that holds the sum of (number of units * sales price) for each item?
For example, for store HH-101 and item1:
2*9+ 4*7+........+.....+...12*4+22*3
Is there a single step for this, instead of writing out the sum of products for more than 100 columns?
Here's a simpler example derived from your sample DataFrames. I think it should also scale to your real data.
df1.show()
+-------+------+----+-----+
|StoreId|ItemID|Date|Price|
+-------+------+----+-----+
| HH-101| item1| d_1| €9|
| HH-101| item1| d_2| €7|
+-------+------+----+-----+
df2.show()
+-------+------+---+---+
|StoreId|ItemID|d_1|d_2|
+-------+------+---+---+
| HH-101| item1| 2| 4|
| HH-101| item2| 1| 0|
+-------+------+---+---+
You can unpivot df2 using stack, with a query string generated from a list comprehension over the column names. Then join to df1 on the first 3 columns (StoreId, ItemID, Date), group by the store ID and item ID, and take the sum of price * number.
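from pyspark.sql import functions as F

# Every column after StoreId and ItemID holds the units sold on one day
day_cols = df2.columns[2:]

result = df2.selectExpr(
    'StoreId', 'ItemID',
    # The first argument of stack is the number of rows to produce, so derive it from the column list
    'stack(%d, ' % len(day_cols) + ', '.join(["'%s', %s" % (c, c) for c in day_cols]) + ') as (Date, Number)'
    # for the sample df2: "stack(2, 'd_1', d_1, 'd_2', d_2) as (Date, Number)"
).join(
    df1, df1.columns[:3]
).groupBy(
    'StoreId', 'ItemID'
).agg(
    # substr(Price, 2) drops the leading currency symbol before the cast
    F.expr('sum(Number * float(substr(Price, 2))) as Total')
)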
result.show()
+-------+------+-----+
|StoreId|ItemID|Total|
+-------+------+-----+
| HH-101| item1| 46.0|
+-------+------+-----+
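Since stack takes the number of rows as its first argument, that count has to track the column list once there are 100 day columns instead of 2. Below is a minimal sketch of a helper that builds the expression string for any list of columns. The name build_stack_expr and the generated d_1 ... d_100 list are hypothetical and used only for illustration. The backticks are Spark SQL identifier quoting, added here on the assumption that some column names might contain characters such as a hyphen (d-1), which would otherwise be parsed as a subtraction.

```python
def build_stack_expr(value_cols, key_name="Date", value_name="Number"):
    """Build a Spark SQL stack(...) expression that unpivots value_cols.

    Hypothetical helper for illustration; it only assembles a string.
    """
    if not value_cols:
        raise ValueError("need at least one column to unpivot")
    # Each column contributes a ('label', `column`) pair. The label is the
    # column name itself, so it must match the Date values used in df1.
    pairs = ", ".join("'%s', `%s`" % (c, c) for c in value_cols)
    return "stack(%d, %s) as (%s, %s)" % (
        len(value_cols), pairs, key_name, value_name
    )


# Small case, matching the two-column sample
print(build_stack_expr(["d_1", "d_2"]))
# prints: stack(2, 'd_1', `d_1`, 'd_2', `d_2`) as (Date, Number)

# Hypothetical 100-column list standing in for df2.columns[2:]
day_cols = ["d_%d" % i for i in range(1, 101)]
expr = build_stack_expr(day_cols)
```

The resulting string could then be passed as df2.selectExpr('StoreId', 'ItemID', expr), with the join and aggregation left unchanged. Note that the labels only join correctly if they are spelled the same way as the Date values in df1 (for example d_1 in both, not d-1 in one and d_1 in the other).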