Loop and lookup several rows in another table in pyspark
I have two dataframes. Table 1: a user purchased an item on day 0. Table 2: the price of the item over x days (it fluctuates day to day).
I want to match each user purchase to the item's price. Is there a better way to do this without looping over every row and then applying a function?
My desired final output: what is the rolling 3-day average for apples when John bought them on 1/1?
First Table: John's Table (there could be more users)
Date Item Price
1/1/2018 Apple 1
2/14/2018 Grapes 1.99
1/25/2018 Pineapple 1.5
5/25/2018 Apple 0.98
Reference Table: Price Table
Date Item Price
1/1/2018 Apple 1
1/2/2018 Apple 0.98
1/3/2018 Apple 0.88
1/4/2018 Apple 1.2
1/5/2018 Apple 1.3
1/6/2018 Apple 1.5
1/7/2018 Apple 1.05
1/8/2018 Apple 1.025
2/10/2018 Grapes 3.10
2/11/2018 Grapes 0.10
2/12/2018 Grapes 5.00
2/13/2018 Grapes 0.40
2/14/2018 Grapes 1.00
2/15/2018 Grapes 2.70
2/16/2018 Grapes 0.40
2/17/2018 Grapes 0.40
1/23/2018 Pineapple 0.50
1/24/2018 Pineapple 0.60
1/25/2018 Pineapple 0.70
1/26/2018 Pineapple 0.60
1/27/2018 Pineapple 0.60
1/28/2018 Pineapple 0.50
1/29/2018 Pineapple 0.70
1/30/2018 Pineapple 0.50
5/21/2018 Apple 7.00
5/22/2018 Apple 6.00
5/23/2018 Apple 5.00
5/24/2018 Apple 6.00
5/25/2018 Apple 5.00
Example for Apple:
Date      Item   Price  Rolling_Average
1/1/2018  Apple  1      # bought on this date
1/2/2018  Apple  0.98   # so next 3 days
1/3/2018  Apple  0.88   0.953333333
1/4/2018  Apple  1.2    1.02
1/5/2018  Apple  1.3    1.126666667
1/6/2018  Apple  1.5    1.333333333
1/7/2018  Apple  1.05   1.283333333
1/8/2018  Apple  1.025  1.191666667
df_price.withColumn('rolling_Average', f.avg("Price").over(Window.partitionBy(f.window("Date", "3 days"))))
So, if I understand the problem correctly, you want to calculate a 3-day average for each item. Then you simply join table 1 to table 2 to get only the sold items, with their average price next to the actual price. You can do this using a window function.
In pyspark it can be something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Trailing 3-day average: with the dates sorted descending, the current row
# plus the next two rows cover a given day and the two days before it.
# (Note: `date` should be a DateType column; parse strings with F.to_date first,
# otherwise the ordering is lexicographic, not chronological.)
df_price = df_price.withColumn(
    'rolling_average',
    F.avg(df_price.price).over(
        Window.partitionBy(df_price.item).orderBy(
            df_price.date.desc()
        ).rowsBetween(0, 2)
    )
)
Then you simply join your table to the result of this. In SQL it would look like this:
WITH b as (
SELECT '1/1/2018' as date_p, 'Apple' as item, 1 as price
UNION ALL SELECT '1/2/2018' as date_p, 'Apple' as item, 0.98 as price
UNION ALL SELECT '1/3/2018' as date_p, 'Apple' as item, 0.88 as price
UNION ALL SELECT '1/4/2018' as date_p, 'Apple' as item, 1.2 as price
UNION ALL SELECT '1/5/2018' as date_p, 'Apple' as item, 1.3 as price
UNION ALL SELECT '1/6/2018' as date_p, 'Apple' as item, 1.5 as price
UNION ALL SELECT '1/7/2018' as date_p, 'Apple' as item, 1.05 as price
UNION ALL SELECT '1/8/2018' as date_p, 'Apple' as item, 1.025 as price
UNION ALL SELECT '2/10/2018' as date_p, 'Grape' as item, 3.10 as price)
SELECT *, AVG(price) OVER (
PARTITION BY item ORDER BY date_p DESC
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) FROM b
If you simply want to group by a specific item (setting your Price Table to df2):
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('Date')
df2['Rolling'] = df2[df2['Item']=='Apple']['Price'].rolling(3).mean()
Printing df2[df2['Item']=='Apple'] will yield:
Item Price Rolling
Date
2018-01-01 Apple 1.000 NaN
2018-01-02 Apple 0.980 NaN
2018-01-03 Apple 0.880 0.953333
2018-01-04 Apple 1.200 1.020000
2018-01-05 Apple 1.300 1.126667
2018-01-06 Apple 1.500 1.333333
2018-01-07 Apple 1.050 1.283333
2018-01-08 Apple 1.025 1.191667
2018-05-21 Apple 7.000 3.025000
2018-05-22 Apple 6.000 4.675000
2018-05-23 Apple 5.000 6.000000
2018-05-24 Apple 6.000 5.666667
2018-05-25 Apple 5.000 5.333333
The answer is slightly different if you want to restrict to certain date groupings.
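For the general case (all items at once, not just Apple), a groupby sketch of the same idea, using hypothetical sample rows with the column names above:

```python
import pandas as pd

# Hypothetical excerpt of the price table.
df2 = pd.DataFrame({
    "Date": ["1/1/2018", "1/2/2018", "1/3/2018",
             "2/12/2018", "2/13/2018", "2/14/2018"],
    "Item": ["Apple", "Apple", "Apple", "Grapes", "Grapes", "Grapes"],
    "Price": [1.0, 0.98, 0.88, 5.00, 0.40, 1.00],
})
df2["Date"] = pd.to_datetime(df2["Date"])

# Trailing 3-day average computed per item in one pass.
df2 = df2.sort_values(["Item", "Date"])
df2["Rolling"] = df2.groupby("Item")["Price"].transform(
    lambda s: s.rolling(3).mean()
)

# Look up each purchase against the averaged prices.
purchases = pd.DataFrame({
    "Date": pd.to_datetime(["1/3/2018", "2/14/2018"]),
    "Item": ["Apple", "Grapes"],
})
out = purchases.merge(df2, on=["Date", "Item"], how="left")
print(out)
```

The merge on (`Date`, `Item`) replaces the per-item filter, so no explicit loop over rows is needed.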