Spark SQL Partition By, Window, Order By, Count
Say I have a dataframe containing magazine subscription information:
subscription_id user_id created_at expiration_date
12384 1 2018-08-10 2018-12-10
83294 1 2018-06-03 2018-10-03
98234 1 2018-04-08 2018-08-08
24903 2 2018-05-08 2018-07-08
32843 2 2018-03-25 2018-05-03
09283 2 2018-01-25 2018-02-25
Now I want to add a column that states how many previous subscriptions a user had that expired before this current subscription began. In other words, how many expiration dates associated with a given user fall before this subscription's start date. Here is the full desired output:
subscription_id user_id created_at expiration_date previous_expired
12384 1 2018-08-10 2018-12-10 1
83294 1 2018-06-03 2018-10-03 0
98234 1 2018-04-08 2018-08-08 0
24903 2 2018-05-08 2018-07-08 2
32843 2 2018-03-25 2018-05-03 1
09283 2 2018-01-25 2018-02-25 0
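For reference, the previous_expired column above can be reproduced with a small plain-Python sketch (dates taken from the sample tables; this is only an illustration of the counting rule, not the Spark solution):

```python
from datetime import date

# Sample rows from the tables above:
# (subscription_id, user_id, created_at, expiration_date)
rows = [
    ("12384", 1, date(2018, 8, 10), date(2018, 12, 10)),
    ("83294", 1, date(2018, 6, 3),  date(2018, 10, 3)),
    ("98234", 1, date(2018, 4, 8),  date(2018, 8, 8)),
    ("24903", 2, date(2018, 5, 8),  date(2018, 7, 8)),
    ("32843", 2, date(2018, 3, 25), date(2018, 5, 3)),
    ("09283", 2, date(2018, 1, 25), date(2018, 2, 25)),
]

# For each row, count the same user's subscriptions whose expiration
# date falls strictly before this row's created_at.
previous_expired = [
    sum(1 for (_, u2, _, exp) in rows if u2 == u and exp < created)
    for (_, u, created, _) in rows
]
print(previous_expired)  # [1, 0, 0, 2, 1, 0]
```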
Attempts:
EDIT: Tried a variety of lag/lead/etc. using Python, and I'm now thinking this is a SQL problem.
df = df.withColumn('shiftlag', func.lag(df.expiration_date).over(Window.partitionBy('user_id').orderBy('created_at')))
<--- EDIT, EDIT: Never mind, this doesn't work <---
I think I exhausted the lag/lead/shift method and found it doesn't work. I'm now thinking it would be best to do this using Spark SQL, perhaps with a case when to produce the new column, combined with a having count, grouped by ID?
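For what it's worth, this count can be expressed in SQL more directly as a correlated subquery than with case when/having. Below is a hedged sketch using Python's built-in sqlite3 as a stand-in engine; the same query shape should also work via spark.sql against a registered temp view (the table name subscriptions is an assumption for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE subscriptions (
    subscription_id TEXT, user_id INTEGER,
    created_at TEXT, expiration_date TEXT)""")
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?, ?, ?)",
    [("12384", 1, "2018-08-10", "2018-12-10"),
     ("83294", 1, "2018-06-03", "2018-10-03"),
     ("98234", 1, "2018-04-08", "2018-08-08"),
     ("24903", 2, "2018-05-08", "2018-07-08"),
     ("32843", 2, "2018-03-25", "2018-05-03"),
     ("09283", 2, "2018-01-25", "2018-02-25")],
)

# For each subscription, count the same user's rows that expired before
# this subscription was created. ISO-8601 date strings compare correctly
# as plain strings, so no date parsing is needed here.
query = """
SELECT s.subscription_id,
       (SELECT COUNT(*)
          FROM subscriptions p
         WHERE p.user_id = s.user_id
           AND p.expiration_date < s.created_at) AS previous_expired
  FROM subscriptions s
"""
result = dict(conn.execute(query).fetchall())
print(result["24903"])  # 2
```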
Figured it out using PySpark:
I first created another column with an array of all expiration dates for each user:
from pyspark.sql.functions import collect_set

joined_array = df.groupBy('user_id').agg(collect_set('expiration_date'))
Then joined that array back to the original dataframe:
joined_array = joined_array.toDF('user_idDROP', 'expiration_date_array')
df = df.join(joined_array, df.user_id == joined_array.user_idDROP, how = 'left').drop('user_idDROP')
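Conceptually, the groupBy/collect_set plus left join above builds a per-user set of expiration dates and attaches it to every row. In plain Python (hypothetical in-memory rows, for illustration only), the same shape looks like this:

```python
from collections import defaultdict

# Hypothetical in-memory rows:
# (subscription_id, user_id, created_at, expiration_date)
rows = [
    ("12384", 1, "2018-08-10", "2018-12-10"),
    ("83294", 1, "2018-06-03", "2018-10-03"),
    ("24903", 2, "2018-05-08", "2018-07-08"),
]

# Rough equivalent of groupBy('user_id').agg(collect_set('expiration_date')):
exp_by_user = defaultdict(set)
for _, user_id, _, exp in rows:
    exp_by_user[user_id].add(exp)

# Rough equivalent of the left join back onto the original rows
# (sorted only to make the output deterministic):
joined = [row + (sorted(exp_by_user[row[1]]),) for row in rows]
print(joined[0][-1])  # ['2018-10-03', '2018-12-10']
```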
Then created a function to iterate through the array, adding 1 to the count whenever the created date is greater than an expiration date:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def check_expiration_count(created_at, expiration_array):
    if not expiration_array:
        return 0
    else:
        count = 0
        for i in expiration_array:
            if created_at > i:
                count += 1
        return count

check_expiration_count = udf(check_expiration_count, IntegerType())
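Since the function body is plain Python, it can be sanity-checked locally before being registered as a UDF. ISO-8601 date strings compare lexicographically in the same order as dates, so string comparison is enough here (sample values below are taken from the tables above):

```python
# Local copy of the logic above, testable without a Spark session.
def check_expiration_count(created_at, expiration_array):
    if not expiration_array:
        return 0
    count = 0
    for i in expiration_array:
        if created_at > i:
            count += 1
    return count

# User 2's subscription created 2018-05-08, checked against all of that
# user's expiration dates: two of them end before the creation date.
print(check_expiration_count("2018-05-08",
                             ["2018-07-08", "2018-05-03", "2018-02-25"]))  # 2

# A user whose collected array is null/empty counts as zero:
print(check_expiration_count("2018-05-08", None))  # 0
```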
Then applied that function to create a new column with the correct count:
df = df.withColumn('count_of_subs_ending_before_creation', check_expiration_count(df.created_at, df.expiration_date_array))
Voilà. Done.
Thanks everyone (nobody helped but thanks anyway). Hope someone finds this useful in 2022.