Spark SQL Partition By, Window, Order By, Count
Say I have a dataframe containing magazine subscription information:
subscription_id user_id created_at expiration_date
12384 1 2018-08-10 2018-12-10
83294 1 2018-06-03 2018-10-03
98234 1 2018-04-08 2018-08-08
24903 2 2018-05-08 2018-07-08
32843 2 2018-03-25 2018-05-03
09283 2 2018-01-25 2018-02-25
Now I want to add a column that states how many previous subscriptions a user had that expired before this current subscription began. In other words, how many expiration dates associated with a given user fall before this subscription's start date. Here is the full desired output:
subscription_id user_id created_at expiration_date previous_expired
12384 1 2018-08-10 2018-12-10 1
83294 1 2018-06-03 2018-10-03 0
98234 1 2018-04-08 2018-08-08 0
24903 2 2018-05-08 2018-07-08 2
32843 2 2018-03-25 2018-05-03 1
09283 2 2018-01-25 2018-02-25 0
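For reference, the previous_expired column above can be reproduced with a small plain-Python sketch (dates taken from the sample tables; this is only an illustration of the counting rule, not the Spark solution):

```python
from datetime import date

# Sample rows from the tables above:
# (subscription_id, user_id, created_at, expiration_date)
rows = [
    ("12384", 1, date(2018, 8, 10), date(2018, 12, 10)),
    ("83294", 1, date(2018, 6, 3),  date(2018, 10, 3)),
    ("98234", 1, date(2018, 4, 8),  date(2018, 8, 8)),
    ("24903", 2, date(2018, 5, 8),  date(2018, 7, 8)),
    ("32843", 2, date(2018, 3, 25), date(2018, 5, 3)),
    ("09283", 2, date(2018, 1, 25), date(2018, 2, 25)),
]

# For each row, count the same user's subscriptions whose expiration
# date falls strictly before this row's created_at.
previous_expired = [
    sum(1 for (_, u2, _, exp) in rows if u2 == u and exp < created)
    for (_, u, created, _) in rows
]
print(previous_expired)  # [1, 0, 0, 2, 1, 0]
```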
Attempts:
EDIT: Tried a variety of lag/lead/etc. using Python, and I'm now thinking this is a SQL problem.
df = df.withColumn('shiftlag', func.lag(df.expiration_date).over(Window.partitionBy('user_id').orderBy('created_at')))
<--- EDIT, EDIT: Never mind, this doesn't work <---
I think I exhausted the lag/lead/shift method and found it doesn't work. I'm now thinking it would be best to do this using Spark SQL, perhaps with a case when to produce the new column, combined with a having count, grouped by ID?
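For what it's worth, this count can be expressed in SQL more directly as a correlated subquery than with case when/having. Below is a hedged sketch using Python's built-in sqlite3 as a stand-in engine; the same query shape should also work via spark.sql against a registered temp view (the table name subscriptions is an assumption for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE subscriptions (
    subscription_id TEXT, user_id INTEGER,
    created_at TEXT, expiration_date TEXT)""")
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?, ?, ?)",
    [("12384", 1, "2018-08-10", "2018-12-10"),
     ("83294", 1, "2018-06-03", "2018-10-03"),
     ("98234", 1, "2018-04-08", "2018-08-08"),
     ("24903", 2, "2018-05-08", "2018-07-08"),
     ("32843", 2, "2018-03-25", "2018-05-03"),
     ("09283", 2, "2018-01-25", "2018-02-25")],
)

# For each subscription, count the same user's rows that expired before
# this subscription was created. ISO-8601 date strings compare correctly
# as plain strings, so no date parsing is needed here.
query = """
SELECT s.subscription_id,
       (SELECT COUNT(*)
          FROM subscriptions p
         WHERE p.user_id = s.user_id
           AND p.expiration_date < s.created_at) AS previous_expired
  FROM subscriptions s
"""
result = dict(conn.execute(query).fetchall())
print(result["24903"])  # 2
```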
Figured it out using PySpark:
I first created another column with an array of all expiration dates for each user:
from pyspark.sql.functions import collect_set

joined_array = df.groupBy('user_id').agg(collect_set('expiration_date'))
Then joined that array back to the original dataframe:
joined_array = joined_array.toDF('user_idDROP', 'expiration_date_array')
df = df.join(joined_array, df.user_id == joined_array.user_idDROP, how = 'left').drop('user_idDROP')
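Conceptually, the groupBy/collect_set plus left join above builds a per-user set of expiration dates and attaches it to every row. In plain Python (hypothetical in-memory rows, for illustration only), the same shape looks like this:

```python
from collections import defaultdict

# Hypothetical in-memory rows:
# (subscription_id, user_id, created_at, expiration_date)
rows = [
    ("12384", 1, "2018-08-10", "2018-12-10"),
    ("83294", 1, "2018-06-03", "2018-10-03"),
    ("24903", 2, "2018-05-08", "2018-07-08"),
]

# Rough equivalent of groupBy('user_id').agg(collect_set('expiration_date')):
exp_by_user = defaultdict(set)
for _, user_id, _, exp in rows:
    exp_by_user[user_id].add(exp)

# Rough equivalent of the left join back onto the original rows
# (sorted only to make the output deterministic):
joined = [row + (sorted(exp_by_user[row[1]]),) for row in rows]
print(joined[0][-1])  # ['2018-10-03', '2018-12-10']
```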
Then created a function to iterate through the array, adding 1 to the count whenever the created date is greater than an expiration date:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def check_expiration_count(created_at, expiration_array):
    if not expiration_array:
        return 0
    else:
        count = 0
        for i in expiration_array:
            if created_at > i:
                count += 1
        return count

check_expiration_count = udf(check_expiration_count, IntegerType())
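Since the function body is plain Python, it can be sanity-checked locally before being registered as a UDF. ISO-8601 date strings compare lexicographically in the same order as dates, so string comparison is enough here (sample values below are taken from the tables above):

```python
# Local copy of the logic above, testable without a Spark session.
def check_expiration_count(created_at, expiration_array):
    if not expiration_array:
        return 0
    count = 0
    for i in expiration_array:
        if created_at > i:
            count += 1
    return count

# User 2's subscription created 2018-05-08, checked against all of that
# user's expiration dates: two of them end before the creation date.
print(check_expiration_count("2018-05-08",
                             ["2018-07-08", "2018-05-03", "2018-02-25"]))  # 2

# A user whose collected array is null/empty counts as zero:
print(check_expiration_count("2018-05-08", None))  # 0
```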
Then applied that function to create a new column with the correct count:
df = df.withColumn('count_of_subs_ending_before_creation', check_expiration_count(df.created_at, df.expiration_date_array))
Voilà. Done.
Thanks everyone (nobody helped but thanks anyway). Hope someone finds this useful in 2022.