
Spark SQL Partition By, Window, Order By, Count

Say I have a dataframe containing magazine subscription information:

subscription_id    user_id       created_at       expiration_date
 12384               1           2018-08-10        2018-12-10
 83294               1           2018-06-03        2018-10-03
 98234               1           2018-04-08        2018-08-08
 24903               2           2018-05-08        2018-07-08
 32843               2           2018-03-25        2018-05-25
 09283               2           2018-04-07        2018-06-07

Now I want to add a column that states how many previous subscriptions a user had that expired before this current subscription began. In other words, how many expiration dates associated with a given user were before this subscription's start date. Here is the full desired output:

subscription_id    user_id       created_at       expiration_date   previous_expired
 12384               1           2018-08-10        2018-12-10          1
 83294               1           2018-06-03        2018-10-03          0
 98234               1           2018-04-08        2018-08-08          0
 24903               2           2018-05-08        2018-07-08          2
 32843               2           2018-03-25        2018-05-03          1
 09283               2           2018-01-25        2018-02-25          0

Attempts:

EDIT: Tried a variety of lag/lead/etc using Python and I'm now thinking this is a SQL problem.

from pyspark.sql import functions as func
from pyspark.sql.window import Window

# lag only brings back the single previous row's expiration_date per user,
# not a count of all earlier expirations, which is why this didn't work
df = df.withColumn('shiftlag', func.lag(df.expiration_date).over(Window.partitionBy('user_id').orderBy('created_at')))

EDIT, EDIT: Never mind, this doesn't work.

I think I've exhausted the lag/lead/shift approach and found it doesn't work. I'm now thinking it would be best to do this using Spark SQL, perhaps with a case when to produce the new column, combined with a having count, grouped by ID?
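
A minimal sketch of that SQL route (untested; it assumes an active SparkSession named spark and a temp view name, subs, that I just made up) would actually be a self-join with a conditional count rather than a having clause:

df.createOrReplaceTempView("subs")

previous_expired = spark.sql("""
    SELECT a.subscription_id,
           a.user_id,
           a.created_at,
           a.expiration_date,
           COUNT(b.subscription_id) AS previous_expired  -- counts only matched rows, so 0 when none
    FROM subs a
    LEFT JOIN subs b
      ON a.user_id = b.user_id
     AND b.expiration_date < a.created_at
    GROUP BY a.subscription_id, a.user_id, a.created_at, a.expiration_date
""")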

Figured it out using PySpark:

I first created another column with an array of all expiration dates for each user:

from pyspark.sql.functions import collect_set

joined_array = df.groupBy('user_id').agg(collect_set('expiration_date'))

Then joined that array back to the original dataframe:

joined_array = joined_array.toDF('user_idDROP', 'expiration_date_array')
df = df.join(joined_array, df.user_id == joined_array.user_idDROP, how = 'left').drop('user_idDROP')
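
A shorter variant of those two steps (just a sketch, not what I actually ran) is to alias the aggregated column and join on the shared user_id column name, which avoids the toDF rename and the drop:

from pyspark.sql.functions import collect_set

# Alias the aggregate up front and join on the column name,
# so Spark keeps a single user_id column and nothing needs to be dropped
joined_array = df.groupBy('user_id').agg(collect_set('expiration_date').alias('expiration_date_array'))
df = df.join(joined_array, on='user_id', how='left')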

Then created a function to iterate through the array and add 1 to the count if the created date is greater than the expiration date:

def check_expiration_count(created_at, expiration_array):
    # Count how many expiration dates in the array fall before this row's created_at
    if not expiration_array:
        return 0
    count = 0
    for i in expiration_array:
        if created_at > i:
            count += 1
    return count
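
As a quick sanity check in plain Python (assuming ISO-formatted date strings, so string comparison matches chronological order), the first sample row for user 1 comes out as expected:

# Subscription 12384 (created 2018-08-10): only the 2018-08-08 expiration precedes it
print(check_expiration_count('2018-08-10', ['2018-12-10', '2018-10-03', '2018-08-08']))  # 1

Then registered it as a UDF: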

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

check_expiration_count = udf(check_expiration_count, IntegerType())

Then applied that function to create a new column with the correct count:

df = df.withColumn('count_of_subs_ending_before_creation', check_expiration_count(df.created_at, df.expiration_date_array))
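
For reference, if you are on Spark 2.4 or later, the same count can be computed without a Python UDF by filtering the array with a SQL higher-order function and taking its size. This is only a sketch (df_alt is a throwaway name) using the expiration_date_array column from the join above:

from pyspark.sql.functions import expr

# Keep only the expirations strictly before this row's created_at, then count them.
# The array comes from aggregating the same dataframe, so it is never null here.
df_alt = df.withColumn(
    'count_of_subs_ending_before_creation',
    expr("size(filter(expiration_date_array, x -> x < created_at))")
)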

Voila. Done. Thanks everyone (nobody helped but thanks anyway). Hope someone finds this useful in 2022.
