
Spark window function per time

I have a dataframe with the following structure:

ID| Page    |   User          |    Timestamp      |
|1|Page 1   |Ericd            |2002-09-07 19:39:55|
|1|Page 1   |Liir             |2002-10-12 03:01:42|
|1|Page 1   |Tubby            |2002-10-12 03:02:23|
|1|Page 1   |Mojo             |2002-10-12 03:18:24|
|1|Page 1   |Kirf             |2002-10-12 03:19:03|
|2|Page 2   |The Epopt        |2001-11-28 22:27:37|
|2|Page 2   |Conversion script|2002-02-03 01:49:16|
|2|Page 2   |Bryan Derksen    |2002-02-25 16:51:15|
|2|Page 2   |Gear             |2002-10-04 12:46:06|
|2|Page 2   |Tim Starling     |2002-10-06 08:13:42|
|2|Page 2   |Tim Starling     |2002-10-07 03:00:54|
|2|Page 2   |Salsa Shark      |2003-03-18 01:45:32|

and I want to find the number of users who visited a page within a given period, for example per month. For instance, for October 2002 the result would be

|1|Page 1   |Liir             |2002-10-12 03:01:42| 
|1|Page 1   |Tubby            |2002-10-12 03:02:23|
|1|Page 1   |Mojo             |2002-10-12 03:18:24|
|1|Page 1   |Kirf             |2002-10-12 03:19:03|
|2|Page 2   |Gear             |2002-10-04 12:46:06|
|2|Page 2   |Tim Starling     |2002-10-06 08:13:42|
|2|Page 2   |Tim Starling     |2002-10-07 03:00:54|

and the counts per page:

              numberOfUsers (in October 2002)
|1|Page 1   |      4
|2|Page 2   |      3 

The question is also how to apply this logic to every month of every year. I have figured out how to find events for, say, the last n days:

from pyspark.sql import Window
from pyspark.sql.functions import col
import pyspark.sql.functions as func

days = lambda i: i * 86400  # days expressed in seconds
# range-based window over the last 30 days, per page
window = (Window.partitionBy(col("page"))
          .orderBy(col("timestamp").cast("timestamp").cast("long"))
          .rangeBetween(-days(30), 0))

df = df.withColumn("monthly_occurrences", func.count("user").over(window))
df.show()

I would appreciate any suggestions.

You can first create a column containing the year-month combination and then use that column for the grouping. A working example:

import pyspark.sql.functions as F

# sample data; assumes a SparkContext `sc` is in scope (e.g. the pyspark shell)
df = sc.parallelize([
    ('2018-06-02T00:00:00.000Z','tim', 'page 1' ),
    ('2018-07-20T00:00:00.000Z','tim', 'page 1' ),
    ('2018-07-20T00:00:00.000Z','john', 'page 2' ),
    ('2018-07-20T00:00:00.000Z','john', 'page 2' ),
    ('2018-08-20T00:00:00.000Z','john', 'page 2' )
]).toDF(("datetime","user","page"))

# build a year-month key, then count the page views per month
df = df.withColumn('yearmonth', F.concat(F.year('datetime'), F.lit('-'), F.month('datetime')))
df_agg = df.groupBy('yearmonth','page').count()
df_agg.show()

Output:

+---------+------+-----+
|yearmonth|  page|count|
+---------+------+-----+
|   2018-7|page 2|    2|
|   2018-6|page 1|    1|
|   2018-7|page 1|    1|
|   2018-8|page 2|    1|
+---------+------+-----+
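
If you need a zero-padded, sortable key (e.g. 2018-07 instead of 2018-7) or the number of distinct users rather than a raw row count, a small variant is possible. This is only a sketch, assuming the same df and F import as above; df_agg2 and numberOfUsers are just illustrative names:

# Sketch (assumes the same `df` and `F` as above): date_format gives a
# zero-padded 'yyyy-MM' key, and countDistinct counts each user only once.
df_agg2 = df \
    .withColumn('yearmonth', F.date_format('datetime', 'yyyy-MM')) \
    .groupBy('yearmonth', 'page') \
    .agg(F.countDistinct('user').alias('numberOfUsers'))
df_agg2.orderBy('yearmonth', 'page').show()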

Hope this helps!

If you want dynamic periods, first convert the date to a timestamp, subtract each timestamp from today's timestamp, and then integer-divide by the length (in seconds) of the interval you want to group by. The code below groups the rows into 5-day intervals.

import pyspark.sql.functions as F
from datetime import datetime

# today's timestamp (seconds since the epoch)
Today = datetime.today().timestamp()
# number of seconds in one day
DAY_TIMESTAMPS = 24 * 60 * 60

df = sc.parallelize([
    ('2017-06-02 00:00:00','tim', 'page 1' ),
    ('2017-07-20 00:00:00','tim', 'page 1' ),
    ('2017-07-21 00:00:00','john', 'page 2' ),
    ('2017-07-22 00:00:00','john', 'page 2' ),
    ('2017-08-23 00:00:00','john', 'page 2' )
]).toDF(("datetime","user","page" ))

# group by five days
timeInterval = 5* DAY_TIMESTAMPS

df \
    .withColumn('timestamp', F.unix_timestamp(F.to_date('datetime', 'yyyy-MM-dd HH:mm:ss'))) \
    .withColumn('timeIntervalBefore', ((Today-F.col('timestamp'))/(timeInterval)).cast('integer')) \
    .groupBy('timeIntervalBefore', 'page') \
    .agg(F.count('user').alias('number of users')).show()

Result:

+------------------+------+---------------+
|timeIntervalBefore|  page|number of users|
+------------------+------+---------------+
|                70|page 2|              2|
|                80|page 1|              1|
|                70|page 1|              1|
|                64|page 2|              1|
+------------------+------+---------------+
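
Spark also ships a built-in grouping function for fixed-size time buckets. As a sketch (assuming the same df as above and Spark 2.2+ for to_timestamp; ts and period are illustrative names), the 5-day grouping could also be written with F.window, with the caveat that its buckets are aligned to the epoch rather than counted back from today:

# Sketch: F.window buckets a proper timestamp column into fixed 5-day
# tumbling windows and exposes each bucket as a struct with start/end.
df \
    .withColumn('ts', F.to_timestamp('datetime', 'yyyy-MM-dd HH:mm:ss')) \
    .groupBy(F.window('ts', '5 days').alias('period'), 'page') \
    .agg(F.count('user').alias('number_of_users')) \
    .select('period.start', 'period.end', 'page', 'number_of_users') \
    .show(truncate=False)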

If you need the approximate dates of each period:

df \
    .withColumn('timestamp', F.unix_timestamp(F.to_date('datetime', 'yyyy-MM-dd HH:mm:ss'))) \
    .withColumn('timeIntervalBefore', ((Today-F.col('timestamp'))/(timeInterval)).cast('integer')) \
    .groupBy('timeIntervalBefore', 'page') \
    .agg(
        F.count('user').alias('number_of_users'), 
        F.min('timestamp').alias('FirstDay'), 
        F.max('timestamp').alias('LastDay')) \
    .select(
        'page', 
        'number_of_users', 
        F.from_unixtime('firstday').alias('firstDay'), 
        F.from_unixtime('lastday').alias('lastDay')).show()

Result:

+------+---------------+-------------------+-------------------+
|  page|number_of_users|           firstDay|            lastDay|
+------+---------------+-------------------+-------------------+
|page 2|              2|2017-07-21 00:00:00|2017-07-22 00:00:00|
|page 1|              1|2017-06-02 00:00:00|2017-06-02 00:00:00|
|page 1|              1|2017-07-20 00:00:00|2017-07-20 00:00:00|
|page 2|              1|2017-08-23 00:00:00|2017-08-23 00:00:00|
+------+---------------+-------------------+-------------------+
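
To come back to the window-function attempt in the question: the per-calendar-month version can also be expressed by partitioning the window on both the page and a derived year-month key, so every row keeps its monthly count. This is only a sketch, assuming a dataframe df with the question's columns page, user and timestamp; df2, w and the new column names are illustrative:

from pyspark.sql import Window
import pyspark.sql.functions as F

# Sketch: derive a 'yyyy-MM' key from the timestamp string.
df2 = df.withColumn('yearmonth', F.date_format('timestamp', 'yyyy-MM'))

w = Window.partitionBy('page', 'yearmonth')
# Row count per page and month (matches the expected counts above).
df2 = df2.withColumn('monthly_occurrences', F.count('user').over(w))
# countDistinct is not supported over a window; collect_set + size gives
# the number of distinct users per page and month instead, if needed.
df2 = df2.withColumn('monthly_users', F.size(F.collect_set('user').over(w)))
df2.show()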
