
SQL / Pyspark - Add new column based on a dynamic timestamp and another column

I have this data:

id, name, timestamp
1, David, 2022/01/01 10:00
2, David, 2022/01/01 10:30
3, Diego, 2022/01/01 10:59
4, David, 2022/01/01 10:59
5, David, 2022/01/01 11:01
6, Diego, 2022/01/01 12:00
7, David, 2022/01/01 12:00
8, David, 2022/01/01 12:05
9, Diego, 2022/01/01 12:30

Basically David and Diego are playing a game. They smash a button from time to time at those timestamps.

The game can continue for one hour after they push the button for the first time. After that the window resets, and if they push the button again it counts as starting to play again.

So I want to tag a row with 0 (start) when it is the first button press of a one-hour period, and with 1 (playing) when it falls inside that one-hour period.

So in my case I would expect this result:

id, name, timestamp, status
1, David, 2022/01/01 10:00, 0  <--- David starts playing
2, David, 2022/01/01 10:30, 1  <--- David keeps playing the game he started at id 1
3, Diego, 2022/01/01 10:59, 0  <--- Diego starts playing
4, David, 2022/01/01 10:59, 1  <--- David keeps playing the game he started at id 1
5, David, 2022/01/01 11:01, 0  <--- David starts playing again
6, Diego, 2022/01/01 12:00, 0  <--- Diego starts playing again
7, David, 2022/01/01 12:00, 1  <--- David keeps playing the game he started at id 5
8, David, 2022/01/01 12:05, 0  <--- David starts playing again
9, Diego, 2022/01/01 12:30, 1  <--- Diego keeps playing the game he started at id 6
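
To make the one-hour rule concrete, this is how I would compute the status column by hand for David in plain Python (tag_sessions is just a helper name I made up to illustrate the reset logic, it is not part of my Spark job):

from datetime import datetime, timedelta

def tag_sessions(timestamps):
    """Tag each press 0 (start) or 1 (playing); the one-hour window is
    anchored at the most recent press tagged as a start."""
    statuses = []
    session_start = None
    for ts in sorted(timestamps):
        # assuming a press exactly one hour after the start already counts as a new game
        if session_start is None or ts - session_start >= timedelta(hours=1):
            session_start = ts  # the window resets here
            statuses.append(0)
        else:
            statuses.append(1)
    return statuses

david = [datetime(2022, 1, 1, 10, 0), datetime(2022, 1, 1, 10, 30),
         datetime(2022, 1, 1, 10, 59), datetime(2022, 1, 1, 11, 1),
         datetime(2022, 1, 1, 12, 0), datetime(2022, 1, 1, 12, 5)]
print(tag_sessions(david))  # [0, 1, 1, 0, 1, 0] -> matches ids 1, 2, 4, 5, 7, 8 above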

I would need to do that transformation in pyspark, just to tag what I consider as "start playing" and "keep playing".

Maybe if you can help me with a SQL query I can adapt it later to pyspark.

It doesn't need to be done in only one query / step.

Hope you can help me.

This is not a complete solution, but to get some ideas I have tried this:

from datetime import datetime

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([StructField('id', StringType(), True),
                     StructField('name', StringType(), True),
                     StructField('timestamp', TimestampType(), True)])

df = spark.createDataFrame(
    [
        ("1", "David", datetime.strptime("2022/01/01 10:00", '%Y/%m/%d %H:%M')),
        ("2", "David", datetime.strptime("2022/01/01 10:30", '%Y/%m/%d %H:%M')),
        ("3", "Diego", datetime.strptime("2022/01/01 10:59", '%Y/%m/%d %H:%M')),
        ("4", "David", datetime.strptime("2022/01/01 10:59", '%Y/%m/%d %H:%M')),
        ("5", "David", datetime.strptime("2022/01/01 11:01", '%Y/%m/%d %H:%M')),
        ("6", "Diego", datetime.strptime("2022/01/01 12:00", '%Y/%m/%d %H:%M')),
        ("7", "David", datetime.strptime("2022/01/01 12:00", '%Y/%m/%d %H:%M')),
        ("8", "David", datetime.strptime("2022/01/01 12:05", '%Y/%m/%d %H:%M')),
        ("9", "Diego", datetime.strptime("2022/01/01 12:30", '%Y/%m/%d %H:%M')),
    ],
    schema=schema)
df.createOrReplaceTempView("people")

# Attempt: rank the presses within each clock hour and use the parity of the
# rank; this partitions by clock hour, not by one hour after the first press.
df3 = spark.sql("""
    select *,
           dense_rank() over (partition by hour(timestamp) order by name, timestamp) % 2 as t4,
           case when dense_rank() over (partition by hour(timestamp) order by name, timestamp) % 2 > 0
                then dense_rank() over (partition by hour(timestamp) order by name, timestamp) % 2 - 1
                else dense_rank() over (partition by hour(timestamp) order by name, timestamp) % 2 + 1
           end as t3
    from people
    order by timestamp, name
""")
df3.show()
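
Another direction I have been thinking about (not sure it is correct or idiomatic) is to process each player's presses in timestamp order with a grouped pandas UDF that carries the start of the current one-hour window forward. tag_group and the output schema string below are just placeholders I made up; this reuses the df created above:

import pandas as pd

def tag_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one player; walk them in time order and
    # remember when the current one-hour window started
    pdf = pdf.sort_values("timestamp")
    statuses = []
    session_start = None
    for ts in pdf["timestamp"]:
        if session_start is None or ts - session_start >= pd.Timedelta(hours=1):
            session_start = ts  # window resets: this press is a new start
            statuses.append(0)
        else:
            statuses.append(1)
    pdf["status"] = statuses
    return pdf

result = df.groupBy("name").applyInPandas(
    tag_group,
    schema="id string, name string, timestamp timestamp, status int",
)
result.orderBy("timestamp", "name").show()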
