Filter in a spark window by comparing a single row element with all rows of the window

Suppose you have a dataframe as follows:

+---+----------+----------+
| id|    date_a|    date_b|
+---+----------+----------+
|  1|2020-01-30|2020-01-19|
|  1|2020-01-10|2020-01-19|
|  1|2020-01-10|2020-01-26|
|  1|2020-01-30|2020-01-26|
|  2|2020-01-05|2020-01-08|
|  3|2020-01-08|2020-01-10|
|  3|2020-01-12|2020-01-10|
+---+----------+----------+

For each id, there are date_a and date_b values, in various combinations.


I'd like to keep only the entries where, for a single id, date_b lies outside a fixed time range around every date_a of that id.

A visual interpretation for id = 1 looks like (horizontal is time axis):

|---x---| o |-o--x---|

where x = date_a, o = date_b, and |--- ---| indicates the time range (i.e. +- 5 days).
Thus, the "o" (date_b) entries to keep are those that fall within none of the date_a time ranges (here, the first "o").


Example input/output:

Input:

from pyspark.sql import functions as F

# `spark` is an existing SparkSession (e.g. the one provided by the pyspark shell)
df = spark.createDataFrame(
    [(1, '2020-01-10', '2020-01-19'),
     (1, '2020-01-10', '2020-01-26'),
     (1, '2020-01-30', '2020-01-19'),
     (1, '2020-01-30', '2020-01-26'),
     (2, '2020-01-05', '2020-01-08'),
     (3, '2020-01-08', '2020-01-10'),
     (3, '2020-01-12', '2020-01-10')],
    ['id', 'date_a', 'date_b']
)

# Parse the strings as dates, then compute date_b - date_a in days
df = df.withColumn('date_a', F.to_date('date_a'))
df = df.withColumn('date_b', F.to_date('date_b'))
df = df.withColumn('diff', F.datediff(df.date_b, df.date_a))
df.orderBy('id', 'date_b').show()

+---+----------+----------+----+
| id|    date_a|    date_b|diff|
+---+----------+----------+----+
|  1|2020-01-30|2020-01-19| -11|
|  1|2020-01-10|2020-01-19|   9|
|  1|2020-01-30|2020-01-26|  -4|
|  1|2020-01-10|2020-01-26|  16|
|  2|2020-01-05|2020-01-08|   3|
|  3|2020-01-08|2020-01-10|   2|
|  3|2020-01-12|2020-01-10|  -2|
+---+----------+----------+----+

Within the same id, we want to keep the date_b's for which the diff is > 5 or < -6 in all rows sharing that date_b (i.e. date_b is outside the interval [date_a - 6, date_a + 5] for every date_a).
I.e.:
For id=1, date_b='2020-01-19': (-11 > 5 | -11 < -6) & (9 > 5 | 9 < -6) -> entry is kept (True & True)
For id=1, date_b='2020-01-26': (-4 > 5 | -4 < -6) & (16 > 5 | 16 < -6) -> entry is discarded (False & True)
...
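
To make the rule concrete, here is a minimal plain-Python sketch of the same check, without Spark. The function name `keep_rows` and its `lower`/`upper` parameters are made up for illustration; the rows and the thresholds (-6 and 5) are the ones from the example above.

```python
from collections import defaultdict
from datetime import date

# (id, date_a, date_b) rows copied from the example input
rows = [
    (1, date(2020, 1, 10), date(2020, 1, 19)),
    (1, date(2020, 1, 10), date(2020, 1, 26)),
    (1, date(2020, 1, 30), date(2020, 1, 19)),
    (1, date(2020, 1, 30), date(2020, 1, 26)),
    (2, date(2020, 1, 5), date(2020, 1, 8)),
    (3, date(2020, 1, 8), date(2020, 1, 10)),
    (3, date(2020, 1, 12), date(2020, 1, 10)),
]


def keep_rows(rows, lower=-6, upper=5):
    """Keep rows whose (id, date_b) group has every diff outside [lower, upper].

    Hypothetical helper for illustration only.
    """
    # Collect diff = date_b - date_a (in days) for each (id, date_b) group
    diffs = defaultdict(list)
    for id_, date_a, date_b in rows:
        diffs[(id_, date_b)].append((date_b - date_a).days)

    # A row survives only if *all* diffs of its group are outside the range
    return [
        row for row in rows
        if all(d > upper or d < lower for d in diffs[(row[0], row[2])])
    ]


for row in keep_rows(rows):
    print(row)
```

Only the two rows with id=1 and date_b=2020-01-19 survive, since their diffs (9 and -11) are both outside the range.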

Expected output:

+---+----------+----------+
| id|    date_a|    date_b|
+---+----------+----------+
|  1|2020-01-10|2020-01-19|
|  1|2020-01-30|2020-01-19|
+---+----------+----------+

Here is a possible approach you can try (comments inline):

from pyspark.sql import Window
from pyspark.sql import functions as F

# One window per (id, date_b) group. "id" is constant within a partition, so all
# rows are peers and the aggregates below cover the whole group.
w = Window.partitionBy("id", "date_b").orderBy("id")
cond = (F.col("diff") > 5) | (F.col("diff") < -6)

# Count the rows in the window for which the condition is true
sum_of_true_on_w = F.sum(cond.cast("int")).over(w)

# Get the window size to compare with the sum; there might be a better way to get the size
size_of_window = F.max(F.row_number().over(w)).over(w)

# Keep only the groups where every row satisfies the condition (sum == window size)
(df.withColumn("Sum_bool", sum_of_true_on_w)
   .withColumn("Window_Size", size_of_window)
   .filter(F.col("Sum_bool") == F.col("Window_Size"))
   .drop("diff", "Sum_bool", "Window_Size")).show()

+---+----------+----------+
| id|    date_a|    date_b|
+---+----------+----------+
|  1|2020-01-10|2020-01-19|
|  1|2020-01-30|2020-01-19|
+---+----------+----------+
