Suppose you have a dataframe as follows:
+---+----------+----------+
| id| date_a| date_b|
+---+----------+----------+
| 1|2020-01-30|2020-01-19|
| 1|2020-01-10|2020-01-19|
| 1|2020-01-10|2020-01-26|
| 1|2020-01-30|2020-01-26|
| 2|2020-01-05|2020-01-08|
| 3|2020-01-08|2020-01-10|
| 3|2020-01-12|2020-01-10|
+---+----------+----------+
For each id, there are date_a and date_b values, in various combinations.
I'd like to keep only the entries where, for a single id, date_b lies outside a set time range around every date_a.
A visual interpretation for id = 1 looks like (horizontal is time axis):
|---x---| o |-o--x---|
where x = date_a, o = date_b and |--- ---| indicates the time range (i.e. +- 5 days).
Thus, the "o" (date_b) entries to keep are those that fall within none of the date_a time ranges (here, the first "o").
Example input/output:
Input:
from pyspark.sql import functions as F

# assumes an active SparkSession named `spark`
df = spark.createDataFrame(
    [(1, '2020-01-10', '2020-01-19'),
     (1, '2020-01-10', '2020-01-26'),
     (1, '2020-01-30', '2020-01-19'),
     (1, '2020-01-30', '2020-01-26'),
     (2, '2020-01-05', '2020-01-08'),
     (3, '2020-01-08', '2020-01-10'),
     (3, '2020-01-12', '2020-01-10')],
    ['id', 'date_a', 'date_b']
)
df = df.withColumn('date_a', F.to_date('date_a'))
df = df.withColumn('date_b', F.to_date('date_b'))
# diff = date_b - date_a, in days
df = df.withColumn('diff', F.datediff(df.date_b, df.date_a))
df.orderBy('id', 'date_b').show()
+---+----------+----------+----+
| id| date_a| date_b|diff|
+---+----------+----------+----+
| 1|2020-01-30|2020-01-19| -11|
| 1|2020-01-10|2020-01-19| 9|
| 1|2020-01-30|2020-01-26| -4|
| 1|2020-01-10|2020-01-26| 16|
| 2|2020-01-05|2020-01-08| 3|
| 3|2020-01-08|2020-01-10| 2|
| 3|2020-01-12|2020-01-10| -2|
+---+----------+----------+----+
Within the same id, we want to get the date_b values where diff is > 5 or < -6 for all rows with the same date_b (that is, date_b is outside of the interval [date_a - 6, date_a + 5] for every date_a).
I.e.:
For id=1, date_b='2020-01-19': (-11 > 5 | -11 < -6) & (9 > 5 | 9 < -6) -> entry is kept (True & True)
For id=1, date_b='2020-01-26': (-4 > 5 | -4 < -6) & (16 > 5 | 16 < -6) -> entry is discarded (False & True)
...
Expected output:
+---+----------+----------+
| id| date_a| date_b|
+---+----------+----------+
| 1|2020-01-10|2020-01-19|
| 1|2020-01-30|2020-01-19|
+---+----------+----------+
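To double-check the rule and the expected output without Spark, here is a minimal plain-Python sketch. The helper name `is_outside` and its `before`/`after` parameters are made up for illustration; they just encode the "diff > 5 or diff < -6" condition from the question.

```python
from collections import defaultdict
from datetime import date

rows = [
    (1, date(2020, 1, 10), date(2020, 1, 19)),
    (1, date(2020, 1, 10), date(2020, 1, 26)),
    (1, date(2020, 1, 30), date(2020, 1, 19)),
    (1, date(2020, 1, 30), date(2020, 1, 26)),
    (2, date(2020, 1, 5), date(2020, 1, 8)),
    (3, date(2020, 1, 8), date(2020, 1, 10)),
    (3, date(2020, 1, 12), date(2020, 1, 10)),
]


def is_outside(date_a, date_b, before=6, after=5):
    """True if date_b is outside [date_a - before, date_a + after] (hypothetical helper)."""
    diff = (date_b - date_a).days
    return diff > after or diff < -before


# Collect one flag per row, grouped by (id, date_b)
flags = defaultdict(list)
for id_, date_a, date_b in rows:
    flags[(id_, date_b)].append(is_outside(date_a, date_b))

# Keep a row only if every row sharing its (id, date_b) is outside the range
kept = [r for r in rows if all(flags[(r[0], r[2])])]
for r in kept:
    print(r[0], r[1], r[2])
```

Only the two rows with id=1 and date_b='2020-01-19' survive, which matches the expected output.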
Here is a possible approach you can try (comments inline):
from pyspark.sql import Window
from pyspark.sql import functions as F

# one window per (id, date_b) pair; no ordering, so aggregates cover the whole partition
w = Window.partitionBy("id", "date_b")
cond = (F.col("diff") > 5) | (F.col("diff") < -6)
# check if the condition is true and get the sum over the window
sum_of_true_on_w = F.sum(cond.cast("integer")).over(w)
# get the window size (row count) to compare with the sum
size_of_window = F.count(F.lit(1)).over(w)
# filter where the sum over the window is equal to the size of the window
(df.withColumn("Sum_bool", sum_of_true_on_w)
   .withColumn("Window_Size", size_of_window)
   .filter(F.col("Sum_bool") == F.col("Window_Size"))
   .drop("diff", "Sum_bool", "Window_Size")).show()
+---+----------+----------+
| id| date_a| date_b|
+---+----------+----------+
| 1|2020-01-10|2020-01-19|
| 1|2020-01-30|2020-01-19|
+---+----------+----------+