
filter pyspark dataframe rows between two different date ranges

I have a pyspark dataframe with an "id" column and a date column "parsed_date" (dtype: date, format: YYYY-mm-dd). I have created a function that gives me the count of ids for each day in a given date range.

The function returns 2 dataframes: df1 with rows from a date range of ±1 week, and df2 with rows ±2 weeks from the given day, in the following way:

df1 should filter rows from range 1: (day - t, day + t). df2 should filter rows from range 2: (day - 2t, day - t) appended with (day + t, day + 2t).
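For concreteness, here is what the two ranges work out to for day = '2017-12-18' and t = 2, sketched with plain datetime arithmetic (illustration only; the actual filtering happens in Spark SQL):

```python
from datetime import date, timedelta

day, t = date(2017, 12, 18), 2

# Range 1 (df1): a symmetric window of t days on either side of the day
range1 = (day - timedelta(days=t), day + timedelta(days=t))

# Range 2 (df2): the two flanking windows, t..2t days before and after the day
range2_before = (day - timedelta(days=2 * t), day - timedelta(days=t))
range2_after = (day + timedelta(days=t), day + timedelta(days=2 * t))

# range1:        2017-12-16 .. 2017-12-20
# range2_before: 2017-12-14 .. 2017-12-16
# range2_after:  2017-12-20 .. 2017-12-22
```

These bounds match the parsed_date values in the two output tables below.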

This is the code that I have been using for creating df1 with the required date range, but I don't know how to append/concat the second date range for df2:

from pyspark.sql import functions as F, Window

def part_1(df, day, t):
    """
    Example usage: df_list = part_1(df, '2017-12-18', 2)
    Returns a list of 2 dataframes.
    """
    h1_df1 = (df.filter(f"parsed_date between '{day}' - interval {t} days and '{day}' + interval {t} days")
              .withColumn('count_before', F.count('id').over(Window.partitionBy('parsed_date')))
              .orderBy('parsed_date')
              )

    # this part doesn't work: DataFrame has no .concat() method
    h1_df2 = (df.filter(f"parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days")
              .concat(f"parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days")
              .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
              .orderBy('parsed_date')
              )

    return [h1_df1, h1_df2]

Output for h1_df1 on calling part_1(df, '2017-12-18', 2):

+-------+-----------+------------+
|     id|parsed_date|count_before|
+-------+-----------+------------+
|1471783| 2017-12-16|           2|
|1471885| 2017-12-16|           2|
|1472928| 2017-12-17|           2|
|1476917| 2017-12-17|           2|
|1477469| 2017-12-18|           1|
|1478190| 2017-12-19|           4|
|1478570| 2017-12-19|           4|
|1481415| 2017-12-19|           4|
|1472592| 2017-12-19|           4|
|1474023| 2017-12-20|           1|
+-------+-----------+------------+

Expected outcome for h1_df2 on calling part_1(df, '2017-12-18', 2):

+-------+-----------+------------+
|     id|parsed_date| count_after|
+-------+-----------+------------+
|1471783| 2017-12-14|           1|
|1471885| 2017-12-16|           3|
|1472928| 2017-12-16|           3|
|1476917| 2017-12-16|           3|
|1477469| 2017-12-20|           2|
|1478190| 2017-12-20|           2|
|1478570| 2017-12-21|           2|
|1481415| 2017-12-21|           2|
|1472592| 2017-12-22|           2|
|1474023| 2017-12-22|           2|
+-------+-----------+------------+

I would love to get some help creating h1_df2. I gave it a try, but it doesn't work.

Please help!

You can use "or" to combine the two filter conditions, not "concat":

h1_df2 = (df.filter(f"""
    (parsed_date between '{day}' - interval {t*2} days and '{day}' - interval {t} days)
    or
    (parsed_date between '{day}' + interval {t} days and '{day}' + interval {t*2} days)
""")
          .withColumn('count_after', F.count('id').over(Window.partitionBy('parsed_date')))
          .orderBy('parsed_date')
          )
