How to select a date range in pyspark dataframe

Question

I want to select the rows of my dataframe whose dates run from 2022 up to the latest date, which may include today, tomorrow, and later dates. How can I achieve that? My current attempt:

df = df.filter(col("sales_date").contains("2022"))

Answer 1

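You can use the between function for a closed range (both bounds are inclusive):

from pyspark.sql.functions import col

df = df.filter(col("date").between("2022-01-01", "2022-12-31"))

or a comparison operator such as '>=' for an open-ended range that also keeps later dates:

df = df.filter(col("date") >= "2022-01-01")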
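Since the question asks for everything from 2022 onward, including future dates, a lower bound on its own may be enough. Below is a minimal sketch, assuming the column holds dates or ISO `yyyy-MM-dd` strings; the helper names `year_start` and `from_year_onward` are hypothetical and not part of PySpark.

```python
from datetime import date


def year_start(year: int) -> date:
    """Return the first calendar day of the given year."""
    return date(year, 1, 1)


def from_year_onward(df, date_col: str, year: int):
    """Keep rows whose date is on or after 1 January of `year`.

    No upper bound is applied, so today's and future dates are kept too.
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F

    # to_date leaves a date column unchanged and parses ISO-formatted strings.
    return df.filter(F.to_date(F.col(date_col)) >= F.lit(year_start(year)))
```

With the question's column this would be called as `df = from_year_onward(df, "sales_date", 2022)`.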

Answer 2

As mentioned above, 'between' will do the trick; just make sure your column is first converted to a proper date or timestamp type: https://sparkbyexamples.com/spark/spark-convert-string-to-timestamp-format/
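To illustrate why the conversion matters: a string column is compared as text, which only orders correctly when the values are in ISO `yyyy-MM-dd` form. The sketch below assumes a hypothetical `dd/MM/yyyy` input format; the helper `with_parsed_date` and its default arguments are illustrative, not taken from the question.

```python
from datetime import datetime

# Hypothetical non-ISO value, as it might appear in a string column.
raw = "31/12/2021"

# Compared as text, it sorts after "2022-01-01" because "3" > "2".
text_match = raw >= "2022-01-01"

# Parsed as a date, it correctly falls before 2022.
parsed = datetime.strptime(raw, "%d/%m/%Y").date()
date_match = parsed >= datetime(2022, 1, 1).date()


def with_parsed_date(df, src_col: str, fmt: str = "dd/MM/yyyy", dest_col: str = "date"):
    """Add a date column parsed from a string column, using a Spark datetime pattern."""
    # Imported lazily so the comparison above runs without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(dest_col, F.to_date(F.col(src_col), fmt))
```

Once the column has been parsed, `between` and `>=` compare real dates rather than text.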

Answer 3

You can use like in the filter, where % works as a wildcard character.

scala> var df = Seq(("2022-01-01"),("2021-02-01")).toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df = df.withColumn("date",col("value").cast("date"))
df: org.apache.spark.sql.DataFrame = [value: string, date: date]

scala> df.printSchema
root
|-- value: string (nullable = true)
|-- date: date (nullable = true)

scala> df.show()
+----------+----------+
|     value|      date|
+----------+----------+
|2022-01-01|2022-01-01|
|2021-02-01|2021-02-01|
+----------+----------+


scala> df.filter(col("date").like("2022%")).show()
+----------+----------+
|     value|      date|
+----------+----------+
|2022-01-01|2022-01-01|
+----------+----------+
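A rough PySpark counterpart of the Scala session above, as a sketch only; the helper `like_year` is hypothetical. Note that a `2022%` pattern matches dates in 2022 alone, so it would drop dates from later years, which the plain-Python lines illustrate on three sample strings.

```python
# Plain-Python analogue of like("2022%"): a prefix test on the text form.
samples = ["2021-02-01", "2022-01-01", "2023-01-09"]
prefix_hits = [s for s in samples if s.startswith("2022")]

# ISO strings order the same way as the dates they represent,
# so a lower bound also keeps later years.
range_hits = [s for s in samples if s >= "2022-01-01"]


def like_year(df, date_col: str, year: int):
    """Keep rows whose date, rendered as text, starts with the given year."""
    # Imported lazily so the lines above run without Spark installed.
    from pyspark.sql import functions as F

    # Cast explicitly so the pattern is matched against the yyyy-MM-dd text.
    return df.filter(F.col(date_col).cast("string").like(f"{year}%"))
```

If later years must be included as well, a lower-bound comparison is likely the better fit than a prefix match.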

solution1
1 2023-01-09 09:01:11

solution2
0 2023-01-09 09:03:47

solution3
0 2023-01-09 09:43:36
