spark.read.json(): how to read files using a dynamic year parameter

I have an S3 directory that looks something like this:

my-data/data-events/year/month/day/hour/file.json

I have the years 2018, 2019, 2020, 2021 and 2022. I want to read in everything from 2020 onward. How would I do that without moving files around? I currently have my read function like this, which reads in all files:

spark.read.json("s3a://my-data/data-events/*/*/*/*/*")

How about spark.read.json("s3a://my-data/data-events/{2020,2021,2022}/*/*/*/*")? The braces are a glob alternation, so Spark only lists those three year directories.

To make it dynamic based on the current year:

from datetime import datetime

current_year = datetime.now().year
# Build a comma-separated year list, e.g. "2020,2021,2022"
years_to_query = ",".join(str(x) for x in range(2020, current_year + 1))
# Triple braces: {{ and }} emit literal braces, the middle pair interpolates
path = f"s3a://my-data/data-events/{{{years_to_query}}}/*/*/*/*"
df = spark.read.json(path)

Also, if you had partitioned your data, i.e. the path had been my-data/data-events/year=YYYY/month=MM/day=DD/hour=HH/file.json, you could call spark.read.json("s3a://my-data/data-events") and then .filter(col("year") >= '2020'). That filter would be evaluated against the folder paths (partition pruning) rather than by looking into the JSON files themselves.
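A minimal sketch of that partitioned read (the year=/month=/... layout and the bucket name are assumptions carried over from the paths above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# With Hive-style directories (year=2020/month=01/day=05/hour=13/...),
# Spark discovers year, month, day, and hour as partition columns.
df = spark.read.json("s3a://my-data/data-events")

# Partition pruning: Spark skips listing and reading any year < 2020
# folder instead of scanning the JSON contents.
recent = df.filter(col("year") >= 2020)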
