spark.read.json(): how to read files using a dynamic year parameter

I have an S3 directory that looks something like this:

my-data/data-events/year/month/day/hour/file.json

I have the years 2018, 2019, 2020, 2021 and 2022. I want to read in everything from 2020 onward. How would I do that without moving files around? I currently have my read function like this, which reads in all files:

spark.read.json("s3a://my-data/data-events/*/*/*/*/*")

How about spark.read.json("s3a://my-data/data-events/{2020,2021,2022}/*/*/*/*")? The braces are a glob alternation, so Spark only lists those three year directories.

To make it dynamic based on the current year:

from datetime import datetime

current_year = datetime.now().year
# Build a comma-separated year list, e.g. "2020,2021,2022"
years_to_query = ",".join(str(x) for x in range(2020, current_year + 1))
# Triple braces: {{ and }} emit literal braces, the middle pair interpolates
path = f"s3a://my-data/data-events/{{{years_to_query}}}/*/*/*/*"
df = spark.read.json(path)

Also, if you had partitioned your data, i.e. the path had been my-data/data-events/year=YYYY/month=MM/day=DD/hour=HH/file.json, you could call spark.read.json("s3a://my-data/data-events") and then .filter(col("year") >= '2020'). That filter would be evaluated against the folder paths (partition pruning) rather than by looking into the JSON files themselves.
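A minimal sketch of that partitioned read (the year=/month=/... layout and the bucket name are assumptions carried over from the paths above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# With Hive-style directories (year=2020/month=01/day=05/hour=13/...),
# Spark discovers year, month, day, and hour as partition columns.
df = spark.read.json("s3a://my-data/data-events")

# Partition pruning: Spark skips listing and reading any year < 2020
# folder instead of scanning the JSON contents.
recent = df.filter(col("year") >= 2020)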
