List JSON files, partitioned by year/month/day, from Azure account storage using PySpark
I have Azure account storage with JSON files, partitioned by year/month/day/hour. I need to list all JSON files between two dates, e.g. 20200505 to 20201220, so that I have a list of URLs/directories. I do not need to load any content, just to list all files that fall between these two dates.
I need to use Azure Databricks with PySpark for this. Is it possible to just use something like:

.load(from "<Path>/y=2020/month=05/day=05/**/*.json" to "<Path>/y=2020/month=12/day=20/**/*.json")
Spark does not provide a generic way of selecting an interval of date partitions, but you can try to specify the ranges manually as below:
.load(
"<Path>/year=2020/month=05/day={0[5-9],[1-3][0-9]}/**/*.json",
"<Path>/year=2020/month={0[6-9],1[0-1]}/day=[0-3][0-9]/**/*.json",
"<Path>/year=2020/month=12/day={[0-1][0-9],20}/**/*.json",
)
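If the hand-written glob ranges get awkward, an alternative (a minimal sketch, not from the original answer) is to generate one explicit glob per day and pass the whole list to the reader. Here `daily_paths` is a hypothetical helper name, and `<Path>` is the question's placeholder base path:

```python
from datetime import date, timedelta

def daily_paths(base, start, end):
    """Yield one glob per day in [start, end] for a year=/month=/day= partition layout."""
    d = start
    while d <= end:
        # "**" matches the hour-level subdirectories, as in the patterns above
        yield f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/**/*.json"
        d += timedelta(days=1)

paths = list(daily_paths("<Path>", date(2020, 5, 5), date(2020, 12, 20)))
# paths[0]  -> "<Path>/year=2020/month=05/day=05/**/*.json"
# paths[-1] -> "<Path>/year=2020/month=12/day=20/**/*.json"
```

The resulting list can then be passed to the DataFrame reader, e.g. `spark.read.json(*paths)`, since the reader accepts multiple paths; this avoids any over-matching that a broad glob range might allow.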