
List json files, partitioned by year/month/day, from Azure account storage using PySpark

I have Azure account storage with JSON files, partitioned by year/month/day/hour. I need to list all of the JSONs between two dates, e.g. 20200505 to 20201220, so that I have a list of url/dir. I do not need to load any content, just to list all files that live between these two dates.

I need to use Azure Databricks with PySpark for this. Is it possible to just use something like:

.load(from "<Path>/y=2020/month=05/day=05/**/*.json" to "<Path>/y=2020/month=12/day=20/**/*.json")

Here is the structure of the Azure account storage: [image: container layout with year=/month=/day=/hour= directories]

Spark does not provide a generic way of selecting an interval of date partitions, but you can try to specify the ranges manually, as below:

.load([
    "<Path>/year=2020/month=05/day={0[5-9],[1-3][0-9]}/**/*.json",      # 2020-05-05 .. 2020-05-31
    "<Path>/year=2020/month={0[6-9],1[0-1]}/day=[0-3][0-9]/**/*.json",  # June .. November, all days
    "<Path>/year=2020/month=12/day={[0-1][0-9],20}/**/*.json",          # 2020-12-01 .. 2020-12-20
])
