
List json files, partitioned by year/month/day, from Azure account storage using PySpark

I have Azure account storage with JSON files, partitioned by year/month/day/hour. I need to list all of the JSONs between two dates, e.g. 20200505 to 20201220, so that I have a list of url/dir. I do not need to load any content, just to list all files that live between these two dates.

I need to use Azure Databricks with PySpark for this. Is it possible to just use something like:

.load(from "<Path>/y=2020/month=05/day=05/**/*.json" to "<Path>/y=2020/month=12/day=20/**/*.json")

Here is the structure of the Azure account storage: [image: container layout with year=/month=/day=/hour= directories]

Spark does not provide a generic way of selecting an interval of date partitions, but you can try to specify the ranges manually, as below:

.load([
    "<Path>/year=2020/month=05/day={0[5-9],[1-3][0-9]}/**/*.json",      # 2020-05-05 .. 2020-05-31
    "<Path>/year=2020/month={0[6-9],1[0-1]}/day=[0-3][0-9]/**/*.json",  # June .. November, all days
    "<Path>/year=2020/month=12/day={[0-1][0-9],20}/**/*.json",          # 2020-12-01 .. 2020-12-20
])
