Need to read max date folder files in PySpark - Databricks
One approach is to use the Hadoop FS API method listStatus. First, list all subfolders under the xyz folder and take the one with the greatest name as the max month folder. Then repeat the same step to get the max day folder:
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
folder_path = Path("/path/to/xyz")
fs = folder_path.getFileSystem(sc._jsc.hadoopConfiguration())
# list all month subfolders as (full path, folder name) pairs
month_folders = [(f.getPath().toString(), f.getPath().getName()) for f in fs.listStatus(folder_path) if f.isDirectory()]
# pick the folder with the greatest name
max_month_folder = max(month_folders, key=lambda x: x[1])[0]
# now list the day subfolders under the max month, same as above
day_folders = [(f.getPath().toString(), f.getPath().getName()) for f in fs.listStatus(Path(max_month_folder)) if f.isDirectory()]
max_day_folder = max(day_folders, key=lambda x: x[1])[0]
# read the files from the most recent day folder
spark.read.csv(max_day_folder)
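The max-by-name step above relies on zero-padded date folder names sorting lexicographically in chronological order. That part can be sketched in plain Python; the listing tuples below are hypothetical stand-ins for what fs.listStatus would return:

```python
# Hypothetical month listing as (full path, folder name) pairs, mimicking
# the tuples built from fs.listStatus in the snippet above.
month_folders = [
    ("/path/to/xyz/202301", "202301"),
    ("/path/to/xyz/202303", "202303"),
    ("/path/to/xyz/202302", "202302"),
]

# Zero-padded yyyyMM (or dd) names sort the same way as dates, so the
# lexicographic max of the names is the most recent folder.
max_month_folder = max(month_folders, key=lambda t: t[1])[0]
print(max_month_folder)  # → /path/to/xyz/202303
```

Note this only works when names are consistently zero-padded; a folder named "2023-9" would sort after "2023-10" and break the assumption.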