
List only the subfolder names using Spark and Python (ADLS Gen 2)

I have a folder structure with a source, the year, the month, the day, and then a parquet file; here I store data every day in a new folder.

Source

  • 2022
    • 12
      • 30
      • 31
  • 2023
    • 01
      • 01
      • 02
      • 03

Etc.

I need to be able to dynamically select the latest folder. In this scenario that is folder 2023/01/03, but I can't seem to get it.

I've tried importing os and using the following code:

pq_date_folders = f'{abfss_path}/{var_table}/.'  

for root, dirs, files in os.walk(pq_date_folders, topdown=False):
    for name in dirs: 
        print(os.path.join(root, name))

But nothing gets printed. What am I doing wrong?

The data is stored in ADLS Gen 2 and queried through Databricks using Python.

The problem is that you are using the os library to do this, but the Databricks cluster and the data lake live on different machines/networks. Databricks uses credentials to connect to the data lake to get the data, and you need to pass those credentials to any operation you perform on that data. Fortunately, these credentials already exist in your Spark session, so you can use Hadoop with the Spark session configuration to query the data in your data lake.

I implemented a function that gets the max path under a directory; once we have the max path, we check its subdirectories and get the max path again, and so on (tested on Azure Databricks with an ADLS Gen 2 data lake):

# First, make sure to install the hdfs library:
!pip install hdfs

Then:

# Function to get the greatest (max) directory under a path:
def getLastPath(path, fs):
    pathsList = [str(x.getPath()) for x in fs.listStatus(Path(path))]
    return sorted(pathsList)[-1]

Then use it like this on the root path that contains the folders 2022, 2023, etc.:

# Hadoop Path class and FileSystem handle, obtained through the Spark JVM gateway
path = "dbfs:/mnt/xxx-dls/root_path/"
Path = spark.sparkContext._gateway.jvm.org.apache.hadoop.fs.Path
fs = Path(path).getFileSystem(spark.sparkContext._jsc.hadoopConfiguration())

# Keep descending into the greatest entry until it is no longer a directory
while fs.isDirectory(Path(getLastPath(path, fs))):
    path = getLastPath(path, fs)
print(path)

Another option, if you are only using Databricks, is to use dbutils.fs.ls("/path/..") and get the max folder in each directory.
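
A minimal sketch of that dbutils-only variant could look like the following (the root path below is a placeholder, and it assumes directory entries returned by dbutils.fs.ls have names ending with "/"):

path = "dbfs:/mnt/xxx-dls/root_path/"

def last_subdir(p):
    # dbutils.fs.ls returns FileInfo objects; directory names end with "/"
    dirs = [f.path for f in dbutils.fs.ls(p) if f.name.endswith("/")]
    return max(dirs) if dirs else None

# Walk down one level at a time, always taking the greatest (latest) folder
nxt = last_subdir(path)
while nxt is not None:
    path = nxt
    nxt = last_subdir(path)
print(path)  # e.g. .../2023/01/03/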

I ended up using the following link from OneCricketeer:

https://github.com/Azure/azure-data-lake-store-python

The code below gave me the paths, and from those I was able to extract the names:

folder = dbutils.fs.ls(path)
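
For completeness, a possible sketch of that name-extraction step (assuming the FileInfo objects returned by dbutils.fs.ls expose .name, with a trailing "/" for directories):

folder = dbutils.fs.ls(path)

# Keep only directories and strip the trailing "/" to get bare folder names
names = [f.name.rstrip("/") for f in folder if f.name.endswith("/")]
print(names)              # e.g. ['2022', '2023']
print(sorted(names)[-1])  # latest folder name at this level, e.g. '2023'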

As you are doing this in PySpark, you can also use the alternative below, using glob.

For this, you first need to mount the ADLS storage to Databricks.
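
As a rough sketch (not the exact commands used here; every value in angle brackets is a placeholder you would replace with your own), mounting an ADLS Gen 2 container with a service principal typically looks like this:

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<secret-scope>", key="<secret-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

# Mount the container so it is reachable under /dbfs/mnt/data from Python
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/data",
    extra_configs=configs
)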

These are my folders in Storage:

[screenshot of the folder structure in storage]

"here I store data every day in a new folder."

If you store the data every day (a new folder is created for each day), then you can do it as below.

import glob
import datetime

# If your folder creation and file uploading are done on a day-to-day basis
latest_date = datetime.datetime.today().strftime('%Y/%m/%d')

print("Latest_date : ", latest_date)
for x in glob.iglob('/dbfs/mnt/data/**', recursive=True):
    if latest_date in x:
        print(x)

[screenshot: output of the code above]

If your folder creation is done on a regular basis, for example every 10 days, then you can do it as below.

import glob
import datetime

# If your folder creation and file uploading are done on a 10-day basis
d = datetime.datetime.today()
dates_list = [(d - datetime.timedelta(days=x)).strftime('%Y/%m/%d') for x in range(10)]

print("Last 10 days for sample : ", dates_list)
for x in dates_list:
    for y in glob.iglob('/dbfs/mnt/data/**', recursive=True):
        if x in y:
            print(y)

[screenshot: output of the code above]
