How to create a directory in ADLS Gen2 from PySpark on Databricks

Question

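Summary: I am working on a use case where I want to write images with cv2 to ADLS from inside a PySpark streaming job on Databricks, but this does not work if the target directory does not exist. I want to store the images in a specific folder structure based on image attributes, so I need to check at runtime whether the directory exists and create it if it does not. Initially I tried dbutils, but dbutils cannot be used inside the PySpark API. https://github.com/MicrosoftDocs/azure-docs/issues/28070

Expected result: be able to create a directory in ADLS Gen2 at runtime from within a PySpark streaming job.

Reproducible code: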

import pyspark.sql.functions as F

# Read images in batch for simplicity
df = (spark.read.format('binaryFile')
      .option('recursiveFileLookup', True)
      .option('pathGlobFilter', '*.jpg')
      .load(path_to_source))

# Get necessary columns
df = (df.withColumn('ingestion_timestamp', F.current_timestamp())
      .withColumn('source_ingestion_date', F.to_date(F.split('path', '/')[10]))
      # Convert the dbfs: URI into the local /dbfs/ path that cv2 can read
      .withColumn('source_image_path', F.regexp_replace(F.col('path'), 'dbfs:', '/dbfs/'))
      .withColumn('source_image_time', F.substring(F.split('path', '/')[12], 0, 8))
      .withColumn('year', F.date_format(F.to_date(F.col('source_ingestion_date')), 'yyyy'))
      .withColumn('month', F.date_format(F.to_date(F.col('source_ingestion_date')), 'MM'))
      .withColumn('day', F.date_format(F.to_date(F.col('source_ingestion_date')), 'dd'))
      # Target directory: .../year=YYYY/month=MM/day=dd
      .withColumn('base_path', F.concat(F.lit('/dbfs/mnt/development/testing'),
                                        F.lit('/year='), F.col('year'),
                                        F.lit('/month='), F.col('month'),
                                        F.lit('/day='), F.col('day'))))

import cv2

# Function to be called in the foreach call
# (CheckPathExists is a helper that is not shown in the question)
def processRow(row):
    source_image_path = row['source_image_path']
    base_path = row['base_path']
    source_image_time = row['source_image_time']
    if not CheckPathExists(base_path):
        dbutils.fs.mkdirs(base_path)
    full_path = f"{base_path}/{source_image_time}.jpg"
    im = cv2.imread(source_image_path)
    cv2.imwrite(full_path, im)

# This fails

df.foreach(processRow)

# Due to the code block below
if not CheckPathExists(base_path):
    dbutils.fs.mkdirs(base_path)
full_path = f"{base_path}/{source_image_time}.jpg"
im = cv2.imread(source_image_path)
cv2.imwrite(full_path, im)

Does anyone have any suggestions?

Answer 1

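As far as I know, dbutils.fs.mkdirs(base_path) works for paths of the form dbfs:/mnt/mount_point/folder.

I reproduced this: when I called the mkdirs function with a path like /dbfs/mnt/mount_point/folder, the folder was not created in ADLS, even though Databricks returned True.

With dbfs:/mnt/mount_point/folder, however, it works fine.

That may be the problem here. So first check whether the path exists using the /dbfs/mnt/mount_point/folder form, and if it does not, create the directory using the dbfs:/ form of the same path.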
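The two path forms differ only in their prefix, so the conversion can be wrapped in small helpers. This is a minimal sketch; the function names `to_dbfs_uri` and `to_local_path` are hypothetical and not part of dbutils or any Databricks API.

```python
def to_dbfs_uri(local_path: str) -> str:
    """Turn '/dbfs/mnt/x' into 'dbfs:/mnt/x'; leave other paths unchanged."""
    prefix = "/dbfs/"
    if local_path.startswith(prefix):
        return "dbfs:/" + local_path[len(prefix):]
    return local_path


def to_local_path(dbfs_uri: str) -> str:
    """Turn 'dbfs:/mnt/x' into '/dbfs/mnt/x'; leave other paths unchanged."""
    prefix = "dbfs:/"
    if dbfs_uri.startswith(prefix):
        return "/dbfs/" + dbfs_uri[len(prefix):]
    return dbfs_uri


print(to_dbfs_uri("/dbfs/mnt/mount_point/folder"))   # dbfs:/mnt/mount_point/folder
print(to_local_path("dbfs:/mnt/mount_point/folder")) # /dbfs/mnt/mount_point/folder
```

The local form is the one to pass to os.path.exists, and the dbfs:/ form is the one to pass to dbutils.fs.mkdirs.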

Example:

import os

base_path = "/dbfs/mnt/data/folder1"
print("before : ", os.path.exists(base_path))

if not os.path.exists(base_path):
    # Swap the local "/dbfs" prefix for the "dbfs:" scheme that dbutils expects
    base_path2 = "dbfs:" + base_path[5:]
    dbutils.fs.mkdirs(base_path2)

print("after : ", os.path.exists(base_path))


You can see that the folder has been created.
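The question notes that dbutils cannot be called inside foreach, so the example above would still need to run on the driver. One possible alternative, which the answer did not test, is to create the directory through the local /dbfs path with os.makedirs. The sketch below assumes the /dbfs mount is visible and writable on the worker nodes (the question's own cv2.imread and cv2.imwrite calls already rely on that); `build_target_path` is a hypothetical helper name.

```python
import os


def build_target_path(base_path: str, image_time: str) -> str:
    """Create base_path if it is missing and return the .jpg path inside it."""
    # exist_ok=True keeps this safe when many rows map to the same directory
    os.makedirs(base_path, exist_ok=True)
    return os.path.join(base_path, f"{image_time}.jpg")
```

Inside processRow this would replace the CheckPathExists / dbutils.fs.mkdirs lines, for example full_path = build_target_path(row['base_path'], row['source_image_time']), followed by the existing cv2.imwrite call.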


If you don't want to use os directly, you can check whether the path exists by listing it and create the directory if it is missing.
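The answer's original listing was posted as an image and is not available here, so the following is only a sketch of one way the check could look, not the answerer's code. It assumes that listing a missing path raises an exception, as dbutils.fs.ls does; `ensure_dir` is a hypothetical helper, and the filesystem object is passed in as a parameter so the function does not depend on a global dbutils.

```python
def ensure_dir(fs, path: str) -> bool:
    """Create `path` if listing it fails.

    `fs` is any object with ls(path) and mkdirs(path), e.g. dbutils.fs.
    Returns True if the directory was created, False if it already existed.
    """
    try:
        fs.ls(path)  # raises when the path does not exist
        return False
    except Exception:
        # Assumption: any failure to list means "missing"; this also
        # swallows other errors such as permission problems.
        fs.mkdirs(path)
        return True
```

On a Databricks driver this could be called as ensure_dir(dbutils.fs, "dbfs:/mnt/data/folder1"), using the dbfs:/ form of the path described above.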



Solution 1
0 2022-12-28 11:30:46
