
How to walk through an ADLS folder using Python?

I am using the below code snippet to walk through the folders and files in DBFS using Python:

import os
import re

import pandas as pd

# contrast and contrast_rolled are filename patterns; the two tot_* variables
# are DataFrames accumulating the contents of all matching CSV files.
for subdir, dirs, files in os.walk("/dbfs/data"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)
        elif re.search(contrast_rolled, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_rolled_sh = tot_contrast_rolled_sh.append(df, sort=False)

I want to implement the above functionality with Python and pandas, but the folder is located in ADLS. How should I proceed? Is there a way to implement this?

To walk through the folders of ADLS in Databricks, you first need to mount the ADLS storage to Databricks.

Mount the ADLS to Databricks using a service principal.

To do that, create an app registration in Azure.

Go to Azure Active Directory -> App registrations -> New registration and create one.

App registration overview:

[screenshot of the app registration's Overview page]

Now create a secret in your App registration.

[screenshot of creating a client secret]
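The secret value is shown only once, so rather than pasting it into notebook code, it can be stored in a Databricks secret scope and read when mounting. A minimal sketch, assuming a scope named adls-scope and a key named sp-secret that you have created beforehand:

# Read the client secret from a Databricks secret scope instead of hard-coding it.
# "adls-scope" and "sp-secret" are placeholder names for a scope/key you create.
client_secret = dbutils.secrets.get(scope="adls-scope", key="sp-secret")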

Code for mounting:

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "< your client id >",
           "fs.azure.account.oauth2.client.secret": "< Secret value >",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/< Directory (tenant) id >/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://< container >@< Storage account >.dfs.core.windows.net/",
    mount_point = "/mnt/< mountpoint >",
    extra_configs = configs)
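After the mount command runs, a quick check that the mount exists and is readable (using the same mount point name as above):

# List all current mounts, then the contents of the new mount point
display(dbutils.fs.mounts())
display(dbutils.fs.ls("/mnt/< mountpoint >"))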

My mounting:

[screenshot of the mount command output]

Now you can access the ADLS folders and files with the path /dbfs/mnt/< mountpoint > in your code.
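For example, the os.walk loop from the question works once its root is pointed at the mount. A sketch, assuming contrast and tot_contrast_sh are defined as in the question:

import os
import re
import pandas as pd

# The mounted ADLS container behaves like any other /dbfs path
for subdir, dirs, files in os.walk("/dbfs/mnt/< mountpoint >"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)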

My files in ADLS:

[screenshot of the files in the ADLS container]

My sample files in Databricks:

[screenshot of the same files listed in Databricks]

I developed the code below to achieve this directly in ADLS, without mounting the ADLS to DBFS:

import pandas as pd

# Use the directory listing as a work stack; folders push their children back
# on, which avoids mutating the list while iterating over it.
# "abfss://data" stands in for the full abfss:// path to the container root.
files = list(dbutils.fs.ls("abfss://data"))
while files:
    file = files.pop()
    if file.path.endswith("/"):
        files.extend(dbutils.fs.ls(file.path))
    elif file.path.endswith(".csv"):
        if saida_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_sh = tot_sh.append(df, sort=False)
        elif tableau_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_tableau_sh = tot_tableau_sh.append(df, sort=False)
        elif funnel_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_funnel_sh = tot_funnel_sh.append(df, sort=False)
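The adls_cred dictionary is not shown above; with the adlfs backend that pandas uses to resolve abfss:// paths, storage_options would typically carry the service principal credentials. A sketch under that assumption, with every value a placeholder:

# Assumed shape of adls_cred for pandas + adlfs; all values are placeholders
adls_cred = {
    "account_name": "< Storage account >",
    "tenant_id": "< Directory (tenant) id >",
    "client_id": "< your client id >",
    "client_secret": "< Secret value >",
}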
