
How to walk through an ADLS folder using Python?

I am using the below code snippet to walk through the folders and files in DBFS using Python:

import os
import re

import pandas as pd

# contrast and contrast_rolled are filename patterns; the two tot_* variables
# are DataFrames accumulating the contents of all matching CSV files.
for subdir, dirs, files in os.walk("/dbfs/data"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)
        elif re.search(contrast_rolled, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_rolled_sh = tot_contrast_rolled_sh.append(df, sort=False)

I want to implement the above functionality with Python and pandas, but the folder is located in ADLS. How should I proceed? Is there a way to implement this?

To walk through the folders of ADLS in Databricks, you first need to mount the ADLS storage to Databricks.

Mount the ADLS to Databricks using a service principal.

To do that, create an app registration in Azure.

Go to Azure Active Directory -> App registrations -> New registration and create one.

App registration overview:

[screenshot of the app registration's Overview page]

Now create a secret in your App registration.

[screenshot of creating a client secret]
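The secret value is shown only once, so rather than pasting it into notebook code, it can be stored in a Databricks secret scope and read when mounting. A minimal sketch, assuming a scope named adls-scope and a key named sp-secret that you have created beforehand:

# Read the client secret from a Databricks secret scope instead of hard-coding it.
# "adls-scope" and "sp-secret" are placeholder names for a scope/key you create.
client_secret = dbutils.secrets.get(scope="adls-scope", key="sp-secret")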

Code for mounting:

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "< your client id >",
           "fs.azure.account.oauth2.client.secret": "< Secret value >",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/< Directory (tenant) id >/oauth2/token"}

dbutils.fs.mount(
    source = "abfss://< container >@< Storage account >.dfs.core.windows.net/",
    mount_point = "/mnt/< mountpoint >",
    extra_configs = configs)
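After the mount command runs, a quick check that the mount exists and is readable (using the same mount point name as above):

# List all current mounts, then the contents of the new mount point
display(dbutils.fs.mounts())
display(dbutils.fs.ls("/mnt/< mountpoint >"))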

My mounting:

[screenshot of the mount command output]

Now you can access the ADLS folders and files with the path /dbfs/mnt/< mountpoint > in your code.
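For example, the os.walk loop from the question works once its root is pointed at the mount. A sketch, assuming contrast and tot_contrast_sh are defined as in the question:

import os
import re
import pandas as pd

# The mounted ADLS container behaves like any other /dbfs path
for subdir, dirs, files in os.walk("/dbfs/mnt/< mountpoint >"):
    for file in files:
        if re.search(contrast, file):
            df = pd.read_csv(os.path.join(subdir, file))
            tot_contrast_sh = tot_contrast_sh.append(df, sort=False)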

My files in ADLS:

[screenshot of the files in the ADLS container]

My sample files in Databricks:

[screenshot of the same files listed in Databricks]

I developed the code below to achieve this directly in ADLS, without mounting the ADLS to DBFS:

import pandas as pd

# Use the directory listing as a work stack; folders push their children back
# on, which avoids mutating the list while iterating over it.
# "abfss://data" stands in for the full abfss:// path to the container root.
files = list(dbutils.fs.ls("abfss://data"))
while files:
    file = files.pop()
    if file.path.endswith("/"):
        files.extend(dbutils.fs.ls(file.path))
    elif file.path.endswith(".csv"):
        if saida_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_sh = tot_sh.append(df, sort=False)
        elif tableau_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_tableau_sh = tot_tableau_sh.append(df, sort=False)
        elif funnel_import in file.path:
            df = pd.read_csv(file.path, storage_options=adls_cred)
            tot_funnel_sh = tot_funnel_sh.append(df, sort=False)
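The adls_cred dictionary is not shown above; with the adlfs backend that pandas uses to resolve abfss:// paths, storage_options would typically carry the service principal credentials. A sketch under that assumption, with every value a placeholder:

# Assumed shape of adls_cred for pandas + adlfs; all values are placeholders
adls_cred = {
    "account_name": "< Storage account >",
    "tenant_id": "< Directory (tenant) id >",
    "client_id": "< your client id >",
    "client_secret": "< Secret value >",
}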
