
Reading Excel file from Azure Databricks

I am trying to read an Excel file (.xlsx) from Azure Databricks; the file is in ADLS Gen 2.

Example:

srcPathforParquet = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//src.xlsx"

Reading the Parquet file from the path works fine.

srcparquetDF = spark.read.parquet(srcPathforParquet)

Reading the Excel file from the path throws an error: No such file or directory.

srcexcelDF = pd.read_excel(srcPathforExcel , keep_default_na=False, na_values=[''])

As per my repro, an Excel file in ADLS Gen2 cannot be read directly using the storage account access key. When I tried reading the Excel file via its ADLS Gen2 URL, I got the same error message: FileNotFoundError: [Errno 2] No such file or directory: 'abfss://filesystem@chepragen2.dfs.core.windows.net/flightdata/drivers.xlsx'


Steps to read an Excel file (.xlsx) from Azure Databricks when the file is in ADLS Gen 2:

Step 1: Mount the ADLS Gen2 storage account.

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Step 2: Read the Excel file using the mount path.

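The screenshot that originally illustrated this step is lost. As a hedged sketch (the helper function and the mount/path names below are illustrative, not from the original answer), once the storage is mounted, non-Spark libraries such as pandas can reach the file through the cluster's local /dbfs path:

```python
def dbfs_to_local(path: str) -> str:
    """Translate a DBFS mount path (e.g. /mnt/<mount-name>/...) into the
    local file API path that pandas and other non-Spark libraries can open."""
    return "/dbfs" + path if path.startswith("/mnt/") else path

local_path = dbfs_to_local("/mnt/<mount-name>/1_Raw/src.xlsx")
print(local_path)  # -> /dbfs/mnt/<mount-name>/1_Raw/src.xlsx

# On a Databricks cluster this would then work:
# import pandas as pd
# df = pd.read_excel(local_path, keep_default_na=False, na_values=[''])
```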

Reference: Azure Databricks - Azure Data Lake Storage Gen2

The method pandas.read_excel does not support wasbs or abfss scheme URLs for accessing the file. For more details, please refer to here.

So if you want to access the file with pandas, I suggest you create a SAS token and use the https scheme with the SAS token to access the file, or download the file as a stream and then read it with pandas. Alternatively, you can also mount the storage account as a filesystem and then access the file as @CHEEKATLAPRADEEP-MSFT described.

For example:

  • Access with a SAS token
  1. Create a SAS token via the Azure portal.

  2. Code:

pdf=pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)
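As a small illustrative helper (not part of the original answer; the function name is made up), the https-plus-SAS URL above can be assembled like this:

```python
def sas_url(account: str, file_system: str, path: str, sas_token: str) -> str:
    """Build the https URL with the SAS token as the query string,
    in the shape pandas.read_excel can open directly."""
    token = sas_token.lstrip('?')  # tolerate tokens copied with a leading '?'
    return f"https://{account}.dfs.core.windows.net/{file_system}/{path.lstrip('/')}?{token}"

print(sas_url("<account name>", "<file system>", "<path>", "<sas token>"))
```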


  • Download the file as a stream and read the file
  1. Install the packages azure-storage-file-datalake and xlrd with pip in Databricks (note: xlrd 2.x no longer reads .xlsx, so with newer pandas versions install openpyxl instead).

  2. Code:

import io

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url='https://<account name>.dfs.core.windows.net/', credential='<account key>')

file_client = service_client.get_file_client(file_system='test', file_path='data/sample.xlsx')
with io.BytesIO() as f:
  downloader = file_client.download_file()
  b = downloader.readinto(f)  # number of bytes written into the buffer
  print(b)
  f.seek(0)  # rewind the buffer before handing it to pandas
  df = pd.read_excel(f)
  print(df)
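One pitfall of the download-as-stream pattern above is the buffer position: readinto() leaves the BytesIO cursor at the end of the written data, so the buffer must be rewound with seek(0) before pandas reads from it. A stdlib-only demonstration of that behaviour (no Azure connection needed):

```python
import io

buf = io.BytesIO()
buf.write(b"PK\x03\x04 fake xlsx bytes")  # stand-in for the downloaded payload
print(buf.tell())       # cursor sits at the end of the written data
print(buf.read())       # b'' -- reading from the end yields nothing
buf.seek(0)             # rewind
print(len(buf.read()))  # the full payload is readable again
```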


Besides, we can also use PySpark to read the Excel file, but we need to add the jar com.crealytics:spark-excel to our environment. For more details, please refer to here and here.

For example:

  1. Add the package com.crealytics:spark-excel_2.12:0.13.1 via Maven. Note that if you use Scala 2.11, add the package com.crealytics:spark-excel_2.11:0.13.1 instead.

  2. Code:

spark._jsc.hadoopConfiguration().set("fs.azure.account.key.<account name>.dfs.core.windows.net",'<account key>')

print("use spark")
df=sqlContext.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .load('abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx')

df.show()


From my experience, the following are the basic steps that worked for me for reading an Excel file from ADLS Gen2 in Databricks:

  • Installed the following library on my Databricks cluster:

com.crealytics:spark-excel_2.12:0.13.6

  • Added the below Spark configuration:

spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)

adlsAccountKeyName --> fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net
adlsAccountKeyValue --> SAS key of your ADLS account
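Spelled out with placeholder values (illustrative only; substitute your own account name and key), the two settings look like:

```python
# Placeholder values -- replace with your own storage account name and key.
adlsAccountKeyName = "fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net"
adlsAccountKeyValue = "<sas-or-account-key>"

# On a Databricks cluster this would then be applied with:
# spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)
print(adlsAccountKeyName)
```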

  • Used the below code to get a Spark dataframe out of my Excel file in ADLS:

myDataFrame = (spark.read.format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheetname'!")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .option("startColumn", 0)
  .option("endColumn", 99)
  .option("timestampFormat", "dd-MM-yyyy HH:mm:ss")
  .load(FullFilePathExcel)
)
