Reading Excel file from Azure Databricks
I am trying to read an Excel file (.xlsx) from Azure Databricks; the file is in ADLS Gen 2.
Example:
srcPathforParquet = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//src.xlsx"
Reading the parquet file from the path works fine.
srcparquetDF = spark.read.parquet(srcPathforParquet)
Reading the Excel file from the path throws an error: No such file or directory
srcexcelDF = pd.read_excel(srcPathforExcel, keep_default_na=False, na_values=[''])
As per my repro, an Excel file in ADLS Gen2 cannot be accessed directly using the storage account access key. When I tried reading the Excel file via its ADLS Gen2 URL, I got the same error message: FileNotFoundError: [Errno 2] No such file or directory: 'abfss://filesystem@chepragen2.dfs.core.windows.net/flightdata/drivers.xlsx'.
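The stack trace above suggests the abfss:// URL was handed to the local filesystem, where no such path exists. A minimal reproduction of that failure mode (the URL is the one from the error message, not a real resource):

```python
# Opening the abfss:// URL as if it were a local path fails the same way
# pandas did in the question.
try:
    open("abfss://filesystem@chepragen2.dfs.core.windows.net/flightdata/drivers.xlsx", "rb")
except FileNotFoundError as e:
    print(e.errno)  # 2 == ENOENT, i.e. "No such file or directory"
```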
Steps to read an Excel file (.xlsx) from Azure Databricks when the file is in ADLS Gen 2:
Step 1: Mount the ADLS Gen2 storage account.
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
Step 2: Read the Excel file using the mount path.
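Once the mount exists, the file can be read with plain pandas through the driver-local /dbfs view of the mount. A minimal sketch, where the mount name and file path below are hypothetical placeholders:

```python
import os

# Hypothetical mount name and file path -- substitute your own.
mount_point = "/mnt/adls"
relative_path = "1_Raw/src.xlsx"

# On the driver node the mount is exposed under /dbfs, which gives pandas the
# plain local-filesystem path it needs (pandas cannot open dbfs:/ or abfss:// URIs).
local_path = os.path.join("/dbfs", mount_point.lstrip("/"), relative_path)
print(local_path)  # /dbfs/mnt/adls/1_Raw/src.xlsx

# On a Databricks cluster this line would then load the workbook:
# df = pd.read_excel(local_path, keep_default_na=False, na_values=[''])
```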
Reference: Azure Databricks - Azure Data Lake Storage Gen2
The method pandas.read_excel does not support using a wasbs or abfss scheme URL to access the file. For more details, please refer to here.
So if you want to access the file with pandas, I suggest you create a SAS token and use the https scheme with the SAS token to access the file, or download the file as a stream and then read it with pandas. Meanwhile, you can also mount the storage account as a filesystem and then access the file as @CHEEKATLAPRADEEP-MSFT said.
For example
pdf=pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)
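As a concrete sketch of the URL shape pandas expects, where every name below is a placeholder rather than a real resource:

```python
# All values are placeholders; generate a real SAS token in the Azure portal
# or with the azure-storage SDK.
account = "mystorageaccount"
file_system = "test"
file_path = "data/sample.xlsx"
sas_token = "sv=2020-02-10&ss=b&srt=co&sp=rl&sig=REDACTED"

# https scheme + SAS token, which pandas.read_excel can fetch directly.
url = f"https://{account}.dfs.core.windows.net/{file_system}/{file_path}?{sas_token}"
print(url)

# On a cluster with network access this would fetch and parse the workbook:
# pdf = pd.read_excel(url)
```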
Install the packages azure-storage-file-datalake and xlrd with pip in Databricks.
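In a Databricks notebook the install is typically done with the %pip magic (note that on newer pandas stacks you may need openpyxl instead of xlrd for .xlsx files):

```shell
# Run in a Databricks notebook cell; installs into the notebook-scoped environment.
%pip install azure-storage-file-datalake xlrd
```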
Code
import io
import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url='https://<account name>.dfs.core.windows.net/', credential='<account key>')
file_client = service_client.get_file_client(file_system='test', file_path='data/sample.xlsx')
with io.BytesIO() as f:
    downloader = file_client.download_file()
    b = downloader.readinto(f)
    print(b)  # number of bytes downloaded
    f.seek(0)  # rewind the buffer before handing it to pandas
    df = pd.read_excel(f)
    print(df)
Besides, we can also use pyspark to read the Excel file. But we need to add the jar com.crealytics:spark-excel to our environment. For more details, please refer to here and here.
For example
Add the package com.crealytics:spark-excel_2.12:0.13.1 via Maven. Besides, please note that if you use Scala 2.11, please add the package com.crealytics:spark-excel_2.11:0.13.1 instead.
Code
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.<account name>.dfs.core.windows.net", '<account key>')
print("use spark")
df = sqlContext.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load('abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx')
df.show()
From my experience, the following are the basic steps that worked for me in reading the Excel file from ADLS Gen2 in Databricks:
Add the package com.crealytics:spark-excel_2.12:0.13.6 via Maven.

spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)

adlsAccountKeyName --> fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net
adlsAccountKeyValue --> SAS key of your ADLS account

myDataFrame = (spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheetname'!")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "false")
    .option("addColorColumns", "false")
    .option("startColumn", 0)
    .option("endColumn", 99)
    .option("timestampFormat", "dd-MM-yyyy HH:mm:ss")
    .load(FullFilePathExcel)
)