
Reading Excel file from Azure Databricks

I am trying to read an Excel file ( .xlsx ) from Azure Databricks; the file is in ADLS Gen 2.

Example:

srcPathforParquet = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//src.xlsx"

Reading parquet file from the path works fine.

srcparquetDF = spark.read.parquet(srcPathforParquet)

Reading the Excel file from the path throws the error: No such file or directory

srcexcelDF = pd.read_excel(srcPathforExcel, keep_default_na=False, na_values=[''])

As per my repro, an Excel file in ADLS Gen2 cannot be read directly using the storage account access key. When I tried reading the Excel file via its ADLS Gen2 URL, I got the same error message: FileNotFoundError: [Errno 2] No such file or directory: 'abfss://filesystem@chepragen2.dfs.core.windows.net/flightdata/drivers.xlsx' .


Steps to read an Excel file ( .xlsx ) from Azure Databricks when the file is in ADLS Gen 2:

Step 1: Mount the ADLS Gen2 storage account.

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Step 2: Read the Excel file using the mount path.

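With the mount in place, pandas can open the file through Databricks' local /dbfs fuse path. A minimal sketch of the idea, where the mount name and relative path are hypothetical placeholders:

```python
def mounted_path(mount_point: str, relative_path: str) -> str:
    """Translate a Databricks mount point into the /dbfs fuse path
    that local-file APIs such as pandas can open."""
    return "/dbfs" + mount_point.rstrip("/") + "/" + relative_path.lstrip("/")

path = mounted_path("/mnt/excel-mount", "1_Raw/src.xlsx")  # hypothetical names
print(path)  # /dbfs/mnt/excel-mount/1_Raw/src.xlsx

# On a cluster with the mount configured you would then read it directly:
# df = pd.read_excel(path, keep_default_na=False, na_values=[''])
```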

Reference: Azure Databricks - Azure Data Lake Storage Gen2

The method pandas.read_excel does not support wasbs or abfss scheme URLs for accessing the file. For more details, please refer to here

So if you want to access the file with pandas, I suggest you create a SAS token and use the https scheme with the SAS token to access the file, or download the file as a stream and then read it with pandas. Meanwhile, you can also mount the storage account as a filesystem and then access the file, as @CHEEKATLAPRADEEP-MSFT said.

For example

  • Access with a SAS token
  1. Create a SAS token via the Azure portal

  2. Code

pdf=pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)

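The https URL passed to pandas above is just the account endpoint, file path, and SAS token glued together. A small hypothetical helper to assemble it (all names are placeholders):

```python
def sas_url(account: str, file_system: str, path: str, sas_token: str) -> str:
    """Compose the https-scheme URL that pandas.read_excel can fetch:
    https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>"""
    return (f"https://{account}.dfs.core.windows.net/"
            f"{file_system}/{path.lstrip('/')}?{sas_token.lstrip('?')}")

url = sas_url("myaccount", "test", "data/sample.xlsx", "sv=2020-08-04&sig=...")
print(url)
# pdf = pd.read_excel(url)   # as in the example above
```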

  • Download the file as a stream and read the file
  1. Install the packages azure-storage-file-datalake and xlrd with pip in Databricks

  2. Code

import io

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to the ADLS Gen2 account with the account key
service_client = DataLakeServiceClient(
    account_url='https://<account name>.dfs.core.windows.net/',
    credential='<account key>')

file_client = service_client.get_file_client(
    file_system='test', file_path='data/sample.xlsx')

with io.BytesIO() as f:
    # Download the file contents into the in-memory buffer
    downloader = file_client.download_file()
    b = downloader.readinto(f)
    print(b)   # number of bytes downloaded
    f.seek(0)  # rewind the buffer before handing it to pandas
    df = pd.read_excel(f)
    print(df)


Besides, we can also use PySpark to read the Excel file, but we need to add the jar com.crealytics:spark-excel to our environment. For more details, please refer to here and here

For example

  1. Add the package com.crealytics:spark-excel_2.12:0.13.1 via Maven. Besides, please note that if you use Scala 2.11, please add the package com.crealytics:spark-excel_2.11:0.13.1 instead

  2. Code

spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<account name>.dfs.core.windows.net", '<account key>')

print("use spark")
df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .load('abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx')

df.show()


From my experience, the following are the basic steps that worked for me for reading the Excel file from ADLS Gen2 in Databricks:

  • Installed the following library on my Databricks cluster.

com.crealytics:spark-excel_2.12:0.13.6

  • Added the below Spark configuration.

spark.conf.set(adlsAccountKeyName,adlsAccountKeyValue)

adlsAccountKeyName --> fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net
adlsAccountKeyValue --> SAS key of your ADLS account
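As a concrete illustration of the first variable, the conf key is just the account name spliced into a fixed template (the account name below is a hypothetical placeholder):

```python
def adls_account_key_conf(account_name: str) -> str:
    """Build the Spark conf key described above (blob endpoint form)."""
    return f"fs.azure.account.key.{account_name}.blob.core.windows.net"

adlsAccountKeyName = adls_account_key_conf("mydatalakedev")  # hypothetical account
print(adlsAccountKeyName)
# spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)  # as shown above
```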

  • Used the below code to get the Spark dataframe out of my Excel file in ADLS.

myDataFrame = (spark.read.format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheetname'!")
    .option("header", "true")
    .option("treatEmptyValuesAsNulls", "true")
    .option("inferSchema", "false")
    .option("addColorColumns", "false")
    .option("startColumn", 0)
    .option("endColumn", 99)
    .option("timestampFormat", "dd-MM-yyyy HH:mm:ss")
    .load(FullFilePathExcel)
)
