
Reading Excel file from Azure Databricks

I am trying to read an Excel file (.xlsx) from Azure Databricks; the file is in ADLS Gen 2.

Example:

srcPathforParquet = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//abc.parquet"
srcPathforExcel = "wasbs://hyxxxx@xxxxdatalakedev.blob.core.windows.net//1_Raw//src.xlsx"

Reading the Parquet file from the path works fine.

srcparquetDF = spark.read.parquet(srcPathforParquet)

Reading the Excel file from the path throws an error: No such file or directory.

srcexcelDF = pd.read_excel(srcPathforExcel , keep_default_na=False, na_values=[''])

As per my repro, an Excel file in ADLS Gen2 cannot be read directly using the storage account access key. When I tried reading the Excel file via its ADLS Gen2 URL, I got the same error message: FileNotFoundError: [Errno 2] No such file or directory: 'abfss://filesystem@chepragen2.dfs.core.windows.net/flightdata/drivers.xlsx'


Steps to read an Excel file (.xlsx) from Azure Databricks when the file is in ADLS Gen 2:

Step 1: Mount the ADLS Gen2 storage account.

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

Step 2: Read the Excel file using the mount path.

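The screenshot that originally illustrated this step is lost. As a hedged sketch (the helper function and the mount/path names below are illustrative, not from the original answer), once the storage is mounted, non-Spark libraries such as pandas can reach the file through the cluster's local /dbfs path:

```python
def dbfs_to_local(path: str) -> str:
    """Translate a DBFS mount path (e.g. /mnt/<mount-name>/...) into the
    local file API path that pandas and other non-Spark libraries can open."""
    return "/dbfs" + path if path.startswith("/mnt/") else path

local_path = dbfs_to_local("/mnt/<mount-name>/1_Raw/src.xlsx")
print(local_path)  # -> /dbfs/mnt/<mount-name>/1_Raw/src.xlsx

# On a Databricks cluster this would then work:
# import pandas as pd
# df = pd.read_excel(local_path, keep_default_na=False, na_values=[''])
```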

Reference: Azure Databricks - Azure Data Lake Storage Gen2

The method pandas.read_excel does not support wasbs or abfss scheme URLs for accessing the file. For more details, please refer to here.

So if you want to access the file with pandas, I suggest you create a SAS token and use the https scheme with the SAS token to access the file, or download the file as a stream and then read it with pandas. Alternatively, you can also mount the storage account as a filesystem and then access the file as @CHEEKATLAPRADEEP-MSFT described.

For example:

  • Access with a SAS token
  1. Create a SAS token via the Azure portal.

  2. Code:

pdf=pd.read_excel('https://<account name>.dfs.core.windows.net/<file system>/<path>?<sas token>')
print(pdf)
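As a small illustrative helper (not part of the original answer; the function name is made up), the https-plus-SAS URL above can be assembled like this:

```python
def sas_url(account: str, file_system: str, path: str, sas_token: str) -> str:
    """Build the https URL with the SAS token as the query string,
    in the shape pandas.read_excel can open directly."""
    token = sas_token.lstrip('?')  # tolerate tokens copied with a leading '?'
    return f"https://{account}.dfs.core.windows.net/{file_system}/{path.lstrip('/')}?{token}"

print(sas_url("<account name>", "<file system>", "<path>", "<sas token>"))
```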


  • Download the file as a stream and read the file
  1. Install the packages azure-storage-file-datalake and xlrd with pip in Databricks (note: xlrd 2.x no longer reads .xlsx, so with newer pandas versions install openpyxl instead).

  2. Code:

import io

import pandas as pd
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(account_url='https://<account name>.dfs.core.windows.net/', credential='<account key>')

file_client = service_client.get_file_client(file_system='test', file_path='data/sample.xlsx')
with io.BytesIO() as f:
  downloader = file_client.download_file()
  b = downloader.readinto(f)  # number of bytes written into the buffer
  print(b)
  f.seek(0)  # rewind the buffer before handing it to pandas
  df = pd.read_excel(f)
  print(df)
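One pitfall of the download-as-stream pattern above is the buffer position: readinto() leaves the BytesIO cursor at the end of the written data, so the buffer must be rewound with seek(0) before pandas reads from it. A stdlib-only demonstration of that behaviour (no Azure connection needed):

```python
import io

buf = io.BytesIO()
buf.write(b"PK\x03\x04 fake xlsx bytes")  # stand-in for the downloaded payload
print(buf.tell())       # cursor sits at the end of the written data
print(buf.read())       # b'' -- reading from the end yields nothing
buf.seek(0)             # rewind
print(len(buf.read()))  # the full payload is readable again
```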


Besides, we can also use PySpark to read the Excel file, but we need to add the jar com.crealytics:spark-excel to our environment. For more details, please refer to here and here.

For example:

  1. Add the package com.crealytics:spark-excel_2.12:0.13.1 via Maven. Note that if you use Scala 2.11, add the package com.crealytics:spark-excel_2.11:0.13.1 instead.

  2. Code:

spark._jsc.hadoopConfiguration().set("fs.azure.account.key.<account name>.dfs.core.windows.net",'<account key>')

print("use spark")
df=sqlContext.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .load('abfss://test@testadls05.dfs.core.windows.net/data/sample.xlsx')

df.show()


From my experience, the following are the basic steps that worked for me for reading an Excel file from ADLS Gen2 in Databricks:

  • Installed the following library on my Databricks cluster:

com.crealytics:spark-excel_2.12:0.13.6

  • Added the below Spark configuration:

spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)

adlsAccountKeyName --> fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net
adlsAccountKeyValue --> SAS key of your ADLS account
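Spelled out with placeholder values (illustrative only; substitute your own account name and key), the two settings look like:

```python
# Placeholder values -- replace with your own storage account name and key.
adlsAccountKeyName = "fs.azure.account.key.<YOUR_ADLS_ACCOUNT_NAME>.blob.core.windows.net"
adlsAccountKeyValue = "<sas-or-account-key>"

# On a Databricks cluster this would then be applied with:
# spark.conf.set(adlsAccountKeyName, adlsAccountKeyValue)
print(adlsAccountKeyName)
```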

  • Used the below code to get a Spark dataframe out of my Excel file in ADLS:

myDataFrame = (spark.read.format("com.crealytics.spark.excel")
  .option("dataAddress", "'Sheetname'!")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("inferSchema", "false")
  .option("addColorColumns", "false")
  .option("startColumn", 0)
  .option("endColumn", 99)
  .option("timestampFormat", "dd-MM-yyyy HH:mm:ss")
  .load(FullFilePathExcel)
)
