
Databricks - pyspark.pandas.DataFrame.to_excel does not recognize abfss protocol

I want to save a DataFrame (pyspark.pandas.DataFrame) as an Excel file on Azure Data Lake Gen2 using Azure Databricks in Python. I switched to pyspark.pandas.DataFrame because it has been the recommended one since Spark 3.2.

There is a method called to_excel (see the docs) that should allow saving a file to a container in ADL, but I am running into problems with the file system access protocol. I already use the to_csv and to_parquet methods of the same class with abfss paths, and I would like to do the same for Excel.

So when I try to save it using:

import pyspark.pandas as ps
# Omit the df initialization
file_name = "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx"
sheet = "test"
df.to_excel(file_name, sheet_name=sheet)

I get this error from fsspec:

ValueError: Protocol not known: abfss

Can someone please help me?

Thanks in advance!

The pandas DataFrame does not support the abfss protocol. On Databricks, it seems you can only read and write files on abfss through a Spark DataFrame. So the solution is to write the file locally and then move it to abfss manually. See this answer here.
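A minimal sketch of that "write locally, then move" approach, under stated assumptions: the helper name save_excel_via_local and its copy_fn parameter are hypothetical and only illustrate the idea. On Databricks, copy_fn would typically be dbutils.fs.cp, which accepts a file: URI for a path on the driver and an abfss URI as the destination, provided the cluster already has access to the storage account.

```python
import os
import tempfile


def save_excel_via_local(frame, target_uri, copy_fn, sheet_name="Sheet1"):
    """Write `frame` to a local temp file, then copy it to `target_uri`.

    `frame` is any object with a pandas-style to_excel(path, sheet_name=..., index=...)
    method. `copy_fn(source_uri, target_uri)` performs the upload; it is a
    hypothetical hook so that the sketch does not depend on Databricks.
    """
    local_dir = tempfile.mkdtemp()
    # Reuse the file name from the target URI for the local copy.
    local_path = os.path.join(local_dir, os.path.basename(target_uri))
    frame.to_excel(local_path, sheet_name=sheet_name, index=False)
    # "file:" marks the source as a path on the driver's local disk.
    copy_fn("file:" + local_path, target_uri)
    return local_path


# Assumed usage in a Databricks notebook, where `df` is a pyspark.pandas DataFrame:
# save_excel_via_local(
#     df.to_pandas(),
#     "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx",
#     dbutils.fs.cp,
#     sheet_name="test",
# )
```

Note that to_pandas() collects the whole DataFrame on the driver, so this only suits data that fits in driver memory.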

You cannot save it directly, but you can write it to a temporary location and then move it to your directory. My code is:

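import pandas as pd

# `dataset` is an existing Spark DataFrame; take the first 10 rows and convert them to pandas.
output = dataset.limit(10).toPandas()

# Row at which to start writing in the sheet (not defined in the original snippet; 0 is assumed here).
row_number = 0

# Create a pandas Excel writer using XlsxWriter as the engine.
# pandas creates the workbook and the 'top_rows' sheet itself, so no separate
# xlsxwriter.Workbook is needed. The file is written to the driver's local working directory.
writer = pd.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')
output.to_excel(writer, sheet_name='top_rows', startrow=row_number)

# Close the writer to write the workbook to disk (writer.save() was removed in pandas 2.0).
writer.close()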

After the writer has been saved, run the code below, which simply moves the file from its temporary location to your designated location.

%sh
sudo mv data_checks_output.xlsx /dbfs/mnt/fpmount/
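The same move can also be sketched in Python instead of a %sh cell. This assumes the mount is visible on the driver through the /dbfs path, as in the command above; the helper name move_to_mount is hypothetical.

```python
import os
import shutil


def move_to_mount(local_path, mount_dir):
    """Move a local file into `mount_dir` and return its new path."""
    # Fail early instead of silently creating a plain directory where the mount should be.
    if not os.path.isdir(mount_dir):
        raise FileNotFoundError("mount directory not found: " + mount_dir)
    destination = os.path.join(mount_dir, os.path.basename(local_path))
    return shutil.move(local_path, destination)


# Assumed usage on Databricks, matching the shell command above:
# move_to_mount("data_checks_output.xlsx", "/dbfs/mnt/fpmount/")
```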
