
Export a Spark Dataframe (pyspark.pandas.Dataframe) to an Excel file from Azure Databricks

I'm struggling with the export of a pyspark.pandas.Dataframe to an Excel file.

I'm working in an Azure Databricks notebook with PySpark. My goal is to read a CSV file from one Azure Data Lake Storage container and store it as an Excel file in another ADLS container.

I'm running into many difficulties related to performance and available methods. pyspark.pandas.Dataframe has a built-in to_excel method, but with files larger than 50MB the command ends with a time-out error after 1 hour (this seems to be a well-known problem).

Below is an example of the code. It ends by saving the file on DBFS (there are still problems integrating the to_excel method with Azure), and then I move the file to ADLS.

import pyspark.pandas as ps
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", storage_account_key)

reference_path = f'abfss://{source_container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}'

df = ps.read_csv(reference_path, index_col=None)

df.to_excel(file_name, sheet_name='sheet')

pyspark.pandas.Dataframe is the method suggested by Databricks for working with Dataframes (it replaces Koalas), but I can't find any solution to my problem, except converting the dataframe to a normal pandas one.

Can someone please help me?

Thanks in advance!

UPDATE

Some more information on the whole pipeline.

I have a Data Factory pipeline that reads data from Azure Synapse, processes it, and stores it as CSV files in ADLS. I need Databricks because Data Factory does not have a native Excel sink connector. I know that I could use Azure Functions or Kubernetes instead, but I started with Databricks hoping that it was possible...

Hm.. it looks like you are reading from the same file and saving to the same file.

Can you change

df.to_excel(file_name, sheet_name='sheet')

to

df.to_excel("anotherfilename.xlsx", sheet_name='sheet')

I've found a solution to the problem with the pyexcelerate package:

from pyexcelerate import Workbook

df = ...  # read your dataframe here

# new_sheet expects row-oriented data: the header row first, then the data rows,
# so the column names must be wrapped in their own list
values = [df.columns.to_list()] + list(df.values)
sheet_name = 'Sheet'

wb = Workbook()
wb.new_sheet(sheet_name, data=values)
wb.save(file_name)

This way, Databricks succeeds in processing a 160MB dataset and exporting it to Excel in 3 minutes.
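For reference, the list passed to new_sheet must be row-oriented: one list for the header row, followed by one list per data row. Here is a minimal sketch of that construction using plain pandas (the dataframe is a toy stand-in, and pyexcelerate itself is left out so the snippet stays self-contained):

```python
import pandas as pd

# Toy stand-in for the real dataset read from ADLS.
df = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# pyexcelerate's Workbook.new_sheet(data=...) iterates over rows, so the
# header must be its own list, followed by the data rows.
values = [df.columns.to_list()] + df.values.tolist()

print(values[0])  # header row: ['id', 'value']
print(len(values))  # 1 header row + 3 data rows = 4
```

If the header names were concatenated directly onto the data rows instead of being wrapped in a list, each column name would be treated as its own row.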

Let me know if you find a better solution!

You should not convert a big Spark dataframe to pandas, because you probably will not be able to allocate that much memory. You can write it as a CSV instead, and it will still open in Excel:

df.to_csv(path=file_name, num_files=1)
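The same idea sketched with plain pandas, writing to an in-memory buffer purely for illustration (in the pyspark.pandas call above, num_files controls how many part files Spark writes, so num_files=1 keeps the output in a single file):

```python
import io
import pandas as pd

# Toy stand-in for the dataframe read from ADLS.
df = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})

# CSV sidesteps Excel's size limits and slow writers; Excel can still
# open the resulting file directly.
buf = io.StringIO()
df.to_csv(buf, index=False)

print(buf.getvalue())  # header line followed by one line per row
```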

