Export a Spark Dataframe (pyspark.pandas.Dataframe) to Excel file from Azure DataBricks
I'm struggling with the export of a pyspark.pandas.Dataframe to an Excel file.
I'm working on an Azure Databricks Notebook with Pyspark. My goal is to read a csv file from an Azure Data Lake Storage container and store it as an Excel file on another ADLS container.
I'm running into many difficulties related to performance and methods. pyspark.pandas.Dataframe has a built-in to_excel method, but with files larger than 50MB the command ends with a time-out error after 1 hour (this seems to be a well-known problem).
Below you can find an example of the code. It ends by saving the file on the DBFS (there are still problems integrating the to_excel method with Azure), and then I move the file to ADLS (a sketch of that move step follows the code).
import pyspark.pandas as ps

# authenticate against the ADLS Gen2 storage account
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", storage_account_key)
# read the csv file from the source container
reference_path = f'abfss://{source_container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}'
df = ps.read_csv(reference_path, index=None)
# export to Excel: the step that times out on files larger than ~50MB
df.to_excel(file_name, sheet_name='sheet')
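The move step afterwards is roughly this (a sketch, assuming to_excel wrote the file to the driver's working directory; dest_container is a placeholder for the target container name):

# sketch: copy the local file to the destination ADLS container
dest_path = f"abfss://{dest_container}@{storage_account_name}.dfs.core.windows.net/{file_name}"
dbutils.fs.cp(f"file:/databricks/driver/{file_name}", dest_path)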
pyspark.pandas.Dataframe is the method suggested by Databricks for working with Dataframes (it replaces Koalas), but I can't find any solution to my problem, except converting the dataframe to a normal pandas one.
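For reference, that pandas workaround is just the following (a minimal sketch; to_pandas collects the whole dataset onto the driver, which is exactly what I'd like to avoid):

pdf = df.to_pandas()  # collects all rows into driver memory
pdf.to_excel(file_name, sheet_name='sheet', index=False)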
Can someone please help me?
Thanks in advance!
UPDATE
Some more information about the whole pipeline.
I have a DataFactory pipeline that reads data from Azure Synapse, processes the data, and stores it as csv files in ADLS. I need DataBricks because DataFactory does not have a native Excel sink connector. I know that I could use Azure Functions or Kubernetes instead, but I started using DataBricks hoping that it was possible...
Hm... it looks like you are reading from and saving to the same file.
Can you change
df.to_excel(file_name, sheet_name='sheet')
to
df.to_excel("anotherfilename.xlsx", sheet_name='sheet')
I've found a solution to the problem with the pyexcelerate package:
from pyexcelerate import Workbook

df = ...  # read your dataframe here

# sheet data: the header row first, then the data rows
values = [df.columns.to_list()] + list(df.values)

sheet_name = 'Sheet'
wb = Workbook()
wb.new_sheet(sheet_name, data=values)
wb.save(file_name)
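Note that pyexcelerate is not preinstalled on Databricks clusters, so it needs installing first, e.g. in a notebook cell:

%pip install pyexcelerate

Also, df.values collects the whole dataset onto the driver, so the dataframe still has to fit in driver memory.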
This way Databricks succeeded in processing a 160MB dataset and exporting it to Excel in 3 minutes.
Let me know if you find a better solution!
You should not convert a big Spark dataframe to pandas because you probably will not be able to allocate that much memory. You can write it as a csv instead, which can then be opened in Excel:
# num_files=1 coalesces the output into a single csv file
df.to_csv(path=file_name, num_files=1)
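If the target is again an ADLS container, the same call can write straight to an abfss:// path (a sketch, assuming the account key is configured as in the question; dest_container is a placeholder). Note that pyspark.pandas treats the path as a directory and writes a single part-*.csv file inside it:

dest_path = f"abfss://{dest_container}@{storage_account_name}.dfs.core.windows.net/output_csv"
df.to_csv(path=dest_path, num_files=1)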