Write a CSV file into Azure Blob Storage
I am trying to use PySpark to analyze my data in Databricks notebooks. Blob storage has been mounted on the Databricks cluster, and after the analysis I would like to write the CSV back into blob storage. Since PySpark works in a distributed fashion, the CSV file is broken into small part files and written to blob storage. How can I overcome this and write the output as a single CSV file on the blob when the analysis is done with PySpark? Thanks.
Also, please let me know whether this can be overcome if we move to Azure Data Lake Storage Gen2. Is it more optimized, and can the CSV be written as one single file there? As I mentioned earlier, the analysis is done in a Databricks notebook with PySpark. Thanks.
Do you really want a single file? If yes, the only way to overcome it is by merging all the small CSV files into a single CSV file. You can make use of a map function on the Databricks cluster to merge them, or you can use a background job to do the same.
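As a minimal sketch of the background-job approach, the hypothetical helper below (the function name and paths are illustrative, not from the original post) concatenates the `part-*.csv` files that Spark leaves in its output directory into one file; on Databricks, a mounted blob path such as `/dbfs/mnt/...` could be passed as `output_dir`. It assumes every part file was written with the same header row:

```python
import glob
import os

def merge_spark_csv(output_dir: str, merged_path: str) -> None:
    """Merge the part-*.csv files Spark writes into a single CSV.

    Assumes each part file starts with the same header row, e.g. the
    output of df.write.option("header", True).csv(output_dir), so the
    header is kept from the first part and skipped in the rest.
    """
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    with open(merged_path, "w", newline="") as out:
        for i, part in enumerate(part_files):
            with open(part) as f:
                lines = f.readlines()
            # Keep the header only from the first part file.
            out.writelines(lines if i == 0 else lines[1:])
```

Alternatively, within Spark itself, `df.coalesce(1).write.csv(...)` produces a single part file directly, at the cost of funneling all the data through one task, which can be slow or exhaust memory for large datasets.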
Have a look here: https://forums.databricks.com/questions/14851/how-to-concat-lots-of-1mb-cvs-files-in-pyspark.html