Renaming spark output csv in azure blob storage
I have a Databricks notebook set up that works as follows;
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
Followed by outputting the file after data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv", whereas the goal is to have "Output.csv".
Another approach I looked at was recreating this in plain Python, but I have not yet come across anything in the documentation about how to output the file back to Blob storage.
I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the compute result of each partition in parallel into its own file, so you will see many part-<number>-.... files in an HDFS output path like Output/ named by you.
If you want to output all results of a computation into one file, you can merge them via the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of output partitions to 1, e.g. by using the coalesce(1) function.
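For local files, the effect of getmerge can be sketched in a few lines of plain Python (the helper name and the sample part files below are made up for illustration; the `part-*` glob also skips Spark's `_SUCCESS` marker and hidden `.crc` files):

```python
import glob
import os
import tempfile

def getmerge(input_dir, output_file):
    # Concatenate all part-* files in input_dir into one file,
    # mimicking `hadoop fs -getmerge` for a local directory.
    parts = sorted(glob.glob(os.path.join(input_dir, "part-*")))
    with open(output_file, "w") as out:
        for part in parts:
            with open(part) as f:
                out.write(f.read())

# demo with two fake part files standing in for Spark output
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "part-00000"), "w") as f:
    f.write("a,b\n1,2\n")
with open(os.path.join(tmp, "part-00001"), "w") as f:
    f.write("3,4\n")
merged = os.path.join(tmp, "Output.csv")
getmerge(tmp, merged)
```

Note that this simple concatenation assumes the part files were written without headers; with header="true", every part file gets its own header row and they would all end up in the merged file.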
So in your scenario, you only need to make sure coalesce(1) is called on the DataFrame before write (coalesce is a DataFrame method, not a DataFrameWriter method), as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
The coalesce and repartition functions do not help with saving the dataframe into one normally named file.
I ended up just renaming the single CSV file and deleting the folder with the log files:
import os

def save_csv(df, location, filename):
    # write to a temporary folder first; Spark names the part file itself
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv") \
        .mode("overwrite").options(header="true", inferSchema="true") \
        .option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # moving the part-like temp csv file into a normally named one
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
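Since dbutils is only available inside Databricks, the move-and-delete pattern above can be sketched locally with the standard library (promote_part_file and all the paths below are hypothetical stand-ins; shutil.move and shutil.rmtree play the roles of dbutils.fs.mv and dbutils.fs.rm):

```python
import os
import shutil
import tempfile

def promote_part_file(output_dir, final_path):
    # Find the single part-*.csv Spark wrote into output_dir,
    # move it to final_path, and delete the directory that holds
    # the _SUCCESS marker and other log files.
    part = next(f for f in os.listdir(output_dir) if f.endswith(".csv"))
    shutil.move(os.path.join(output_dir, part), final_path)
    shutil.rmtree(output_dir)

# demo with a fake Spark output directory
tmp = tempfile.mkdtemp()
out_dir = os.path.join(tmp, "name.csv_temp.csv")
os.mkdir(out_dir)
open(os.path.join(out_dir, "_SUCCESS"), "w").close()
with open(os.path.join(out_dir, "part-00000-tid-123.csv"), "w") as f:
    f.write("a\tb\n1\t2\n")
final = os.path.join(tmp, "name.csv")
promote_part_file(out_dir, final)
```

The same logic applies on DBFS: write to a temp folder, promote the part file to the desired name, then remove the temp folder.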