Renaming spark output csv in azure blob storage
I have a Databricks notebook set up that works as follows;
My problem is that you cannot name the output file, and I need a static CSV filename.
Is there a way to rename this in PySpark?
## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""
## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"
## Connection string to connect to blob storage
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
Followed by outputting the file after data transformation:
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
The file is then written as "part-00000-tid-336943946930983.....csv", whereas the goal is to have "Output.csv".
Another approach I looked at was recreating this in plain Python, but I have not yet come across anything in the documentation about how to output the file back to Blob storage.
I know the method to retrieve from Blob storage is .get_blob_to_path via microsoft.docs.
Any help here is greatly appreciated.
Hadoop/Spark writes the compute result of each partition in parallel into its own file, so you will see many part-<number>-.... files in an HDFS output path like Output/ named by you.
If you want to output all results of a computation into one file, you can merge them via the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or reduce the number of output partitions to 1, e.g. by using the coalesce(1) function.
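For local files, the effect of getmerge can be sketched in a few lines of plain Python (the helper name and the sample part files below are made up for illustration; the `part-*` glob also skips Spark's `_SUCCESS` marker and hidden `.crc` files):

```python
import glob
import os
import tempfile

def getmerge(input_dir, output_file):
    # Concatenate all part-* files in input_dir into one file,
    # mimicking `hadoop fs -getmerge` for a local directory.
    parts = sorted(glob.glob(os.path.join(input_dir, "part-*")))
    with open(output_file, "w") as out:
        for part in parts:
            with open(part) as f:
                out.write(f.read())

# demo with two fake part files standing in for Spark output
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "part-00000"), "w") as f:
    f.write("a,b\n1,2\n")
with open(os.path.join(tmp, "part-00001"), "w") as f:
    f.write("3,4\n")
merged = os.path.join(tmp, "Output.csv")
getmerge(tmp, merged)
```

Note that this simple concatenation assumes the part files were written without headers; with header="true", every part file gets its own header row and they would all end up in the merged file.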
So in your scenario, you only need to make sure coalesce(1) is called on the DataFrame before write (coalesce is a DataFrame method, not a DataFrameWriter method), as below.
dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(file_location_new)
The coalesce and repartition functions do not help with saving the dataframe into one normally named file.
I ended up just renaming the single CSV file and deleting the folder with the log files:
import os

def save_csv(df, location, filename):
    # write to a temporary folder first; Spark names the part file itself
    outputPath = os.path.join(location, filename + '_temp.csv')
    df.repartition(1).write.format("com.databricks.spark.csv") \
        .mode("overwrite").options(header="true", inferSchema="true") \
        .option("delimiter", "\t").save(outputPath)
    csv_files = os.listdir(os.path.join('/dbfs', outputPath))
    # moving the part-like temp csv file into a normally named one
    for file in csv_files:
        if file[-4:] == '.csv':
            dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))
    dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
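Since dbutils is only available inside Databricks, the move-and-delete pattern above can be sketched locally with the standard library (promote_part_file and all the paths below are hypothetical stand-ins; shutil.move and shutil.rmtree play the roles of dbutils.fs.mv and dbutils.fs.rm):

```python
import os
import shutil
import tempfile

def promote_part_file(output_dir, final_path):
    # Find the single part-*.csv Spark wrote into output_dir,
    # move it to final_path, and delete the directory that holds
    # the _SUCCESS marker and other log files.
    part = next(f for f in os.listdir(output_dir) if f.endswith(".csv"))
    shutil.move(os.path.join(output_dir, part), final_path)
    shutil.rmtree(output_dir)

# demo with a fake Spark output directory
tmp = tempfile.mkdtemp()
out_dir = os.path.join(tmp, "name.csv_temp.csv")
os.mkdir(out_dir)
open(os.path.join(out_dir, "_SUCCESS"), "w").close()
with open(os.path.join(out_dir, "part-00000-tid-123.csv"), "w") as f:
    f.write("a\tb\n1\t2\n")
final = os.path.join(tmp, "name.csv")
promote_part_file(out_dir, final)
```

The same logic applies on DBFS: write to a temp folder, promote the part file to the desired name, then remove the temp folder.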