
Renaming spark output csv in azure blob storage

I have a Databricks notebook setup that works as follows:

  • pyspark connection details to the Blob storage account
  • Read the file through a Spark dataframe
  • Convert to a pandas Df
  • Data modelling on the pandas Df
  • Convert back to a Spark Df
  • Write to blob storage as a single file

My problem is that you cannot name the output file, and I need a static csv filename.

Is there a way to rename this in pyspark?

## Blob Storage account information
storage_account_name = ""
storage_account_access_key = ""

## File location and File type
file_location = "path/.blob.core.windows.net/Databricks_Files/input"
file_location_new = "path/.blob.core.windows.net/Databricks_Files/out"
file_type = "csv"

## Connection string to connect to blob storage
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)
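
For context, the read/convert steps from the list above look roughly like the following. This is a minimal sketch: the wasbs:// path layout, the <container> placeholder, and the header option are assumptions, not code from the question.

# Read the input file through a Spark dataframe (container name assumed)
input_path = "wasbs://<container>@" + storage_account_name + ".blob.core.windows.net/Databricks_Files/input"
df_spark = spark.read.format(file_type).option("header", "true").load(input_path)

# Convert to a pandas Df for the data modelling step
df_pandas = df_spark.toPandas()

# ... data modelling on df_pandas ...

# Convert back to a Spark Df before writing to blob storage
dfspark = spark.createDataFrame(df_pandas)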

This is followed by outputting the file after data transformation:

dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(file_location_new)

The file is then written as "part-00000-tid-336943946930983.....csv", whereas the goal is to have "Output.csv".

Another approach I looked at was just recreating this in python, but I have not yet come across anything in the documentation on how to output the file back to blob storage.

I know the method to retrieve from Blob storage is .get_blob_to_path, via microsoft.docs.
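
For that pure-python route, the upload counterpart in the same legacy azure-storage-blob SDK (the one that exposes get_blob_to_path on BlockBlobService) is create_blob_from_path. A minimal sketch, with the container name and local file path as assumed placeholders:

from azure.storage.blob import BlockBlobService

block_blob_service = BlockBlobService(
  account_name=storage_account_name,
  account_key=storage_account_access_key)

# Upload a locally written csv back to blob storage under a fixed name
block_blob_service.create_blob_from_path(
  "<container>",                        # assumed container name
  "Databricks_Files/out/Output.csv",    # blob name, i.e. the static filename
  "/tmp/Output.csv")                    # assumed local path written beforehand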

Any help here is greatly appreciated.

Hadoop/Spark will output the compute result of each partition into its own file in parallel, so you will see many part-<number>-.... files in an HDFS output path like the Output/ you named.

If you want to output all results of a computation into one file, you can merge them via the command hadoop fs -getmerge /output1/part* /output2/Output.csv, or set the number of reduce processes to 1, for example by using the coalesce(1) function.

So in your scenario, you only need to make sure the coalesce function is called on the DataFrame before the write/save functions, as below. (Note that coalesce is a DataFrame method, not a DataFrameWriter method, so it has to come before .write.)

dfspark.coalesce(1).write.format('com.databricks.spark.csv') \
  .mode('overwrite').option("header", "true").save(file_location_new)

Even so, the single output file still carries the generated part-... name; getting a fixed name like "Output.csv" requires a rename step, as in the answer below.

The coalesce and repartition functions alone do not help with saving the dataframe into a single, normally named file.

I ended up just renaming the single csv file and deleting the folder with the log files:

import os  # dbutils is available by default in Databricks notebooks

def save_csv(df, location, filename):
  # write into a temporary folder first; Spark creates a directory of part files
  outputPath = os.path.join(location, filename + '_temp.csv')

  df.repartition(1).write.format("com.databricks.spark.csv") \
    .mode("overwrite").option("header", "true") \
    .option("delimiter", "\t").save(outputPath)

  # list the temp folder through the local /dbfs mount
  csv_files = os.listdir(os.path.join('/dbfs', outputPath))

  # move the part-....csv temp file into the normally named one
  for file in csv_files:
    if file[-4:] == '.csv':
      dbutils.fs.mv(os.path.join(outputPath, file), os.path.join(location, filename))

  # delete the temp folder together with the log files
  dbutils.fs.rm(outputPath, True)

# using save_csv
save_csv_location = 'mnt/.....'
save_csv(df, save_csv_location, 'name.csv')
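
As a quick sanity check (assuming the mount path above), listing the target folder should now show the renamed file instead of a part-... folder:

# verify that the renamed csv is in place
display(dbutils.fs.ls(save_csv_location))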
