
Writing dataframe to Azure Blob Storage using Databricks generates an empty file

On Databricks, I have a daily job that writes a data frame to a parquet file on Azure Blob Storage. It creates one file and one folder with the same name. However, the file created at the path I specified is empty, and the folder with the same name contains a parquet file with a "random name" that stores the dataframe's content.

[Screenshots: empty_parquet_file, random_name_parquet_file]

I wanted the parquet file to be saved as agg_DF.parquet so that the file name stays constant, but agg_DF.parquet is currently empty. It looks like I'll need to go inside the folder and grab whichever file has a name ending in .parquet, and I'd really appreciate any help on how to do this. Or is there a better way to do it on Databricks so that agg_DF.parquet is not empty when it's saved to Blob Storage?

Here's the code I have on Databricks:

OUTPUT_PARQUET_FILENAME = 'agg_DF.parquet'
container_name = 'xxxxx'
account_name = 'yyyy'
output_path = f"wasbs://{container_name}@{account_name}.blob.core.windows.net/{OUTPUT_PARQUET_FILENAME}"

spark_DF = spark.createDataFrame(agg_df).repartition('blob_date')
spark_DF.write.parquet(output_path, mode="overwrite")

Spark processes data in a distributed fashion: the data is split into partitions across executors, and each partition is written in parallel as its own part file. So the output always contains part files whose names you cannot control; you can only control the name of the output folder.

So overall, the process you are describing is what I would recommend: use repartition(1) or coalesce(1) to get a single file inside the folder, find the .parquet file in that folder, move it with dbutils, and then delete the folder.

Below is the code I use to do the same thing:

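# Write a single part file into a temporary folder
df.coalesce(1).write.parquet("/temp_path/", mode="overwrite")
# List the part file(s) Spark wrote; their names are generated by Spark
z = [x.name for x in dbutils.fs.ls("/temp_path/") if x.name.endswith(".parquet")]
print(z)
if len(z) == 1:
    print(z[0])
    # Move the part file to its fixed final name
    dbutils.fs.mv("/temp_path/" + z[0], "/final_path/filename.parquet")
# Delete the temporary folder and whatever is left in it
dbutils.fs.rm("/temp_path/", True)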

Replace temp_path and final_path/filename.parquet with your own paths.
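The move-and-clean-up steps can also be wrapped in a small helper. This is only a sketch, not a definitive implementation: the function name promote_single_part_file and its parameters are hypothetical, and it assumes the fs argument behaves like dbutils.fs (ls, mv, rm) and that Spark's data files start with "part-". Unlike the snippet above, it raises an error and leaves the temporary folder in place if it does not find exactly one part file.

```python
def promote_single_part_file(fs, temp_dir, final_path, suffix=".parquet"):
    """Move the single part file in temp_dir to final_path, then delete temp_dir.

    fs is assumed to behave like dbutils.fs: ls() returns entries with
    .name and .path, mv(src, dst) moves a file, rm(path, True) deletes
    a folder recursively. (Hypothetical helper, for illustration only.)
    """
    # Keep only Spark data files; skip marker files such as _SUCCESS.
    parts = [
        entry.path
        for entry in fs.ls(temp_dir)
        if entry.name.startswith("part-") and entry.name.endswith(suffix)
    ]
    if len(parts) != 1:
        # Do not delete anything if the folder is not what we expect.
        raise ValueError(
            f"expected exactly one part file in {temp_dir}, found {len(parts)}"
        )
    fs.mv(parts[0], final_path)  # give the file its fixed name
    fs.rm(temp_dir, True)        # remove the temporary folder
    return final_path


# Possible usage in a Databricks notebook (paths are placeholders):
#   spark_DF.coalesce(1).write.parquet("/temp_path/", mode="overwrite")
#   promote_single_part_file(dbutils.fs, "/temp_path/", "/final_path/agg_DF.parquet")
```

If a file or folder named agg_DF.parquet is already left over from earlier runs at the final path, it may need to be removed before the move.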

