
Save only the required CSV file using PySpark

I am quite new to PySpark. I am trying to read and then save a CSV file using Azure Databricks.

After saving the file I see many other files such as "_committed", "_started" and "_SUCCESS", and finally the CSV file itself with a totally different name.

I have already tried DataFrame repartition(1) and coalesce(1), but that only handles the case where Spark splits the CSV into multiple partitions; the extra files and the generated file name remain. Is there anything that can be done using PySpark?
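For reference, a write along these lines reproduces that layout (the paths are placeholders, and spark is the SparkSession Databricks provides):

# Read a CSV and write it back out; the "dbfs:/..." paths are placeholders.
df = spark.read.option("header", "true").csv("dbfs:/input/data.csv")
df.write.mode("overwrite").option("header", "true").csv("dbfs:/output/data_out")
# The output is a directory containing part-0000*.csv plus _SUCCESS,
# _committed_* and _started_* marker files.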

You can do the following:

df.toPandas().to_csv("path/to/file.csv")

This will create a single CSV file, as you expect.
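Note that toPandas() collects the entire DataFrame onto the driver, so this approach only works when the data fits in driver memory.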

Those are default log files created when PySpark saves the output; the writer itself cannot eliminate them. Using coalesce(1) you can at least save the data as a single part file instead of multiple partitions (see the sketch below).
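A minimal sketch of that approach on Databricks, assuming dbutils is available and using placeholder paths: write with coalesce(1) into a temporary directory, then copy the single part file out under the name you want and drop the directory with its marker files.

# A minimal sketch, assuming a Databricks notebook where spark and dbutils exist.
tmp_dir = "dbfs:/tmp/my_output"          # directory Spark writes into (placeholder)
final_path = "dbfs:/output/result.csv"   # the single file we actually want (placeholder)

# coalesce(1) forces one part file, but Spark still writes a directory
# containing that part file plus the _SUCCESS/_committed/_started markers.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_dir)

# Copy the single part file out under a clean name, then remove the
# temporary directory together with its marker files.
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, final_path)
dbutils.fs.rm(tmp_dir, recurse=True)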
