
How to save a PySpark dataframe as a CSV with custom file name?

Here is the spark DataFrame I want to save as a csv.

type(MyDataFrame)
--Output: <class 'pyspark.sql.dataframe.DataFrame'>

To save this as a CSV, I have the following code:

MyDataFrame.write.csv(csv_path, mode = 'overwrite', header = 'true')

When I save this, the file name is something like this:

part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv

Is there a way I can give this a custom name while saving it? Like "MyDataFrame.csv"

No. That's how Spark works (at least for now). MyDataFrame.csv would be a directory name, and under that directory you'd have multiple files named like part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c000.csv, part-0000-766dfdf-78fg-aa44-as3434rdfgfg-c001.csv, etc.

It's not recommended, but if your data is small enough (arguably, what counts as "small enough" here), you can always convert it to pandas and save it to a single CSV file with any name you want.
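A minimal sketch of that pandas route. With a real Spark DataFrame you would call `df.toPandas()` to collect the data onto the driver; here a plain pandas frame stands in for that collected result so the snippet runs without a cluster:

```python
import pandas as pd

# Stand-in for df.toPandas() — in practice this would be the collected
# PySpark DataFrame, which must fit in driver memory.
pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# pandas writes a single file with exactly the name you give it.
pdf.to_csv("MyDataFrame.csv", index=False)
```

Note `index=False`, so pandas does not prepend its row index as an extra CSV column.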

I have the same requirement. You can write to one path and then rename the resulting file. This is my solution:

def write_to_hdfs_specify_path(df, spark, hdfs_path, file_name):
    """
    :param df: DataFrame you want to save
    :param spark: SparkSession
    :param hdfs_path: target directory (must not already exist)
    :param file_name: desired CSV file name
    :return: True if the rename succeeded
    """
    sc = spark.sparkContext
    # Access the Hadoop FileSystem API through the Py4J gateway
    Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
    FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
    Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
    # coalesce(1) produces a single part-* file inside hdfs_path
    df.coalesce(1).write.option("header", True).option("delimiter", "|").option("compression", "none").csv(hdfs_path)
    fs = FileSystem.get(Configuration())
    # Locate the generated part file and rename it to the desired name
    file = fs.globStatus(Path("%s/part*" % hdfs_path))[0].getPath().getName()
    full_path = "%s/%s" % (hdfs_path, file_name)
    result = fs.rename(Path("%s/%s" % (hdfs_path, file)), Path(full_path))
    return result

How big is the file? One possibility would be to convert it to pandas and save it.

.coalesce(1) guarantees that there is only one file, but it does not guarantee the file name. Save to a temporary directory, then rename and copy the file (using dbutils.fs functions if you use Databricks, or FileUtil from the Hadoop API).
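The temp-directory-then-rename pattern above can be sketched on a local filesystem using only the standard library. With Spark you would first run `df.coalesce(1).write.csv(tmp_dir, header=True)`, which leaves a single opaque part-*.csv inside tmp_dir; here that output is simulated so the rename step is runnable without a cluster:

```python
import glob
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()

# Simulate Spark's output: one part file with an opaque, generated name.
with open(os.path.join(tmp_dir, "part-00000-abc-c000.csv"), "w") as f:
    f.write("id,name\n1,a\n")

# Find the single part file and move it to the name we actually want.
part_file = glob.glob(os.path.join(tmp_dir, "part-*.csv"))[0]
target = os.path.join(tmp_dir, "MyDataFrame.csv")
shutil.move(part_file, target)
```

On HDFS or DBFS the same idea applies, but the glob and move go through the Hadoop FileSystem API or dbutils.fs rather than the local filesystem.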
