
Writing a big Spark DataFrame into a CSV file

I'm using Spark 2.3 and I need to save a Spark DataFrame to a CSV file, and I'm looking for a better way to do it. Looking over related/similar questions, I found this one, but I need something more specific:

If the DataFrame is too big, how can I avoid using Pandas? I used the toCSV() function (code below) and it produced:

Out Of Memory error (could not allocate memory).

Is directly writing to a csv using file I/O a better way? Can it preserve the separators?

Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file, and when the files are merged, there will be headers in the middle. Am I wrong?

Is using Spark write and then Hadoop getmerge better than using coalesce in terms of performance?

import os

import pandas as pd

# wfu (random filename helper) and HDFSUtil (getmerge / rmdir wrappers)
# are project-specific utility modules used below.

def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """Write spark_df to Hadoop, merge the part files and save a local csv file.

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get (all rows if None)
    save_csv: filename for the exported csv
    csv_sep: field separator
    csv_quote: quote character
    """
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)

    # write spark_df to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)

    # merge the part files from hadoop into a single local file
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)

    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns,
                        sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)

    # re-write the csv file, this time with a header (and without the pandas index)
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote, index=False)

If the DataFrame is too big, how can I avoid using Pandas?

You can just save the file to HDFS or S3 or whichever distributed storage you have.
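For example, a minimal PySpark sketch (the output path is a placeholder): each executor writes its own part file in parallel, so no single machine has to hold all the data.

    # Write directly to distributed storage as multiple part files.
    # "hdfs:///tmp/my_output" is a placeholder path; an s3a:// path works the same way.
    spark_df.write \
        .option("header", "true") \
        .option("sep", ",") \
        .csv("hdfs:///tmp/my_output")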

Is directly writing to a csv using file I/O a better way? Can it preserve the separators?

If you mean saving the file to local storage, it will still cause an OOM exception, since you would need to move all the data into memory on the local machine to do it.

Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file, and when the files are merged, there will be headers in the middle. Am I wrong?

In this case you will have only one file (since you do coalesce(1)), so you don't need to worry about headers. Instead, you should worry about memory on the executors: you might get an OOM on the executor, since all the data will be moved to that single executor.
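For illustration, a sketch of that pattern (the path is a placeholder); the result is a directory containing a single part file with exactly one header row:

    # coalesce(1) pulls all partitions onto one executor before writing,
    # so only one part file (and one header) is produced -- but that
    # executor must hold the whole DataFrame, which can OOM for large data.
    spark_df.coalesce(1) \
        .write \
        .option("header", "true") \
        .csv("hdfs:///tmp/single_part_output")   # placeholder path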

Is using Spark write and then Hadoop getmerge better than using coalesce in terms of performance?

Definitely better (but don't use coalesce()). Spark will write the data to storage efficiently in parallel, HDFS will replicate it, and then getmerge can efficiently read the part files from the nodes and merge them.
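A rough sketch of this approach driven from Python (paths are placeholders, the default filesystem is assumed to be HDFS, and the hdfs CLI is assumed to be on the PATH):

    import subprocess

    # 1. Write in parallel, without coalesce and without headers, so the
    #    merged file will not contain stray header rows in the middle.
    spark_df.write.csv("/tmp/parts_dir", sep=",", quote='"')

    # 2. Merge all part files into one local file with the HDFS CLI.
    subprocess.check_call(
        ["hdfs", "dfs", "-getmerge", "/tmp/parts_dir", "merged.csv"])

If you need a header row, it can be prepended to the merged file locally afterwards.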

We used the Databricks spark-csv library. It works fine.

df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))

Library:

<!-- spark df to csv -->
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-csv_2.10</artifactId>
        <version>1.3.0</version>
    </dependency>
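Note that spark-csv targets Spark 1.x; since Spark 2.0 the CSV source is built in, so on Spark 2.3 roughly the same write can be expressed without the external library. A sketch in PySpark, with delim and temp_path standing in for the delimiter and output path used above:

    # Built-in CSV writer (Spark 2.x); no external package needed.
    df.write \
        .mode("overwrite") \
        .option("delimiter", delim) \
        .option("nullValue", "-") \
        .csv(temp_path)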
