I'm afraid it's not going to work like that, because saving the data locally implies it must all be present on the driver. Per the pyspark docs, the path parameter in pyspark.sql.DataFrameWriter.csv is a "path in any Hadoop supported file system".
So as far as I can tell, there are several alternatives:

1. Write the dataframe to HDFS as usual, then copy the output to the local filesystem with hdfs dfs -get... This would be the most straightforward and preferred way (see the first sketch below).
2. df.collect() to bring the complete dataframe to the driver, and then write it to the local FS. This might not be feasible for large dataframes, since it can crash the driver with an OOM (second sketch below).
3. df.toLocalIterator() to bring the data to the driver one partition at a time, and then write it to the local FS. This avoids / lessens the OOM risk of the previous option (third sketch below).
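For option 1, a minimal sketch; the HDFS path /tmp/df_out and the local path /data/df_out are hypothetical placeholders:

    # in the Spark application: write to HDFS as usual
    df.write.csv("hdfs:///tmp/df_out", header=True, mode="overwrite")

and then, from a shell on a node with HDFS access:

    hdfs dfs -get /tmp/df_out /data/df_out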
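For option 2, a sketch assuming the dataframe actually fits in driver memory; /data/out.csv is a placeholder path:

    import csv

    rows = df.collect()  # pulls ALL rows to the driver at once
    with open("/data/out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(df.columns)   # header
        writer.writerows(rows)        # pyspark Row objects unpack like tuples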
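For option 3, the same idea but streaming, so peak driver memory is roughly one partition rather than the whole dataframe:

    import csv

    with open("/data/out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(df.columns)          # header
        for row in df.toLocalIterator():     # fetches one partition at a time
            writer.writerow(row)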