
Writing a pandas DataFrame (.csv) to the local system or HDFS with Spark in cluster mode

I'm trying to write a pandas DataFrame to the local file system or to HDFS with Spark in cluster mode, but it throws an error like:

IOError: [Errno 2] No such file or directory: {hdfs_path/file_name.txt}

This is how I'm writing it:

df.to_csv("hdfs_path/file_name.txt", sep="|")  # pandas resolves this as a path on the driver's local filesystem

I am using Python, and the job runs through a shell script.

This works fine in local mode but fails in yarn-cluster mode.

Any support is welcome and thanks in advance.

I had the same issue. pandas' to_csv writes through the local filesystem API, so in yarn-cluster mode the path is resolved on whichever cluster node the driver happens to run on, not on HDFS; since that local directory doesn't exist there, you get the "No such file or directory" error. I always convert the pandas DataFrame into a Spark DataFrame before writing the file out through Spark:

df_sp = spark.createDataFrame(df_pd)  # convert the pandas DataFrame into a distributed Spark DataFrame
df_sp.coalesce(1).write.csv("my_file.csv", mode="overwrite", header=True)  # coalesce(1) => a single part file
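
Note that write.csv produces a directory named my_file.csv (containing one part-*.csv file thanks to coalesce(1), plus a _SUCCESS marker) on whatever filesystem Spark uses by default, typically HDFS in yarn-cluster mode; it does not produce a single flat file with that name.

If you specifically need to keep the data as a pandas DataFrame and land it on HDFS from the driver, a minimal sketch using the third-party HdfsCLI package (pip install hdfs) could look like the following. The WebHDFS URL, user name, and target path here are placeholder assumptions, not values from the question:

from hdfs import InsecureClient
import pandas as pd

# Placeholder WebHDFS endpoint (port 9870 on Hadoop 3, 50070 on Hadoop 2) and user.
client = InsecureClient("http://namenode-host:9870", user="hadoop_user")

df_pd = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # stand-in for the real DataFrame

# Stream the CSV straight into HDFS instead of the driver's local disk.
with client.write("/user/hadoop_user/file_name.txt", encoding="utf-8", overwrite=True) as writer:
    df_pd.to_csv(writer, sep="|", index=False)

This writes one file with exactly the name you choose, at the cost of funnelling all the data through the driver.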
