I am working on Windows 10. I installed Spark, and the goal is to use PySpark. I have taken the following steps:
C:\Python37
C:\winutils\bin
C:\spark-3.0.0-preview2-bin-hadoop2.7
Under System Variables, I set the following variables:

HADOOP_HOME: C:\winutils
SPARK_HOME: C:\spark-3.0.0-preview2-bin-hadoop2.7
JAVA_HOME: C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot
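If a Jupyter kernel still cannot see these variables, one common workaround is to set them from Python itself before importing pyspark. A minimal sketch, using the paths from this question (adjust them to your own installation):

```python
import os

# Paths taken from the question above; adjust to your own installation.
os.environ["HADOOP_HOME"] = r"C:\winutils"
os.environ["SPARK_HOME"] = r"C:\spark-3.0.0-preview2-bin-hadoop2.7"
os.environ["JAVA_HOME"] = r"C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot"

# Confirm the variables are visible to the current process.
for var in ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))
```

Setting them this way only affects the current Python process, so it is a quick check that the values are correct before committing them to the System Variables dialog.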
And finally, under the system Path, I added:
In the terminal:
So I would like to know why I am getting the warning "unable to load native-hadoop library", and why Spark could not bind on port 4040.
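On the port question: Spark's web UI tries port 4040 first and, if it is already taken (for example by another running SparkContext), falls back to 4041, 4042, and so on, so the message is usually harmless. A small stdlib-only sketch to check whether something is already listening on 4040 (the `port_in_use` helper is illustrative, not part of Spark):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when some process is listening on that port.
        return s.connect_ex((host, port)) == 0

# Spark's UI tries 4040 first, then 4041, 4042, ... if the port is taken.
print("4040 in use:", port_in_use(4040))
```

If 4040 is held by a stale notebook kernel, stopping that kernel (or its SparkContext) frees the port.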
Finally, inside Jupyter Notebook, I am getting the following error when trying to write to a Parquet file. The first image shows a working example, and the next one shows the code with errors:
And here is DataMaster__3.csv on my disk:
And the DaterMaster_par2222.parquet:
Any help is much appreciated!!
If you are writing the file in CSV format, I have found that the best way to do that is with the following approach:
LCL_POS.toPandas().to_csv(<path>)
There is another way to save it directly, without converting to pandas, but the issue is that the output ends up split into multiple files (with odd names), so I tend to avoid it. If you are happy to have the file split up, then in my opinion it is much better to write a Parquet file instead.
LCL_POS.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(<path>)
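A note on what `save(<path>)` produces here: even with `repartition(1)`, Spark writes a *directory* containing a single `part-*.csv` plus `_SUCCESS` marker files, which is where the "weird names" come from. If you want a single, cleanly named file, one option is to copy the part file out afterwards with the standard library (a sketch; `extract_single_csv` is a hypothetical helper, not a Spark API):

```python
import glob
import os
import shutil

def extract_single_csv(spark_output_dir: str, dest_path: str) -> None:
    """Copy the lone part-*.csv that Spark wrote into spark_output_dir
    over to dest_path, giving it a human-friendly name."""
    parts = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.copyfile(parts[0], dest_path)
```

After the `repartition(1)...save("out_dir")` call above, `extract_single_csv("out_dir", "LCL_POS.csv")` would leave a single CSV with a predictable name next to the Spark output directory.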
Hope that answers your question.