
Error writing to parquet file using pyspark

I am working on Windows 10. I installed Spark, and the goal is to use PySpark. I took the following steps:

  1. I installed Python 3.7 with Anaconda -- Python was added to C:\Python37
  2. I downloaded winutils from this link -- winutils was added to C:\winutils\bin
  3. I downloaded Spark -- Spark was extracted to C:\spark-3.0.0-preview2-bin-hadoop2.7
  4. I downloaded Java 8 from AdoptOpenJDK

Under system variables, I set the following variables:

  1. HADOOP_HOME : C:\winutils
  2. SPARK_HOME : C:\spark-3.0.0-preview2-bin-hadoop2.7
  3. JAVA_HOME : C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot

And finally, under system path, I added:

  1. %JAVA_HOME%\bin
  2. %SPARK_HOME%\bin
  3. %HADOOP_HOME%\bin
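A quick way to confirm that a Python process actually sees these variables is a small check like the one below. This is only a sketch: `check_spark_env` is a hypothetical helper name, and it assumes winutils.exe is expected under %HADOOP_HOME%\bin, as in the setup described above.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the given environment mapping.

    Hypothetical helper: it only checks that the three variables are set,
    that they point to existing directories, and that winutils.exe exists
    under HADOOP_HOME\\bin.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")

    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append(f"winutils.exe not found at {winutils}")
    return problems
```

Calling `check_spark_env()` with no arguments inspects the current process environment; an empty list means none of these particular problems were found. Variables changed in the Windows settings dialog are only visible to terminals and notebooks started afterwards.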

In the terminal:

[image]

[image]

So I would like to know why I am getting this warning:

unable to load native-hadoop library...

and why Spark could not bind on port 4040.
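For context, the native-hadoop message is a warning that Spark falls back to its built-in Java classes, and the port 4040 message usually appears when another Spark session (for example a second shell or notebook) already holds the default UI port. A minimal sketch for checking the second case is shown below; `port_in_use` is a hypothetical helper, and it assumes the Spark UI listens on the local machine.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 4040 is the default Spark UI port
    print("port 4040 in use:", port_in_use(4040))
```

If the port is taken, stopping the other session with `spark.stop()` or choosing a different port through the `spark.ui.port` setting are two options to try.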

Finally, inside Jupyter Notebook, I get an error when trying to write to a Parquet file. The image below shows a working example, and the one after it shows the code that raises the error:

[image]

And here is DataMaster__3.csv on my disk:

[image]

And the DaterMaster_par2222.parquet:

[image]

[image]

Any help is much appreciated!!

If you are writing the file in CSV format, I have found that the best way to do that is the following approach:

LCL_POS.toPandas().to_csv(<path>)

There is another way to save it directly without converting to pandas, but the output ends up split into multiple files with weird names, so I tend to avoid it. If you are happy to have the file split up, it's much better to write a Parquet file, in my opinion.

LCL_POS.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(<path>)
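Even with `repartition(1)`, Spark writes a directory at `<path>` that contains a single `part-*` file plus marker files, which is where the weird names come from. If one plainly named file is wanted, a small post-processing step can move that part file out. The sketch below uses only the standard library; `collect_single_part_file` is a hypothetical helper name, and it assumes the output directory is on the local filesystem and holds exactly one uncompressed part file.

```python
import glob
import os
import shutil


def collect_single_part_file(spark_output_dir, target_path, extension="csv"):
    """Move the lone part-* file from a Spark output directory to target_path."""
    # Hidden checksum files start with a dot, so this pattern skips them.
    pattern = os.path.join(spark_output_dir, "part-*." + extension)
    part_files = glob.glob(pattern)
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_files)}"
        )
    shutil.move(part_files[0], target_path)
    return target_path
```

After the write above finishes, a call such as `collect_single_part_file(output_dir, "LCL_POS.csv")` would leave one CSV file with a predictable name, where `output_dir` stands for the same location passed to `save`.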

Hope that answers your question.

