I am working on Windows 10. I installed Spark, and the goal is to use PySpark. I have taken the following steps:
C:\Python37
C:\winutils\bin
C:\spark-3.0.0-preview2-bin-hadoop2.7
Under System Variables, I set the following variables:

HADOOP_HOME: C:\winutils
SPARK_HOME: C:\spark-3.0.0-preview2-bin-hadoop2.7
JAVA_HOME: C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot
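If a Jupyter kernel still cannot see these variables, one common workaround is to set them from Python itself before importing pyspark. A minimal sketch, using the paths from this question (adjust them to your own installation):

```python
import os

# Paths taken from the question above; adjust to your own installation.
os.environ["HADOOP_HOME"] = r"C:\winutils"
os.environ["SPARK_HOME"] = r"C:\spark-3.0.0-preview2-bin-hadoop2.7"
os.environ["JAVA_HOME"] = r"C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot"

# Confirm the variables are visible to the current process.
for var in ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))
```

Setting them this way only affects the current Python process, so it is a quick check that the values are correct before committing them to the System Variables dialog.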
And finally, under the system Path, I added:
In the terminal:
So I would like to know why I am getting the warning "unable to load native-hadoop library", and why Spark could not bind on port 4040.
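On the port question: Spark's web UI tries port 4040 first and, if it is already taken (for example by another running SparkContext), falls back to 4041, 4042, and so on, so the message is usually harmless. A small stdlib-only sketch to check whether something is already listening on 4040 (the `port_in_use` helper is illustrative, not part of Spark):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when some process is listening on that port.
        return s.connect_ex((host, port)) == 0

# Spark's UI tries 4040 first, then 4041, 4042, ... if the port is taken.
print("4040 in use:", port_in_use(4040))
```

If 4040 is held by a stale notebook kernel, stopping that kernel (or its SparkContext) frees the port.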
Finally, inside Jupyter Notebook, I am getting the following error when trying to write to a Parquet file. The first image shows a working example, and the next one shows the code with errors:
And here is DataMaster__3.csv on my disk:
And the DaterMaster_par2222.parquet:
Any help is much appreciated!!
If you are writing the file in CSV format, I have found that the best way to do that is with the following approach:
LCL_POS.toPandas().to_csv(<path>)
There is another way to save it directly, without converting to pandas, but the issue is that the output ends up split into multiple files (with odd names), so I tend to avoid it. If you are happy to have the file split up, then in my opinion it is much better to write a Parquet file instead.
LCL_POS.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(<path>)
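A note on what `save(<path>)` produces here: even with `repartition(1)`, Spark writes a *directory* containing a single `part-*.csv` plus `_SUCCESS` marker files, which is where the "weird names" come from. If you want a single, cleanly named file, one option is to copy the part file out afterwards with the standard library (a sketch; `extract_single_csv` is a hypothetical helper, not a Spark API):

```python
import glob
import os
import shutil

def extract_single_csv(spark_output_dir: str, dest_path: str) -> None:
    """Copy the lone part-*.csv that Spark wrote into spark_output_dir
    over to dest_path, giving it a human-friendly name."""
    parts = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found %d" % len(parts))
    shutil.copyfile(parts[0], dest_path)
```

After the `repartition(1)...save("out_dir")` call above, `extract_single_csv("out_dir", "LCL_POS.csv")` would leave a single CSV with a predictable name next to the Spark output directory.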
Hope that answers your question.