
Writing a DataFrame to Parquet

I am trying to read a JSON file in Spark and write it back as Parquet. I am running my code on Windows. Below is my code. After execution it creates a folder called output_spark.parquet, and it also throws an error that the file is not found. If I create the file first and then run the code, it says that the file already exists. Here is the error I get:

    py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet. : java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset

Do I need a file writer to write the Parquet file? I'd appreciate any code snippets you might have.

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    df = spark.read.json("Output.json")

    df.show()

    df.write.parquet("output_spark.parquet")

On Windows, Hadoop requires native code extensions so that it can integrate with the OS correctly for things like file access semantics and permissions. How to fix this (a Python sketch follows the steps below):

  1. Get the WINUTILS.EXE binary from a Hadoop redistribution. Use this link.

  2. Set the environment variable %HADOOP_HOME% to point to the directory above the bin directory containing WINUTILS.EXE. To do this, search for "Edit user variables" in the Windows search bar and set it there as a user variable.
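If you prefer not to touch the Windows settings dialog, the same variables can be set from Python before the SparkSession is created, since the JVM reads HADOOP_HOME at startup. Below is a minimal sketch; the path C:\hadoop is an assumption and should point to wherever you extracted the bin folder containing winutils.exe. The mode("overwrite") call also addresses the "file already exists" part of the question: Spark's default save mode refuses to overwrite existing output.

    import os

    # Assumed layout: winutils.exe lives at C:\hadoop\bin\winutils.exe.
    # Adjust the path to match your machine. This must run before the
    # first SparkSession is created, because the JVM reads HADOOP_HOME
    # when it starts.
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] = r"C:\hadoop\bin;" + os.environ["PATH"]

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()

    df = spark.read.json("Output.json")

    # mode("overwrite") replaces any existing output directory instead of
    # failing with "file already exists" (the default mode is errorifexists).
    df.write.mode("overwrite").parquet("output_spark.parquet")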
