
Writing a DataFrame to Parquet

I am trying to read a JSON file in Spark and write it back as Parquet. I am running my code on Windows. Below is my code. After execution it creates a folder called output_spark.parquet, and it also throws an error that the file is not found. If I create the file first and then run the code, it says that the file already exists. Here is the error I get:

    py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet. : java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset

Do I need a file writer to write the Parquet file? I'd appreciate any code snippets you might have.

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    df = spark.read.json("Output.json")

    df.show()

    df.write.parquet("output_spark.parquet")

On Windows, Hadoop requires native code extensions so that it can integrate with the OS correctly for things like file access semantics and permissions. How to fix this (a Python sketch follows the steps below):

  1. Get the WINUTILS.EXE binary from a Hadoop redistribution. Use this link.

  2. Set the environment variable %HADOOP_HOME% to point to the directory above the bin directory containing WINUTILS.EXE. To do this, search for "Edit user variables" in the Windows search bar and set it there as a user variable.
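If you prefer not to touch the Windows settings dialog, the same variables can be set from Python before the SparkSession is created, since the JVM reads HADOOP_HOME at startup. Below is a minimal sketch; the path C:\hadoop is an assumption and should point to wherever you extracted the bin folder containing winutils.exe. The mode("overwrite") call also addresses the "file already exists" part of the question: Spark's default save mode refuses to overwrite existing output.

    import os

    # Assumed layout: winutils.exe lives at C:\hadoop\bin\winutils.exe.
    # Adjust the path to match your machine. This must run before the
    # first SparkSession is created, because the JVM reads HADOOP_HOME
    # when it starts.
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] = r"C:\hadoop\bin;" + os.environ["PATH"]

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .getOrCreate()

    df = spark.read.json("Output.json")

    # mode("overwrite") replaces any existing output directory instead of
    # failing with "file already exists" (the default mode is errorifexists).
    df.write.mode("overwrite").parquet("output_spark.parquet")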
