
How to load parquet files from hadoopish folder

If I save a data frame this way in Java:

df.write().parquet("myTest.parquet");

then it gets saved in a "hadoopish" way (a folder with numerous files).

Is it possible to save the data frame as a single file? I tried collect(), but it does not help.

If it's impossible, then my question is: how should I change the Python code for reading Parquet files from the hadoopish folder created by df.write().parquet("myTest.parquet")?

load_df = sqlContext.read.parquet("myTest.parquet").where('field1="aaa"').select('field2', 'field3').coalesce(64)

Is it possible to save the data frame as a single file?

Yes, but you should not, as you may put too much pressure on a single JVM, which can lead not only to performance degradation but also to JVM termination and hence failure of the entire Spark application.

So, yes, it's possible, and you should use repartition(1) to have a single partition:

repartition(numPartitions: Int): Dataset[T] Returns a new Dataset that has exactly numPartitions partitions.
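For example, a minimal PySpark sketch, assuming an existing DataFrame df and the output path from the question:

# Force one partition so the output directory contains exactly one part file;
# `df` is assumed to be an existing DataFrame.
df.repartition(1).write.parquet("myTest.parquet")

Note that the result is still a directory (holding one part file plus the _SUCCESS marker), not a bare file; Spark's writers always produce a directory.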


how should I change the Python code for reading Parquet files from the hadoopish folder

Loading the dataset from, as you called it, a "hadoopish" folder means not being concerned with the internal structure at all, and considering it a single file (that is actually a directory under the covers).

That's an internal representation of how the files are stored, and it does not impact the code you use to load them.
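In other words, the loading code needs no change: point it at the directory and Spark reads all the part files inside as one dataset. A minimal PySpark sketch, assuming the same sqlContext as in the question:

# Pass the directory path; Spark discovers the part files by itself.
load_df = sqlContext.read.parquet("myTest.parquet")
load_df.printSchema()   # schema is recovered from the Parquet metadata
print(load_df.count())  # rows across all part files combined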

Spark writes your files into a directory; these files are numerous, as you say, and if the write operation succeeds it saves another, empty file called _SUCCESS.
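For illustration, here is one way to inspect that layout from Python, assuming the output was written to the local filesystem rather than HDFS (the part-file names below are illustrative and vary per run):

import os

# List the contents of the folder produced by df.write().parquet(...).
for name in sorted(os.listdir("myTest.parquet")):
    print(name)

# Typical output (illustrative):
#   _SUCCESS
#   part-00000-<uuid>.snappy.parquet
#   part-00001-<uuid>.snappy.parquet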

I'm coming from Scala, but I do believe that there's a similar way in Python.

Saving and reading your files in Parquet, JSON, or whatever format you want is straightforward:

# Save the data frame; this produces a directory of part files.
df.write.parquet("path")
# Read it back by pointing at that same directory.
loaddf = spark.read.parquet("path")

I tried collect(), but it does not help.

Talking about collect(), it is not good practice to use it in such operations, because it returns your data to the driver, so you lose the benefits of parallel computation, and it will cause an OutOfMemoryError if the data can't fit in memory.
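If you do need rows on the driver, a couple of safer patterns, sketched in PySpark under the assumption of an existing DataFrame df:

rows = df.collect()                # risky: the whole dataset lands on the driver

sample = df.limit(100).collect()   # bound how many rows reach the driver
for row in df.toLocalIterator():   # stream rows, one partition at a time
    print(row)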

Is it possible to save the data frame as a single file?

You really don't need to do that in most cases; if you do, use the repartition(1) method on your DataFrame before saving it, as in the sketch below.
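A sketch of the two single-partition options (the output paths here are illustrative): repartition(1) performs a full shuffle, while coalesce(1) merges existing partitions without one, at the cost of also reducing the parallelism of the computation that produces df:

# Both produce a single part file inside their output directory.
df.repartition(1).write.parquet("out_repartition.parquet")  # full shuffle
df.coalesce(1).write.parquet("out_coalesce.parquet")        # no full shuffle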

Hope it helps, Best Regards
