
Spark write to Parquet on HDFS

I have a 3-node cluster with Hadoop and Spark installed. I would like to load data from an RDBMS into a DataFrame and write that data to Parquet on HDFS. The "dfs.replication" value is 1.
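For context, the DataFrame is loaded over JDBC roughly like this (a minimal sketch only; the MySQL URL, table name, and credentials below are placeholders, not my actual source):

// Sketch of the JDBC read; connection details are placeholders
val xfact = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")  // hypothetical RDBMS endpoint
  .option("dbtable", "xfact")                      // hypothetical source table
  .option("user", "spark")
  .option("password", "secret")
  .load()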

When I try this with the following command, I see that all HDFS blocks end up on the node where I ran spark-shell.

scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")

Is this the intended behaviour or should all blocks be distributed across the cluster?

Thanks

Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:

Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

So yes, this is the intended behaviour.
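If you want to verify this, you can print the block locations of the written files from the same spark-shell session. This is just a sketch using the Hadoop FileSystem API; the /xfact path is taken from your question:

import org.apache.hadoop.fs.Path

// Resolve the HDFS filesystem for the output path using the shell's Hadoop configuration
val path = new Path("hdfs://sparknode01.localdomain:9000/xfact")
val fs = path.getFileSystem(sc.hadoopConfiguration)

// Print the datanode host(s) holding each block of every file under /xfact
fs.listStatus(path).filter(_.isFile).foreach { st =>
  fs.getFileBlockLocations(st, 0, st.getLen).foreach { loc =>
    println(s"${st.getPath.getName}: ${loc.getHosts.mkString(", ")}")
  }
}

With dfs.replication = 1 each block has a single replica, so every line should show the node that issued the write.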

Just as @nik says, I do my work with multiple clients and it works for me. Since each client writes its own files, the first replica of each block lands on that client's node, so the data ends up spread across the cluster.

This is the Python snippet:

columns = xfact.columns
# Rebuild the DataFrame from the underlying RDD and write it out, overwriting any previous output
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')
