
Spark write to Parquet on HDFS

I have a 3-node cluster with Hadoop and Spark installed. I would like to load data from an RDBMS into a DataFrame and write that data to Parquet on HDFS. The "dfs.replication" value is 1.
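For context, the DataFrame is loaded over JDBC roughly like this (a minimal sketch only; the MySQL URL, table name, and credentials below are placeholders, not my actual source):

// Sketch of the JDBC read; connection details are placeholders
val xfact = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")  // hypothetical RDBMS endpoint
  .option("dbtable", "xfact")                      // hypothetical source table
  .option("user", "spark")
  .option("password", "secret")
  .load()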

When I try this with the following command, I see that all HDFS blocks end up on the node where I ran spark-shell.

scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")

Is this the intended behaviour or should all blocks be distributed across the cluster?

Thanks

Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:

Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

So yes, this is the intended behaviour.
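If you want to verify this, you can print the block locations of the written files from the same spark-shell session. This is just a sketch using the Hadoop FileSystem API; the /xfact path is taken from your question:

import org.apache.hadoop.fs.Path

// Resolve the HDFS filesystem for the output path using the shell's Hadoop configuration
val path = new Path("hdfs://sparknode01.localdomain:9000/xfact")
val fs = path.getFileSystem(sc.hadoopConfiguration)

// Print the datanode host(s) holding each block of every file under /xfact
fs.listStatus(path).filter(_.isFile).foreach { st =>
  fs.getFileBlockLocations(st, 0, st.getLen).foreach { loc =>
    println(s"${st.getPath.getName}: ${loc.getHosts.mkString(", ")}")
  }
}

With dfs.replication = 1 each block has a single replica, so every line should show the node that issued the write.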

Just as @nik says, I do my work with multiple clients and it works for me. Since each client writes its own files, the first replica of each block lands on that client's node, so the data ends up spread across the cluster.

This is the Python snippet:

columns = xfact.columns
# Rebuild the DataFrame from the underlying RDD and write it out, overwriting any previous output
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')
