
hdfs put/moveFromLocal not distributing data across data nodes?

I found a similar question, "Hadoop HDFS is not distributing blocks of data evenly",

but my question is about the case where the replication factor is 1.

I still want to understand why HDFS does not distribute file blocks evenly across the cluster nodes. This causes data skew from the start when I load such files and run DataFrame operations on them. Am I missing something?

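Even if the replication factor is 1, files are still split and stored in units of the HDFS block size. Block placement is best effort, AFAIK, not strictly balanced. With the default placement policy and a replication factor of 3, the first replica goes to the DataNode the client is running on (or to a random node if the client is not on a DataNode), the second to a node on a different rack, and the third to a different node on that same remote rack. With a replication factor of 1 only the first rule applies, so if you run put/moveFromLocal from a machine that is itself a DataNode, every block of the file will tend to land on that one node.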
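To make the "split into blocks" point concrete, here is a minimal sketch of the arithmetic. The function name `hdfs_block_sizes` is made up for illustration, and the 128 MB default assumes the stock `dfs.blocksize` setting of Hadoop 2 and later; your cluster may be configured differently.

```python
MB = 1024 * 1024


def hdfs_block_sizes(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes of the blocks a file of this size would be split into.

    Hypothetical helper for illustration only. The default block size is an
    assumption (the usual dfs.blocksize default), not read from a cluster.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative, block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block only occupies as much space as the data it holds.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    for size_mb in (50, 300):
        blocks = hdfs_block_sizes(size_mb * MB)
        print(size_mb, "MB file:", len(blocks), "block(s)")
```

A file smaller than one block is a single block, so with a replication factor of 1 it lives on exactly one node and there is nothing to spread out.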

You'll need to clarify how large your files are and where you are looking to check whether the data is being split.
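One place to look is `hdfs fsck <path> -files -blocks -locations`, which lists the DataNodes holding each block. Once you have that mapping, a per-node tally shows the skew directly. The sketch below assumes you have already collected the locations into a dict; the function name, block IDs, and host names are all made up for illustration.

```python
from collections import Counter


def blocks_per_node(block_locations):
    """Count how many block replicas each DataNode holds.

    block_locations maps a block ID to the list of hosts storing a replica
    of it. Hypothetical helper; it does not talk to HDFS itself.
    """
    counts = Counter()
    for hosts in block_locations.values():
        counts.update(hosts)
    return dict(counts)


if __name__ == "__main__":
    # Invented sample data: replication factor 1, so one host per block.
    sample = {
        "blk_1": ["datanode-a"],
        "blk_2": ["datanode-a"],
        "blk_3": ["datanode-b"],
    }
    print(blocks_per_node(sample))
```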

Note: not all file formats are splittable.

