
Ensuring Data Locality on HBase

Is there a way to ensure that my Spark executors are co-located with my HBase region servers? In the Hortonworks Spark-on-HBase connector it is mentioned as below:

We assume Spark and HBase are deployed in the same cluster, and Spark executors are co-located with region servers

Is there a way of achieving the same? If I use

sparkContext.newAPIHadoopRDD(...)

will it ensure data locality?
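For concreteness, what I have in mind is something like the following sketch (the table name is just an example):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseLocalityScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-locality-scan"))

    // TableInputFormat creates one split per region and reports the hosting
    // region server's hostname as that split's preferred location.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // example table name

    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // A task runs NODE_LOCAL only if an executor already exists on the region
    // server's host; otherwise it falls back to ANY after spark.locality.wait.
    println(rdd.count())
    sc.stop()
  }
}
```

My worry is that the preferred location only helps if an executor is already running on that host.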

Our experience at Splice Machine was that for large queries, and for systems where many analytical queries and compactions were running in Spark, we would achieve decent locality (95%+). We are using Spark on YARN, where the executors are dynamically allocated and shrink after a certain period of inactivity.
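That dynamic-allocation behavior maps onto configuration along these lines (a sketch; the values are illustrative, not our exact settings):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("long-lived-context")
  .set("spark.dynamicAllocation.enabled", "true")
  // The external shuffle service is required for dynamic allocation on YARN.
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "100")         // illustrative cap
  .set("spark.dynamicAllocation.executorIdleTimeout", "300s") // executors shrink when idle
  // How long the scheduler holds a task hoping for a node-local executor slot.
  .set("spark.locality.wait", "3s")

val sc = new SparkContext(conf)
```

There were a couple of issues we had to work around, however: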

  1. Single Spark context across operations. We built a single Spark context server so that all of our queries could have effective resource management and locality. If you create many Spark contexts, executor resources can be blocked from executing on certain nodes.

  2. If someone runs a medium-sized query after a period of inactivity, there is a strong likelihood that not every node holding the data will have an executor dynamically allocated on it.

  3. We leaned heavily on our implementation of compactions on Spark, and on an approach of reading store files directly from Spark (vs. an HBase remote scan) with incremental deltas from the HBase memstore. The compactions created better locality and demand on the executor tasks. Reading store files directly allowed locality to be based on file locality (2-3 HDFS copies) rather than on the single region server hosting the region (only 1 server); see the snapshot-read sketch after this list.

  4. We wrote our own split mechanism, because the default HBase split-by-region-size caused long tails and serious memory issues. For example, we would have a table with 6 regions ranging from 20 MB to 4 GB. The 4 GB region would be the long tail: certain Spark operations would expect an executor to be able to load the entire 4 GB into memory, causing memory issues. With our own split mechanism we in essence put a bound on the amount of data to scan and hold in memory; see the bounded-split sketch at the end of this list.
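Our store-file reader lives in our repo rather than in this answer; a comparable approach using only stock HBase APIs is to snapshot the table and scan it with TableSnapshotInputFormat, which reads HFiles straight from HDFS, so split locations come from the file blocks instead of the single hosting region server. A sketch, assuming a snapshot named my_table_snap already exists and /tmp/restore is a scratch HDFS directory (both hypothetical):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object SnapshotScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("snapshot-scan"))

    val job = Job.getInstance(HBaseConfiguration.create())
    // Configures the job to read the snapshot's HFiles directly from HDFS,
    // bypassing the region server RPC path entirely.
    TableSnapshotInputFormat.setInput(job, "my_table_snap", new Path("/tmp/restore"))

    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(rdd.count())
    sc.stop()
  }
}
```

Note that a snapshot scan misses rows still sitting in the memstore, which is exactly why we pair direct file reads with incremental deltas from the memstore.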
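And here is a simplified sketch of the split-bounding idea (not our actual implementation; the class name and the 256 MB bound are illustrative): subclass TableInputFormat and subdivide any split whose estimated size exceeds the bound.

```scala
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSplit}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.{InputSplit, JobContext}
import scala.collection.JavaConverters._

class BoundedTableInputFormat extends TableInputFormat {
  private val MaxBytesPerSplit = 256L * 1024 * 1024 // illustrative bound

  override def getSplits(context: JobContext): java.util.List[InputSplit] = {
    super.getSplits(context).asScala.flatMap {
      case t: TableSplit if t.getLength > MaxBytesPerSplit =>
        val pieces = math.ceil(t.getLength.toDouble / MaxBytesPerSplit).toInt
        // Bytes.split interpolates (pieces - 1) boundary keys between the
        // region's start and end rows; it can return null for keys it cannot
        // interpolate (e.g. the empty start/end rows of a table's first and
        // last regions), in which case we keep the original split.
        Option(Bytes.split(t.getStartRow, t.getEndRow, pieces - 1)) match {
          case Some(keys) =>
            keys.sliding(2).map { pair =>
              new TableSplit(t.getTable, pair(0), pair(1),
                t.getRegionLocation, t.getLength / pieces)
            }.toSeq
          case None => Seq(t)
        }
      case other => Seq(other)
    }.asJava
  }
}
```

Each sub-split keeps the region server's hostname as its preferred location, so locality is unchanged; the bound just caps how much data a single Spark task scans and holds. Newer HBase releases also ship a similar knob (hbase.mapreduce.input.autobalance) on TableInputFormatBase.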

If you need more detail about what we did, check out

http://community.splicemachine.com/

We are open source and you can check out our code at

https://github.com/splicemachine/spliceengine

Good luck, Vidya...

Cheers, John
