
HBase concurrent / parallel scan from Spark 1.6, Scala 2.10.6, other than multithreading

I have a list of row prefixes: Array("a", "b", ...).

I need to query HBase (using Nerdammer) for each of these prefixes. My current solution is:

case class Data(x: String)

val rowPrefixes = Array("a", "b", "c")

rowPrefixes.par
    .map { rowPrefix =>
      sc.hbaseTable[Data]("tableName")
        .inColumnFamily("columnFamily")
        .withStartRow(rowPrefix)
    }
    .reduce(_ union _)

I am basically loading multiple RDDs using multiple threads (.par) and then unioning all of them at the end. Is there a better way to do this? I don't mind using a library other than Nerdammer.

I'm also worried about thread safety in the reflection API, since I'm reading HBase into an RDD of a case class.
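For reference, the chain of binary unions can also be written as a single SparkContext.union call, which keeps the RDD lineage flat. This is only a sketch, assuming sc and the Nerdammer implicits from the snippet above are in scope:

```scala
import org.apache.spark.rdd.RDD

case class Data(x: String)

val rowPrefixes = Array("a", "b", "c")

// Build one RDD per prefix. RDD construction is lazy and happens on the
// driver, so .par adds nothing here; no scan runs until an action is called.
val rdds: Seq[RDD[Data]] = rowPrefixes.toSeq.map { rowPrefix =>
  sc.hbaseTable[Data]("tableName")
    .inColumnFamily("columnFamily")
    .withStartRow(rowPrefix)
}

// A single n-way union avoids the deeply nested lineage that
// reduce(_ union _) builds up.
val all: RDD[Data] = sc.union(rdds)
```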

I haven't used the Nerdammer connector, but if we consider your example of four row-key prefix filters, using par the degree of parallelism is limited to the number of prefixes; the cluster may go underutilized and results may be slow.

You can check whether the following can be achieved using the Nerdammer connector; I have used the hbase-spark connector (CDH). In the approach below, the row-key prefixes are scanned across all table partitions, i.e. all the table regions spread across the cluster, in parallel. This utilizes the available resources (cores/RAM) more efficiently and, more importantly, leverages the power of distributed computing.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{FilterList, PrefixFilter}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
// set zookeeper quorum properties in hbaseConf

val hbaseContext = new HBaseContext(sc, hbaseConf)

val rowPrefixes = Array("a", "b", "c")

// MUST_PASS_ONE ORs the prefix filters together; the no-arg constructor
// defaults to MUST_PASS_ALL, which would AND them and match no rows.
val filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE)
rowPrefixes.foreach { x => filterList.addFilter(new PrefixFilter(Bytes.toBytes(x))) }

val scan = new Scan()
scan.setFilter(filterList)
scan.addFamily(Bytes.toBytes("myCF"))

val rdd = hbaseContext.hbaseRDD(TableName.valueOf("tableName"), scan)
rdd.mapPartitions(populateCaseClass)
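The populateCaseClass function above is left undefined; a minimal sketch of such a helper might look like the following, assuming hbaseRDD yields (ImmutableBytesWritable, Result) pairs as in the CDH hbase-spark connector, and that the column family and qualifier names ("myCF", "x") match your table:

```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Mirrors the case class from the question.
case class Data(x: String)

// Hypothetical helper: converts each scanned row into a Data instance.
// Runs once per partition, so per-partition setup cost is paid only once.
def populateCaseClass(
    rows: Iterator[(ImmutableBytesWritable, Result)]): Iterator[Data] =
  rows.map { case (_, result) =>
    val bytes = result.getValue(Bytes.toBytes("myCF"), Bytes.toBytes("x"))
    Data(Bytes.toString(bytes))
  }
```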

In your case, too, a full table scan will happen, but only four partitions will do a considerable amount of work, assuming you have sufficient cores available and par can allocate one thread to each element of the rowPrefixes array.

Hope this helps.
