HBase Concurrent / Parallel Scan from Spark 1.6, Scala 2.10.6 besides multithreading

I have a list of row prefixes: Array("a", "b", ...)

I need to query HBase (using Nerdammer) for each of the row prefixes. My current solution is:

case class Data(x: String)

val rowPrefixes = Array("a", "b", "c")

rowPrefixes.par
  .map { rowPrefix =>
    sc.hbaseTable[Data]("tableName")
      .inColumnFamily("columnFamily")
      .withStartRow(rowPrefix)
  }
  .reduce(_ union _)

I am basically loading multiple RDDs using multiple threads (.par) and then unioning all of them at the end. Is there a better way to do this? I don't mind using a library other than Nerdammer.

Besides, I'm worried about reflection API thread-safety issues, since I'm reading HBase into an RDD of a case class.

I haven't used the Nerdammer connector, but if we take your example of 4 prefix row-key filters, using par the degree of parallelism is limited to the size of that array: the cluster may go underutilized and results may be slow.
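If you do stay with par, note that a parallel collection's default pool is sized to the number of available cores, so the driver-side parallelism is capped there. A minimal sketch of widening it on Scala 2.10 (the pool size 8 is an arbitrary example):

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Give the parallel collection a dedicated pool so every prefix
// can be submitted concurrently; 8 is an arbitrary example size.
val parPrefixes = rowPrefixes.par
parPrefixes.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))

This only changes driver-side threading, though; it does not change how the resulting RDDs are scanned on the cluster, which is why the distributed approach below scales better.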

You can check whether the following can be achieved with the Nerdammer connector; I have used the hbase-spark connector (CDH). In the approach below, the row-key prefixes are scanned across all table partitions, i.e. across all the table regions spread over the cluster, in parallel. This utilizes the available resources (cores/RAM) more efficiently and, more importantly, leverages the power of distributed computing.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{FilterList, PrefixFilter}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
// set zookeeper quorum properties in hbaseConf

val hbaseContext = new HBaseContext(sc, hbaseConf)

val rowPrefixes = Array("a", "b", "c")

// MUST_PASS_ONE: a row matches if ANY of the prefix filters matches.
// The no-arg FilterList defaults to MUST_PASS_ALL, which would match
// nothing when combining multiple distinct prefixes.
val filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE)
rowPrefixes.foreach { x => filterList.addFilter(new PrefixFilter(Bytes.toBytes(x))) }

val scan = new Scan()
scan.setFilter(filterList)
scan.addFamily(Bytes.toBytes("myCF"))

val rdd = hbaseContext.hbaseRDD(TableName.valueOf("tableName"), scan)
val data = rdd.mapPartitions(populateCaseClass)
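populateCaseClass is left undefined above; a minimal sketch, assuming the question's Data(x: String) case class and, hypothetically, that the value is stored in a column named x of family myCF (adjust to your schema):

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical mapper from raw (rowKey, Result) pairs to the case class;
// the column family and qualifier names here are illustrative only.
def populateCaseClass(it: Iterator[(ImmutableBytesWritable, Result)]): Iterator[Data] =
  it.map { case (_, result) =>
    Data(Bytes.toString(result.getValue(Bytes.toBytes("myCF"), Bytes.toBytes("x"))))
  }

Since the conversion runs inside a single job on the executors, with no driver-side multithreading, the reflection thread-safety concern from the question does not arise here.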

In your case, too, a full table scan will happen, but only 4 partitions will do a considerable amount of work, assuming you have sufficient cores available and par can allocate one core to each element of the rowPrefixes array.
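If you do keep the per-prefix scans, each scan can at least be bounded instead of running from its prefix to the end of the table (withStartRow alone sets no upper bound). A sketch assuming the Nerdammer connector's withStopRow and a "~" sentinel that sorts above every byte that can follow the prefix in your row keys:

// Bound each scan to its prefix range; "~" (0x7E) is a hypothetical
// sentinel - pick one above any byte that can follow your prefixes.
rowPrefixes.par
  .map { rowPrefix =>
    sc.hbaseTable[Data]("tableName")
      .inColumnFamily("columnFamily")
      .withStartRow(rowPrefix)
      .withStopRow(rowPrefix + "~")
  }
  .reduce(_ union _)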

Hope this helps.
