Reading data from HBase using Spark and Java

Question

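I want to access HBase through Spark using Java. I have not found any examples besides this one. The answer there says:

You can also write this in Java

I copied this code from How to read from hbase using spark: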

import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

import org.apache.spark._

object HBaseRead {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "table1"

    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println("Number of Records found : " + hBaseRDD.count())
    sc.stop()
  }
}

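Can anyone give me some hints on how to find the right dependencies, objects, and so on?

It seems HBaseConfiguration is in hbase-client, but I am actually stuck on TableInputFormat.INPUT_TABLE. Shouldn't it be in the same dependency?

Is there a better way to access HBase with Spark?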

Answer 1

The TableInputFormat class is in hbase-server.jar, so you need to add that dependency to your pom.xml. See the thread "HBase and non-existent TableInputFormat" on the Spark user list.

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.3.0</version>
</dependency>

Below is sample code for reading from HBase using Spark.

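import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseRead {
  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[*]");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table");
    JavaPairRDD<ImmutableBytesWritable, Result> javaPairRdd = jsc.newAPIHadoopRDD(hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
    jsc.stop();
  }
}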

Answer 2

Yes, there is. Use Cloudera's SparkOnHBase.

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>1.2.0-cdh5.7.0</version>
</dependency>

Then use an HBase Scan to read data from your HBase table (or a Bulk Get if you know the keys of the rows you want to retrieve).

Configuration conf = HBaseConfiguration.create();
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"));
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);

Scan scan = new Scan();
scan.setCaching(100);

JavaRDD<Tuple2<byte[], List<Tuple3<byte[], byte[], byte[]>>>> hbaseRdd = hbaseContext.hbaseRDD(tableName, scan);

System.out.println("Number of Records found : " + hbaseRdd.count());
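
Both approaches above hand back raw byte arrays (row keys and cell contents), so the records usually need decoding before they are useful. The following is a minimal sketch using only the JDK, assuming the values were written as UTF-8 strings or 8-byte big-endian longs, which is the encoding HBase's own Bytes utility uses. The class name CellDecoder and its methods are hypothetical, not part of HBase or Spark; with HBase on the classpath you would more likely call Bytes.toString and Bytes.toLong directly. It also assumes each triple in the scan result is (family, qualifier, value).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of HBase or Spark.
public class CellDecoder {

    // Decode a value that was stored as UTF-8 text.
    public static String asString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decode a value that was stored as an 8-byte big-endian long.
    public static long asLong(byte[] raw) {
        if (raw.length != Long.BYTES) {
            throw new IllegalArgumentException("expected 8 bytes, got " + raw.length);
        }
        // ByteBuffer reads in big-endian order by default.
        return ByteBuffer.wrap(raw).getLong();
    }

    // Render one (family, qualifier, value) triple as "family:qualifier=value",
    // assuming all three parts are UTF-8 text.
    public static String describe(byte[] family, byte[] qualifier, byte[] value) {
        return asString(family) + ":" + asString(qualifier) + "=" + asString(value);
    }
}
```

Inside a Spark map function, a helper like this could turn each record into plain strings before calling collect(), which also avoids shipping non-serializable HBase objects back to the driver.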

Answer 1  0  2017-02-21 16:35:10

Answer 2  0  2017-03-02 21:01:50
