
Spark and HBase Snapshots

Under the assumption that we could access data much faster by pulling directly from HDFS instead of going through the HBase API, we're trying to build an RDD based on a table snapshot from HBase.

So, I have a snapshot called "dm_test_snap". I seem to be able to get most of the configuration working, but my RDD is null (despite there being data in the snapshot itself).

I'm having a hell of a time finding an example of anyone doing offline analysis of HBase snapshots with Spark, but I can't believe I'm alone in trying to get this working. Any help or suggestions are greatly appreciated.

Here is a snippet of my code:

object TestSnap  {
  def main(args: Array[String]) {
    val config = ConfigFactory.load()
    val hbaseRootDir =  config.getString("hbase.rootdir")
    val sparkConf = new SparkConf()
      .setAppName("testnsnap")
      .setMaster(config.getString("spark.app.master"))
      .setJars(SparkContext.jarOfObject(this))
      .set("spark.executor.memory", "2g")
      .set("spark.default.parallelism", "160")

    val sc = new SparkContext(sparkConf)

    println("Creating hbase configuration")
    val conf = HBaseConfiguration.create()

    conf.set("hbase.rootdir", hbaseRootDir)
    conf.set("hbase.zookeeper.quorum",  config.getString("hbase.zookeeper.quorum"))
    conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
    conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")

    val scan = new Scan
    val job = Job.getInstance(conf)

    TableSnapshotInputFormat.setInput(job, "dm_test_snap", 
        new Path("hdfs://nameservice1/tmp"))

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    hBaseRDD.count()

    System.exit(0)
  }

}

Update to include the solution: The trick was, as @Holden mentioned below, that the conf wasn't getting passed through. To remedy this, I was able to get it working by changing the call to newAPIHadoopRDD to this:

val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

There was a second issue, also highlighted by @victor's answer: I was not passing in a Scan. To fix that, I added this line and method:

conf.set(TableInputFormat.SCAN, convertScanToString(scan))

def convertScanToString(scan: Scan) = {
  val proto = ProtobufUtil.toScan(scan)
  Base64.encodeBytes(proto.toByteArray())
}

This also let me drop this line from the conf.set calls:

conf.set("hbase.TableSnapshotInputFormat.snapshot.name", "dm_test_snap")

*NOTE: This was for HBase version 0.96.1.1 on CDH5.0.

Final full code for easy reference:

// Imports for the classes used below (HBase 0.96 / CDH5 package layout)
import com.typesafe.config.ConfigFactory
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat}
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object TestSnap {
  def main(args: Array[String]) {
    val config = ConfigFactory.load()
    val hbaseRootDir =  config.getString("hbase.rootdir")
    val sparkConf = new SparkConf()
      .setAppName("testnsnap")
      .setMaster(config.getString("spark.app.master"))
      .setJars(SparkContext.jarOfObject(this))
      .set("spark.executor.memory", "2g")
      .set("spark.default.parallelism", "160")

    val sc = new SparkContext(sparkConf)

    println("Creating hbase configuration")
    val conf = HBaseConfiguration.create()

    conf.set("hbase.rootdir", hbaseRootDir)
    conf.set("hbase.zookeeper.quorum",  config.getString("hbase.zookeeper.quorum"))
    conf.set("zookeeper.session.timeout", config.getString("zookeeper.session.timeout"))
    val scan = new Scan
    conf.set(TableInputFormat.SCAN, convertScanToString(scan))

    val job = Job.getInstance(conf)

    TableSnapshotInputFormat.setInput(job, "dm_test_snap", 
        new Path("hdfs://nameservice1/tmp"))

    val hBaseRDD = sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    hBaseRDD.count()

    System.exit(0)
  }

  def convertScanToString(scan: Scan) = {
    val proto = ProtobufUtil.toScan(scan)
    Base64.encodeBytes(proto.toByteArray())
  }

}

Looking at the Job documentation, it makes a copy of the conf object you supply to it ("The Job makes a copy of the Configuration so that any necessary internal modifications do not reflect on the incoming parameter."), so most likely the information you need to set on the conf object isn't getting passed down to Spark. You could instead use TableSnapshotInputFormatImpl, which has a similar method that works on conf objects directly. There might be additional things needed, but at first pass through the problem this seems like the most likely cause.
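For reference, here is a minimal sketch of that conf-based route, reusing the SparkContext, snapshot name, and restore path from the question, and assuming your HBase build exposes a TableSnapshotInputFormatImpl.setInput(conf, snapshotName, restoreDir) overload:

// Sketch only: configure the snapshot on a plain Configuration (no Job involved),
// then hand that same conf to newAPIHadoopRDD.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.{TableSnapshotInputFormat, TableSnapshotInputFormatImpl}

val snapConf = HBaseConfiguration.create()
TableSnapshotInputFormatImpl.setInput(snapConf, "dm_test_snap",
  new Path("hdfs://nameservice1/tmp"))

val snapRDD = sc.newAPIHadoopRDD(snapConf, classOf[TableSnapshotInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])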

As pointed out in the comments, another option is to use job.getConfiguration to get the updated config from the job object.

You have not configured your M/R job properly. Here is an example in Java of how to configure M/R over snapshots:

Job job = new Job(conf);
Scan scan = new Scan();
TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName,
       scan, MyTableMapper.class, MyMapKeyOutput.class,
       MyMapOutputValueWritable.class, job, true);

You definitely skipped the Scan. I suggest you take a look at TableMapReduceUtil's initTableSnapshotMapperJob implementation to get an idea of how to configure the job in Spark/Scala.

Here is the complete configuration in mapreduce Java:

TableMapReduceUtil.initTableSnapshotMapperJob(snapshotName, // Name of the snapshot
                scan, // Scan instance to control CF and attribute selection
                DefaultMapper.class, // mapper class
                NullWritable.class, // mapper output key
                Text.class, // mapper output value
                job,
                true,
                restoreDir);
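For the Spark/Scala side, a hedged sketch of the same wiring: let initTableSnapshotMapperJob populate a Job's configuration, then pass job.getConfiguration to newAPIHadoopRDD as in the update above. The snapshotRDD helper name is made up for illustration, IdentityTableMapper is used only to satisfy the mapper parameter, and the restore directory is the one from the question.

// Sketch only: initTableSnapshotMapperJob writes the snapshot name, serialized Scan,
// and restore dir into the job configuration; Spark then reads from that configuration.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{IdentityTableMapper, TableMapReduceUtil, TableSnapshotInputFormat}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def snapshotRDD(sc: SparkContext, snapshotName: String,
                restoreDir: Path): RDD[(ImmutableBytesWritable, Result)] = {
  val job = Job.getInstance(HBaseConfiguration.create())
  TableMapReduceUtil.initTableSnapshotMapperJob(
    snapshotName, new Scan(), classOf[IdentityTableMapper],
    classOf[ImmutableBytesWritable], classOf[Result], job, true, restoreDir)
  sc.newAPIHadoopRDD(job.getConfiguration, classOf[TableSnapshotInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
}

// Usage: snapshotRDD(sc, "dm_test_snap", new Path("hdfs://nameservice1/tmp")).count()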
