Storing a Spark DataFrame in HBase

Question

I am trying to store a Spark Dataset in HBase efficiently. When we try something like this with a lambda in Java:

sparkDF.foreach(l -> this.hBaseConnector.persistMappingToHBase(l, "name_of_hBaseTable"));

The function persistMappingToHBase uses the HBase Java client (Put) to write to HBase.

I get an exception: Exception in thread "main" org.apache.spark.SparkException: Task not serializable
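
A likely cause, offered as an assumption and not a confirmed diagnosis: the lambda refers to this.hBaseConnector, so it captures the enclosing object, and Spark has to serialize everything a closure captures before sending it to the executors. The sketch below reproduces that effect with plain JDK serialization and no Spark at all. SerializableAction, FakeConnector, and Driver are hypothetical stand-ins, not classes from the question or from Spark.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class ClosureCaptureDemo {

    // Hypothetical stand-in for a serializable function interface.
    interface SerializableAction<T> extends Serializable {
        void call(T value);
    }

    // Hypothetical connector; deliberately NOT Serializable.
    static class FakeConnector {
        void persist(String row) {
            // A real connector would write to HBase here.
        }
    }

    // Hypothetical driver-side class that keeps the connector in a field.
    static class Driver {
        private final FakeConnector connector = new FakeConnector();

        // Reading this.connector makes the lambda capture `this`,
        // so serializing the lambda also tries to serialize the Driver.
        SerializableAction<String> capturingThis() {
            return row -> this.connector.persist(row);
        }

        // Creating the connector inside the lambda captures nothing.
        static SerializableAction<String> capturingNothing() {
            return row -> new FakeConnector().persist(row);
        }
    }

    // Returns true if Java serialization accepts the object.
    static boolean canSerialize(Object value) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new Driver().capturingThis()));
        System.out.println(canSerialize(Driver.capturingNothing()));
    }
}
```

The second lambda serializes because it builds its connector where it runs, which is the same idea the foreachPartition attempt in the question relies on.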

Then we tried this:

sparkDF.foreachPartition(partition -> {
    final HBaseConnector hBaseConnector = new HBaseConnector();
    hBaseConnector.connect(hbaseProps);
    while (partition.hasNext()) {
        hBaseConnector.persistMappingToHBase(partition.next());
    }
    hBaseConnector.closeConnection();
});

I think this works, but it seems very inefficient, because we create and close a connection for every row of the DataFrame.

What is a good way to store a Spark Dataset in HBase? I have seen the connector developed by IBM, but I have never used it.
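
On the efficiency concern: Spark invokes the function given to foreachPartition once per partition and passes it an iterator over that partition's rows, so code shaped like the second snippet should open one connection per partition, not one per row. The plain-Java sketch below simulates that without Spark. CountingConnector and the partition lists are hypothetical stand-ins for HBaseConnector and the DataFrame's partitions.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PartitionConnectionDemo {

    // Hypothetical connector that only counts how it is used.
    static class CountingConnector {
        static int connectionsOpened = 0;
        static int rowsWritten = 0;

        void connect() {
            connectionsOpened++;
        }

        void persist(String row) {
            rowsWritten++;
        }

        void close() {
            // Nothing to release in this simulation.
        }
    }

    // Same shape as the foreachPartition body in the question:
    // connect once, drain the iterator, then close.
    static void writePartition(Iterator<String> partition) {
        CountingConnector connector = new CountingConnector();
        connector.connect();
        try {
            while (partition.hasNext()) {
                connector.persist(partition.next());
            }
        } finally {
            connector.close();
        }
    }

    // Simulates Spark calling the function once for each partition.
    // Returns {connections opened, rows written}.
    static int[] run(List<List<String>> partitions) {
        CountingConnector.connectionsOpened = 0;
        CountingConnector.rowsWritten = 0;
        for (List<String> partition : partitions) {
            writePartition(partition.iterator());
        }
        return new int[] {CountingConnector.connectionsOpened, CountingConnector.rowsWritten};
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        int[] counts = run(partitions);
        System.out.println(counts[0] + " connections for " + counts[1] + " rows");
    }
}
```

The try/finally is an addition in this sketch so that the connection is released even if a write fails partway through a partition.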

Answer 1

The following can be used to save the contents to HBase:

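import java.util.UUID

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val hbaseConfig = HBaseConfiguration.create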
hbaseConfig.set("hbase.zookeeper.quorum", "xx.xxx.xxx.xxx")
hbaseConfig.set("hbase.zookeeper.property.clientPort", "2181")

val job = Job.getInstance(hbaseConfig)
job.setOutputFormatClass(classOf[TableOutputFormat[_]])
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "test_table")

//  saveAsNewAPIHadoopDataset is defined on pair RDDs, so map over the DataFrame's underlying RDD
val result = sparkDF.rdd.map(row => {
    //  Using a UUID as the rowkey here; you can use your own rowkey
    val put = new Put(Bytes.toBytes(UUID.randomUUID().toString))

    //  set the values of each row on the Put object
    ....
    ....

    new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(), put)
})

//  save result to hbase table
result.saveAsNewAPIHadoopDataset(job.getConfiguration)

I have the following dependencies in my build.sbt file:

libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.3.0"

Answer 1 · 0 · 2017-11-27 09:44:21
