
Bulk Insert Data in HBase using Structured Spark Streaming

I'm reading data coming from Kafka (100,000 lines per second) using Structured Spark Streaming, and I'm trying to insert all the data into HBase.

I'm on Cloudera Hadoop 2.6 and I'm using Spark 2.3.

I tried something like what I've seen here.

eventhubs.writeStream
 .foreach(new MyHBaseWriter[Row])
 .option("checkpointLocation", checkpointDir)
 .start()
 .awaitTermination()

MyHBaseWriter looks like this:

class AtomeHBaseWriter[RECORD] extends HBaseForeachWriter[Row] {
  override val tableName: String = "hbase-table-name"

  override def toPut(record: Row): Put = {
    // Parse the JSON payload of the Kafka value
    val data = JSON.parseFull(record.getString(0)).asInstanceOf[Some[Map[String, Object]]]
    val key = data.getOrElse(Map())("key") + ""
    val value = data.getOrElse(Map())("val") + ""

    val p = new Put(Bytes.toBytes(key))
    // Add columns ...
    p.addColumn(Bytes.toBytes(columnFamilyName), Bytes.toBytes(columnName), Bytes.toBytes(value))

    p
  }
}

And the HBaseForeachWriter class looks like this:

trait HBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
  val tableName: String

  def pool: Option[ExecutorService] = None

  def user: Option[User] = None

  private var hTable: Table = _
  private var connection: Connection = _


  override def open(partitionId: Long, version: Long): Boolean = {
    connection = createConnection()
    hTable = getHTable(connection)
    true
  }

  def createConnection(): Connection = {
    // I create HBase Connection Here
  }

  def getHTable(connection: Connection): Table = {
    connection.getTable(TableName.valueOf(Variables.getTableName()))
  }

  override def process(record: RECORD): Unit = {
    val put = toPut(record)
    hTable.put(put)
  }

  override def close(errorOrNull: Throwable): Unit = {
    hTable.close()
    connection.close()
  }

  def toPut(record: RECORD): Put
}

So here I'm doing a put row by row, and even if I allow 20 executors with 4 cores each, the data is not inserted into HBase quickly enough. So what I need to do is a bulk load, but I'm struggling because everything I find on the internet implements it with RDDs and MapReduce.

What I understand is that the rate of record ingestion into HBase is slow. I have a few suggestions for you.

1) hbase.client.write.buffer
The property below may help you.

 hbase.client.write.buffer

Description: Default size of the BufferedMutator write buffer in bytes. A bigger buffer takes more memory, on both the client and the server side, since the server instantiates the passed write buffer to process it, but a larger buffer size reduces the number of RPCs made. For an estimate of the server-side memory used, evaluate hbase.client.write.buffer * hbase.regionserver.handler.count.

Default: 2097152 (around 2 MB)

I prefer foreachBatch, see the Spark docs (it's the equivalent of foreachPartition in Spark Core), rather than foreach.
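A minimal sketch of what that could look like (foreachBatch was added in Spark 2.4, so it needs an upgrade from the Spark 2.3 mentioned in the question; eventhubs and checkpointDir are the names from the question, and writeHbase is the foreachPartition-based helper shown further down):

import org.apache.spark.sql.DataFrame

eventhubs.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch arrives as a plain DataFrame, so it can be written
    // partition by partition with a list of Puts instead of one RPC per row.
    writeHbase(batchDf)
  }
  .option("checkpointLocation", checkpointDir)
  .start()
  .awaitTermination()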

Also, in your HBase writer that extends ForeachWriter:

In open, initialize an ArrayList of Puts; in process, add each Put to that list; in close, call table.put(listOfPuts) and then reset the ArrayList once you have updated the table...
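A minimal sketch of that buffering idea, reshaping the question's HBaseForeachWriter (the BufferedHBaseForeachWriter name and the flushSize threshold are illustrative assumptions, not from the original):

import java.util
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put, Table}
import org.apache.spark.sql.ForeachWriter

trait BufferedHBaseForeachWriter[RECORD] extends ForeachWriter[RECORD] {
  val tableName: String
  val flushSize: Int = 1000                    // illustrative flush threshold

  private var hTable: Table = _
  private var connection: Connection = _
  private val puts = new util.ArrayList[Put]() // buffer of pending Puts

  def createConnection(): Connection
  def toPut(record: RECORD): Put

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = createConnection()
    hTable = connection.getTable(TableName.valueOf(tableName))
    puts.clear()                   // writer instances can be reused, start clean
    true
  }

  override def process(record: RECORD): Unit = {
    puts.add(toPut(record))        // buffer instead of one RPC per record
    if (puts.size() >= flushSize) {
      hTable.put(puts)             // Table.put(List[Put]) sends them in one batch
      puts.clear()
    }
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (!puts.isEmpty) hTable.put(puts)  // flush whatever is left for this partition
    hTable.close()
    connection.close()
  }
}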

What this basically does: once the buffer size mentioned above is filled with 2 MB of data, it flushes into the HBase table; until then, records won't reach the HBase table.

You can increase that to 10 MB and so on... That way the number of RPCs will be reduced, and a large chunk of data will be flushed and land in the HBase table at once.

When the write buffer fills up, a flushCommits into the HBase table is triggered.
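As an illustration, a minimal sketch of raising that buffer on the client side; the 10 MB figure and the table name are assumptions, and BufferedMutator is the client class the property refers to:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{BufferedMutatorParams, ConnectionFactory}

val conf = HBaseConfiguration.create()
conf.setLong("hbase.client.write.buffer", 10 * 1024 * 1024)  // 10 MB instead of the 2 MB default

val connection = ConnectionFactory.createConnection(conf)
val params = new BufferedMutatorParams(TableName.valueOf("hbase-table-name"))
  .writeBufferSize(10 * 1024 * 1024)                         // can also be set per mutator
val mutator = connection.getBufferedMutator(params)

// mutator.mutate(put) only buffers the Put; the client flushes automatically
// once the buffer fills up, or when mutator.flush() / mutator.close() is called.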

Example code: in my answer

2) Switch off the WAL. You can switch off the WAL (write-ahead log; the danger is that there is no recovery), and it will speed up writes... if you don't need to recover the data.

Note: if you are using Solr or Cloudera Search on HBase tables, you should not turn it off, since Solr works off the WAL. If you switch it off, Solr indexing won't work. This is a common mistake many of us make.

How to switch it off: https://hbase.apache.org/1.1/apidocs/org/apache/hadoop/hbase/client/Put.html#setWriteToWAL(boolean)
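On newer client versions setWriteToWAL(boolean) is deprecated in favor of setDurability; a minimal sketch (the row key, column family, and value below are illustrative):

import org.apache.hadoop.hbase.client.{Durability, Put}
import org.apache.hadoop.hbase.util.Bytes

val p = new Put(Bytes.toBytes("some-row-key"))
p.setDurability(Durability.SKIP_WAL)   // skip the WAL: faster writes, no recovery
p.addColumn(Bytes.toBytes("c"), Bytes.toBytes("x"), Bytes.toBytes("some-value"))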

As I mentioned, a list of Puts is a good way... This is the old way (foreachPartition with a list of Puts) of doing it before Structured Streaming; an example is below, where foreachPartition operates once per partition, not per row.

def writeHbase(mydataframe: DataFrame) = {
  val columnFamilyName: String = "c"
  mydataframe.foreachPartition(rows => {
    val puts = new util.ArrayList[Put]
    rows.foreach(row => {
      val key = row.getAs[String]("rowKey")
      val p = new Put(Bytes.toBytes(key))
      val columnX = row.getAs[Double]("x")
      val columnY = row.getAs[Long]("y")
      p.addColumn(
        Bytes.toBytes(columnFamilyName),
        Bytes.toBytes("x"),
        Bytes.toBytes(columnX)
      )
      p.addColumn(
        Bytes.toBytes(columnFamilyName),
        Bytes.toBytes("y"),
        Bytes.toBytes(columnY)
      )
      puts.add(p)
    })
    // Writes the whole list of Puts for this partition in one call
    HBaseUtil.putRows(hbaseZookeeperQuorum, hbaseTableName, puts)
  })
}

To sum up:

What I feel is that we need to understand the psychology of Spark and HBase to make them an effective pair.
