How to join two HBase tables in Spark using Scala

I have two tables in HBase that I need to join using Scala. The tables were imported from Oracle using Sqoop and are available for querying in the Hue data browser.

Using Spark 1.5, Scala 2.10.4.

I'm using the HBase data connector from here: https://github.com/nerdammer/spark-hbase-connector

import it.nerdammer.spark.hbase._
import org.apache.hadoop.hbase.client.{ HBaseAdmin, Result }
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark._
import it.nerdammer.spark.hbase.conversion.{ FieldReader, FieldWriter }
import org.apache.hadoop.hbase.util.Bytes

case class Artist(id: String,
                 name: String,
                 age: Int);

case class Cd(id: String,
              artistId: String,
              title: String,
              year: Int);

case class ArtistCd(id: String,
                    name: String,
                    title: String,
                    year: Int);

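// Custom FieldReaders for the connector: the first element of HBaseData holds the row key,
// and the remaining cells follow the order declared in columns below.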
implicit def artistReader: FieldReader[Artist] = new FieldReader[Artist] {

    override def map(data: HBaseData): Artist = Artist(

        id = Bytes.toString(data.head.get),
        name = Bytes.toString(data.drop(1).head.get),
        age = Bytes.toInt(data.drop(2).head.get));

    override def columns = Seq("NAME", "AGE");

};

implicit def cdReader: FieldReader[Cd] = new FieldReader[Cd] {

    override def map(data: HBaseData): Cd = Cd(

        id = Bytes.toString(data.head.get),
        artistId = Bytes.toString(data.drop(1).head.get),
        title = Bytes.toString(data.drop(2).head.get),
        year = Bytes.toInt(data.drop(3).head.get));

    override def columns = Seq("ARTIST_ID", "TITLE", "YEAR");

};

implicit def artistCdWriter: FieldWriter[ArtistCd] = new FieldWriter[ArtistCd] {
    override def map(data: ArtistCd): HBaseData =
        Seq(
            Some(Bytes.toBytes(data.id)),
            Some(Bytes.toBytes(data.name)),
            Some(Bytes.toBytes(data.title)),
            Some(Bytes.toBytes(data.year)));

    override def columns = Seq("NAME", "TITLE", "YEAR");
};

val conf = new SparkConf().setAppName("HBase Join").setMaster("spark://localhost:7337")
val sc = new SparkContext(conf)

val artistRDD = sc.hbaseTable[Artist]("ARTISTS").inColumnFamily("cf")
val cdRDD = sc.hbaseTable[Cd]("CDS").inColumnFamily("cf")

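// Key artists by row id and CDs by their ARTIST_ID foreign key, so the join matches each CD to its artist.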
val artistById = artistRDD.keyBy(f => f.id)
val cdById = cdRDD.keyBy(f => f.artistId)

val artistcd = artistById.join(cdById)

val artistCdRDD = artistcd.map(f => ArtistCd(f._2._1.id, f._2._1.name, f._2._2.title, f._2._2.year))
artistCdRDD.toHBaseTable("ARTIST_CD").inColumnFamily("cf").save()
System.exit(1)

When I run this I get the following exception:

16/01/22 14:27:04 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2068 actions: ARTIST_CD: 2068 times,
        at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:227)
        at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:207)
        at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1663)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:208)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.doMutate(BufferedMutatorImpl.java:141)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:98)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:129)
        at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1036)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1042)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

If anyone has any experience in this, I'd really appreciate your help.

I've seen the two solutions here: How to Join two tables in Hbase and how to join tables in hbase. Unfortunately, neither of them will work for me.

Figured it out: first, the new table needs to already exist. I had thought the save() command would create it, but it doesn't. Also, the new table has to have the column family you're saving to, here "cf".
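A minimal sketch of creating the target table up front, assuming the standard HBase 1.x client API (consistent with the imports in the question); the table and column family names are the ones used above:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

// save() writes into an existing table, so create ARTIST_CD with the "cf" family first.
val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)
val admin = connection.getAdmin
try {
  val tableName = TableName.valueOf("ARTIST_CD")
  if (!admin.tableExists(tableName)) {
    val descriptor = new HTableDescriptor(tableName)
    descriptor.addFamily(new HColumnDescriptor("cf"))
    admin.createTable(descriptor)
  }
} finally {
  admin.close()
  connection.close()
}

The same thing can be done once in the HBase shell with create 'ARTIST_CD', 'cf'.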

Example 1)

spark-shell --driver-class-path={put apache spark lib path}:{put hbase lib path}

spark-shell --driver-class-path=/usr/local/Cellar/apache-spark/2.4.0/libexec/jars/*:/usr/local/Cellar/hbase-1.4.9/lib/*

Example 2)

spark-shell --driver-class-path=$SPARK_HOME:$(hbase classpath)
