Reading data from HBase using Apache Spark

I have an Apache Spark application written in Scala that tries to read data from HBase and do something with it.

I've encountered ways to do just that, like this, and also how to do it using Spark Streaming.

So I wrote the following code:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
    // sc was previously undefined in this snippet; create it explicitly
    val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
    val configuration = HBaseConfiguration.create()
    configuration.set(TableInputFormat.INPUT_TABLE, "urls")
    configuration.set(TableInputFormat.SCAN_COLUMNS, "values:words")
    // Read the table as an RDD of (row key, Result) pairs
    val hbaseRdd = sc.newAPIHadoopRDD(configuration,
        classOf[TableInputFormat],
        classOf[ImmutableBytesWritable],
        classOf[Result]
    )
    // Pull the row key out of each Result as a String
    val data = hbaseRdd.map(entry => {
      val result = entry._2
      Bytes.toString(result.getRow)
    })
    data.foreach(println)
}

My HBase table is created like this: create 'urls', {NAME => 'values', VERSIONS => 5}

What I'm getting is:

16/03/10 17:10:17 ERROR TableInputFormat: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:218)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:183)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:241)
    at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:237)

After reading about this exception here, I should probably also include this part of the stack trace:

Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
    ... 34 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hbase.ipc.RpcClientImpl cannot be cast to org.apache.hadoop.hbase.ipc.RpcClient
    at org.apache.hadoop.hbase.ipc.RpcClientFactory.createClient(RpcClientFactory.java:64)
    at org.apache.hadoop.hbase.ipc.RpcClientFactory.createClient(RpcClientFactory.java:48)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:637)
    ... 39 more

My questions are:

  1. Can someone show a basic way of retrieving data from HBase using Spark, preferably something more up to date than the links I've shown?
  2. If I'm doing something wrong in the code, I'd appreciate it if you could show me what.

It would be even better if I could somehow read the data as a DataFrame.

I'm using Spark 1.6.0 and HBase 1.2.0.
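
For reference, on Spark 1.6 turning the row keys above into a DataFrame should only need an SQLContext. A minimal sketch (not from the original post), reusing sc and the data RDD[String] from the code above; the rowKey column name is arbitrary:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// RDD[String] -> single-column DataFrame via the implicit conversion
val df = data.toDF("rowKey")
df.show()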

Thanks in advance

OK, so apparently it was an unexpected dependency issue (as is always the case when the error doesn't make any sense).

These are the steps I took in order to solve this issue (hopefully they will help future developers):

  1. I created a clean project with the exact same code. It ran with no issues, which immediately made me suspect some sort of dependency problem.
  2. To make sure, I moved the HBase dependency to the top of the dependency list. This produced a different exception, related to Spark and security, specifically: javax.servlet.FilterRegistration
  3. I then came across this useful solution, which solved the issue for me: I had to exclude all the javax and mortbay jetty artifacts from my pom (a sketch of the kind of exclusions involved follows this list). That fixed all my issues.
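
For illustration, the pom exclusions looked roughly like this. This is only a sketch: I'm assuming hbase-server is the dependency pulling in the conflicting jars (the exact artifact list may differ per setup), and wildcard exclusions require Maven 3.2.1+:

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-server</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <!-- these clash with the servlet/jetty classes Spark ships -->
        <exclusion>
            <groupId>org.mortbay.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <exclusion>
            <groupId>javax.servlet</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>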

That's it :)
