I am currently working on a Spark application that reads a dataset from an HBase server, transforms it with Spark SQL, and writes the result to Kafka.
The problem I am facing is that I can't test the Spark SQL locally; every time I have to build the application jar and submit it to the server. With plain SQL we have tools to test all the queries in a local environment.
Is there a way, or are there other tools, to test Spark SQL locally while reading data from HBase?
I tried hbaseExplorer, but it does not solve the problem.
Thanks,
If you are talking about unit testing your Spark SQL queries, you can always create a DataFrame locally and run queries against it:
scala> val df = List((1, false, 1.0),
     |   (2, true, 2.0)
     | ).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: boolean ... 1 more field]
scala> df.createOrReplaceTempView("myTable")
scala> spark.sql("select sum(col3) from myTable").show
+---------+
|sum(col3)|
+---------+
| 3.0|
+---------+
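Outside the shell, the same idea works as a self-contained unit test with a local SparkSession (a minimal sketch; the ScalaTest dependency and all names here are my own, not from the question):

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class SumQuerySpec extends AnyFunSuite {
  test("sum(col3) over a locally built DataFrame") {
    // local[*] master runs Spark inside the test JVM, so no cluster is needed
    val spark = SparkSession.builder().master("local[*]").appName("sql-test").getOrCreate()
    import spark.implicits._

    List((1, false, 1.0), (2, true, 2.0))
      .toDF("col1", "col2", "col3")
      .createOrReplaceTempView("myTable")

    val result = spark.sql("select sum(col3) from myTable").collect().head.getDouble(0)
    assert(result == 3.0)
    spark.stop()
  }
}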
Using Apache Phoenix
If you have access to Apache Phoenix, open spark-shell locally and connect to Phoenix using its JDBC connection details.
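As a rough sketch (the ZooKeeper quorum and table name below are placeholders, not details from the question), a Phoenix table can be read over JDBC like this:

// Hypothetical sketch: start spark-shell with the Phoenix client jar on the classpath,
//   spark-shell --jars /path/to/phoenix-client.jar
val phoenixDf = spark.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:zk-host1,zk-host2:2181:/hbase") // placeholder quorum
  .option("dbtable", "MY_PHOENIX_TABLE")                       // placeholder table
  .load()

phoenixDf.createOrReplaceTempView("phoenixTable")
spark.sql("select count(*) from phoenixTable").show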
Using Direct Connection to HBase
You can also connect to HBase directly from your local spark-shell, although this is somewhat difficult if your cluster is secured or Kerberos-enabled.
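For illustration only (the connector, table name, and column mapping here are assumptions, not from the question), with the Apache hbase-spark connector jar on the classpath and your cluster's hbase-site.xml available, a table can be loaded like this:

// Hypothetical sketch using the hbase-spark connector from Apache hbase-connectors.
val hbaseDf = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "my_table")                                       // placeholder table
  .option("hbase.columns.mapping", "id STRING :key, name STRING cf:name")  // placeholder mapping
  .option("hbase.spark.use.hbasecontext", false)
  .load()

hbaseDf.createOrReplaceTempView("hbaseView")
spark.sql("select * from hbaseView limit 10").show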
Using Export Sample Data (easy way & will save a lot of time)
For testing purposes, export some sample data from HBase to json, csv, or any other format you like. Then, in a local spark-shell, create a table with spark.sql("CREATE TABLE HbaseTable .."), read the exported files into a DataFrame, and insert the DataFrame data into the newly created table. Check the steps below for reference.
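The export step itself runs against the real cluster and is not shown in the listing below; reusing the hypothetical hbaseDf from the sketch above, it could look something like this:

// Hypothetical export of a small sample from the cluster to local json files.
hbaseDf.limit(1000).write.mode("overwrite").json("/tmp/spark/data")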
/tmp/spark > ls -ltr
total 0
drwxr-xr-x 14 srinivas wheel 448 Nov 20 02:45 data
/tmp/spark > ls -ltr data
total 40
-rw-r--r-- 1 srinivas wheel 9 Nov 20 02:45 part-00000-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r-- 1 srinivas wheel 9 Nov 20 02:45 part-00004-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r-- 1 srinivas wheel 9 Nov 20 02:45 part-00002-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r-- 1 srinivas wheel 9 Nov 20 02:45 part-00003-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r-- 1 srinivas wheel 9 Nov 20 02:45 part-00001-4f5f5245-f664-426b-8204-a981871a1205-c000.json
Open spark-shell in the path /tmp/spark:
/tmp/spark > spark-shell
scala> val df = spark.read.json("/tmp/spark/data")
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> spark.sql("create table HBaseTable(id int) stored as orc")
res0: org.apache.spark.sql.DataFrame = []
scala> df.write.insertInto("HbaseTable")
scala> spark.sql("select * from HbaseTable").show(false)
+---+
|id |
+---+
|4 |
|3 |
|1 |
|5 |
|2 |
+---+
scala> :q
/tmp/spark > ls -ltr
total 8
drwxr-xr-x 14 srinivas wheel 448 Nov 20 02:45 data
-rw-r--r-- 1 srinivas wheel 700 Nov 20 02:45 derby.log
drwxr-xr-x 9 srinivas wheel 288 Nov 20 02:45 metastore_db
drwxr-xr-x 3 srinivas wheel 96 Nov 20 02:46 spark-warehouse
/tmp/spark > ls -ltr spark-warehouse
total 0
drwxr-xr-x 12 srinivas wheel 384 Nov 20 02:46 hbasetable
/tmp/spark > ls -ltr spark-warehouse/hbasetable
total 40
-rwxr-xr-x 1 srinivas wheel 196 Nov 20 02:46 part-00002-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x 1 srinivas wheel 196 Nov 20 02:46 part-00001-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x 1 srinivas wheel 196 Nov 20 02:46 part-00003-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x 1 srinivas wheel 196 Nov 20 02:46 part-00000-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x 1 srinivas wheel 196 Nov 20 02:46 part-00004-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
Note - From next time onwards, if you want to do any testing on your HBase sample data, you have to open spark-shell from /tmp/spark, the same directory where you created the table. Spark keeps the embedded Derby metastore (metastore_db) and the spark-warehouse directory in the folder where the shell was started, so it will not work if you open spark-shell in a different directory and try to access the HbaseTable table.
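If starting the shell from a fixed directory becomes inconvenient, one workaround (an untested sketch, using the same paths as above) is to pin the warehouse and embedded Derby metastore locations explicitly, so spark-shell can be launched from anywhere:

spark-shell \
  --conf spark.sql.warehouse.dir=/tmp/spark/spark-warehouse \
  --conf "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/spark/metastore_db;create=true"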