
Test Spark SQL Queries Locally

Recently I have been working on a Spark application in which, as part of the project, a dataset is read from an HBase server, Spark SQL modifies the data, and the result is saved to Kafka.

The problem I am facing is that I can't test spark.sql locally. Every time I have to submit the application jar and run it on the server. For plain SQL we have tools to test all the queries in a local environment.

Is there a way, or another tool, to test Spark SQL locally while reading data from HBase?

I tried hbaseExplorer, but it does not solve the problem.

Thanks,

If you are talking about unit testing your Spark SQL queries, you can always create a Dataset locally and run queries against it:

scala> val df = List((1, false, 1.0),
     |   (2, true, 2.0)
     | ).toDF("col1", "col2", "col3")
df: org.apache.spark.sql.DataFrame = [col1: int, col2: boolean ... 1 more field]

scala> df.createOrReplaceTempView("myTable")

scala> spark.sql("select sum(col3) from myTable").show
+---------+
|sum(col3)|
+---------+
|      3.0|
+---------+
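
The same idea carries over to an automated test. Below is a minimal sketch (the class and test names are illustrative; it assumes scalatest and spark-sql are on the test classpath):

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

// Minimal sketch of a Spark SQL unit test; names are illustrative.
class MyQuerySpec extends AnyFunSuite {

  test("sum(col3) over a small local DataFrame") {
    // local[2] runs Spark inside the test JVM, no cluster needed
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("spark-sql-unit-test")
      .getOrCreate()
    import spark.implicits._

    val df = List((1, false, 1.0), (2, true, 2.0)).toDF("col1", "col2", "col3")
    df.createOrReplaceTempView("myTable")

    val result = spark.sql("select sum(col3) from myTable").collect()(0).getDouble(0)
    assert(result == 3.0)

    spark.stop()
  }
}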

Using Apache Phoenix

If you have access to Apache Phoenix, open spark-shell on your local machine and connect to Apache Phoenix using its JDBC connection details.
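
For example, something like the following in a local spark-shell (a sketch: the ZooKeeper quorum zk1:2181 and the table name MY_TABLE are placeholders, and it assumes the Phoenix client jar is on the classpath, e.g. passed via --jars):

// Sketch: read a Phoenix table over JDBC; URL and table name are placeholders.
val phoenixDf = spark.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:zk1:2181")
  .option("dbtable", "MY_TABLE")
  .load()

phoenixDf.createOrReplaceTempView("myPhoenixTable")
spark.sql("select count(*) from myPhoenixTable").show()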

Using Direct Connection to HBase: You can also connect to HBase directly from your local spark-shell. This is somewhat difficult if your cluster is secured or has Kerberos enabled.
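
For example, with the Apache hbase-spark connector (a sketch: it assumes the connector jar is on the spark-shell classpath and a local hbase-site.xml points at your cluster; the table name and column mapping are placeholders):

// Sketch: read an HBase table directly via the hbase-spark connector.
// "default:my_table" and the column mapping are placeholders.
val hbaseDf = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "default:my_table")
  .option("hbase.columns.mapping",
    "rowKey STRING :key, col1 STRING cf1:col1")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

hbaseDf.createOrReplaceTempView("HbaseDirect")
spark.sql("select * from HbaseDirect limit 10").show(false)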

Using Export Sample Data (the easy way; it will also save a lot of time).

For testing purposes:

  1. Export sample data from your HBase table to JSON, CSV, or any other format you like (a small sketch follows this list).
  2. Download that data to your local system.
  3. Use spark-shell to create a table with the same structure as your HBase table, e.g. spark.sql("CREATE TABLE HbaseTable ..")
  4. Load the downloaded sample data into a DataFrame.
  5. Write the DataFrame into the newly created table.
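
For step 1, the export itself can be as simple as taking a limited sample of the DataFrame your job already reads from HBase and writing it out (a sketch; hbaseDf and the output path are placeholders):

// Sketch, run on the cluster: export a small sample of the HBase-backed
// DataFrame as JSON; hbaseDf and the path are placeholders.
hbaseDf.limit(1000).write.json("/tmp/spark/data")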

Check the steps below for reference.


/tmp/spark > ls -ltr
total 0
drwxr-xr-x  14 srinivas  wheel  448 Nov 20 02:45 data
/tmp/spark > ls -ltr data
total 40
-rw-r--r--  1 srinivas  wheel  9 Nov 20 02:45 part-00000-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r--  1 srinivas  wheel  9 Nov 20 02:45 part-00004-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r--  1 srinivas  wheel  9 Nov 20 02:45 part-00002-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r--  1 srinivas  wheel  9 Nov 20 02:45 part-00003-4f5f5245-f664-426b-8204-a981871a1205-c000.json
-rw-r--r--  1 srinivas  wheel  9 Nov 20 02:45 part-00001-4f5f5245-f664-426b-8204-a981871a1205-c000.json

Open spark-shell from the /tmp/spark directory:

/tmp/spark > spark-shell

scala> val df = spark.read.json("/tmp/spark/data")
df: org.apache.spark.sql.DataFrame = [id: bigint]

scala> spark.sql("create table HBaseTable(id int) stored as orc")
res0: org.apache.spark.sql.DataFrame = []

scala> df.write.insertInto("HbaseTable")

scala> spark.sql("select * from HbaseTable").show(false)
+---+
|id |
+---+
|4  |
|3  |
|1  |
|5  |
|2  |
+---+
scala> :q
/tmp/spark > ls -ltr
total 8
drwxr-xr-x  14 srinivas  wheel  448 Nov 20 02:45 data
-rw-r--r--   1 srinivas  wheel  700 Nov 20 02:45 derby.log
drwxr-xr-x   9 srinivas  wheel  288 Nov 20 02:45 metastore_db
drwxr-xr-x   3 srinivas  wheel   96 Nov 20 02:46 spark-warehouse
/tmp/spark > ls -ltr spark-warehouse
total 0
drwxr-xr-x  12 srinivas  wheel  384 Nov 20 02:46 hbasetable
/tmp/spark > ls -ltr spark-warehouse/hbasetable
total 40
-rwxr-xr-x  1 srinivas  wheel  196 Nov 20 02:46 part-00002-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x  1 srinivas  wheel  196 Nov 20 02:46 part-00001-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x  1 srinivas  wheel  196 Nov 20 02:46 part-00003-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x  1 srinivas  wheel  196 Nov 20 02:46 part-00000-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000
-rwxr-xr-x  1 srinivas  wheel  196 Nov 20 02:46 part-00004-5a3504cd-71c1-46fa-833f-76bf9178e46f-c000

Note - From next time onwards, if you want to do any testing on your HBase sample data, you have to open spark-shell from /tmp/spark, the same directory where you created the table. It will not work if you open spark-shell in a different directory and try to access the HbaseTable table, because the embedded Derby metastore (metastore_db) and the spark-warehouse directory are created relative to the working directory.
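
If you would rather not be tied to one directory, an untested sketch is to pin both locations explicitly when launching spark-shell (spark.sql.warehouse.dir and the Derby connection URL are standard Spark/Hive settings; the paths are the ones used above):

spark-shell \
  --conf spark.sql.warehouse.dir=/tmp/spark/spark-warehouse \
  --conf "spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/spark/metastore_db;create=true"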
