SQL Query and DataFrame using Spark/Java
I am a beginner in Spark and I am stuck on how to run a SQL query against a DataFrame.
I have the following two DataFrames.
df_zones
+-----------------+-----------------+----------------------+---------------------+
|id |geomType |geom |rayon |
+-----------------+-----------------+----------------------+---------------------+
|30 |Polygon |[00 00 00 00 01 0...] |200 |
|32 |Point |[00 00 00 00 01 0.. ] |320179 |
+-----------------+-----------------+----------------------+---------------------+
df_tracking
+-----------------+-----------------+----------------------+
|idZones |Longitude |Latitude |
+-----------------+-----------------+----------------------+
|[30,50,100,] | -7.6198783 |33.5942549 |
|[20,140,39,] |-7.6198783 |33.5942549 |
+-----------------+-----------------+----------------------+
I want to execute the following query.
"SELECT zones.* FROM zones WHERE zones.id IN ("
+ idZones
+ ") AND ((zones.geomType='Polygon' AND (ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',"
+ longitude
+ ",' ',"
+ latitude
+ ",')'),4326),zones.geom))) OR ( (zones.geomType='LineString' OR zones.geomType='Point') AND ST_Intersects(ST_buffer(zones.geom,(zones.rayon/100000)),ST_GeomFromText(CONCAT('POINT(',"
+ longitude
+ ",' ',"
+ latitude
+ ",')'),4326)))) "
I'm really stuck. Should I join the two DataFrames, or do something else? I tried to join them on id and idZones like this:
df_tracking.select(explode(col("idZones")).as("idZones")).join(df_zones, col("idZones").equalTo(df_zones.col("id")));
but it seems to me that a join is not the right choice.
I need your help.
Thank you
You can convert df_tracking.idZones (e.g. [20, 140, 39]) into an Array() type and use array_contains(), which makes things simpler when joining against a range of elements.
val joinDF = df_zones.join(df_tracking, array_contains($"id_Zones",$"id"))
Sample Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JoinExample extends App {

  val spark = SparkSession.builder()
    .master("local[8]")
    .appName("Example")
    .getOrCreate()

  import spark.implicits._

  val df_zones = Seq(
    (30, "Polygon", "[00 00 00 00 01]", 200),
    (32, "Point", "[00 00 00 00 01]", 320179),
    (39, "Point", "[00 00 00 00 01]", 320179)
  ).toDF("id", "geomType", "geom", "rayon")

  val df_tracking = Seq(
    (Array(30, 50, 100), "-7.6198783", "33.5942549"),
    (Array(20, 140, 39), "-7.6198783", "33.5942549")
  ).toDF("id_Zones", "Longitude", "Latitude")

  df_zones.show()
  df_tracking.show()

  // Join each tracking row to every zone whose id appears in its id_Zones array
  val joinDF = df_zones.join(df_tracking, array_contains($"id_Zones", $"id"))
  joinDF.show()
Output:
+---+--------+----------------+------+
| id|geomType| geom| rayon|
+---+--------+----------------+------+
| 30| Polygon|[00 00 00 00 01]| 200|
| 32| Point|[00 00 00 00 01]|320179|
| 39| Point|[00 00 00 00 01]|320179|
+---+--------+----------------+------+
+-------------+----------+----------+
| id_Zones| Longitude| Latitude|
+-------------+----------+----------+
|[30, 50, 100]|-7.6198783|33.5942549|
|[20, 140, 39]|-7.6198783|33.5942549|
+-------------+----------+----------+
+---+--------+----------------+------+-------------+----------+----------+
| id|geomType| geom| rayon| id_Zones| Longitude| Latitude|
+---+--------+----------------+------+-------------+----------+----------+
| 30| Polygon|[00 00 00 00 01]| 200|[30, 50, 100]|-7.6198783|33.5942549|
| 39| Point|[00 00 00 00 01]|320179|[20, 140, 39]|-7.6198783|33.5942549|
+---+--------+----------------+------+-------------+----------+----------+
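To see what this join condition does, here is the same containment logic sketched with plain Scala collections (hypothetical zone/tracking tuples mirroring the DataFrames above; this is only an illustration of the semantics, not Spark code):

```scala
// Zones as (id, geomType); tracking rows as (idZones, longitude, latitude)
val zones = Seq((30, "Polygon"), (32, "Point"), (39, "Point"))
val tracking = Seq(
  (Seq(30, 50, 100), -7.6198783, 33.5942549),
  (Seq(20, 140, 39), -7.6198783, 33.5942549)
)

// array_contains($"id_Zones", $"id") keeps every (zone, tracking) pair
// where the zone id appears in the tracking row's idZones list
val joined = for {
  (id, geomType)      <- zones
  (idZones, lon, lat) <- tracking
  if idZones.contains(id)
} yield (id, geomType, lon, lat)

println(joined) // zones 30 and 39 match a row; zone 32 drops out
```

This mirrors the joined output: zone 30 pairs with the first tracking row and zone 39 with the second, while zone 32 appears in neither array.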
Edit-1: Continuing the above, the query's spatial conditions are best expressed by defining Spark UDFs; the code snippet below gives a brief idea.
  // UDF creation: the bodies below are placeholders that always return 1.
  // condition1 should implement:
  //   ST_WITHIN(ST_GeomFromText(CONCAT('POINT(', longitude, ' ', latitude, ')'), 4326), zones.geom)
  val condition1 = (x: String) => { 1 }

  // condition2 should implement:
  //   ST_Intersects(ST_buffer(zones.geom, (zones.rayon / 100000)),
  //                 ST_GeomFromText(CONCAT('POINT(', longitude, ' ', latitude, ')'), 4326))
  val condition2 = (y: String) => { 1 }

  val condition1UDF = udf(condition1)
  val condition2UDF = udf(condition2)

  val joinDF = df_zones.join(df_tracking, array_contains($"id_Zones", $"id"))

  val finalDF = joinDF
    .withColumn("Condition1DerivedValue", condition1UDF(lit("000")))
    .withColumn("Condition2DerivedValue", condition2UDF(lit("000")))
    .filter(
      (col("geomType") === "Polygon" and col("Condition1DerivedValue") === 1)
        or ((col("geomType") === "LineString" or col("geomType") === "Point")
          and $"Condition2DerivedValue" === 1)
    )
    .select("id", "geomType", "geom", "rayon")

  finalDF.show()
}
Output:
+---+--------+----------------+------+
| id|geomType| geom| rayon|
+---+--------+----------------+------+
| 30| Polygon|[00 00 00 00 01]| 200|
| 39| Point|[00 00 00 00 01]|320179|
+---+--------+----------------+------+
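The placeholder UDFs above always return 1; the real implementations need the WKT point text that the query's CONCAT builds, and the buffer radius converted from rayon to degrees. A minimal plain-Scala sketch of those two pieces (the spatial predicates themselves would still have to come from a geospatial library such as GeoSpark/Sedona, which provides ST_Within and ST_Intersects):

```scala
// Build the WKT literal that CONCAT('POINT(', longitude, ' ', latitude, ')') produces
def pointWkt(longitude: Double, latitude: Double): String =
  s"POINT($longitude $latitude)"

// The query divides rayon (assumed metres) by 100000 to approximate a radius in degrees
def bufferRadiusDegrees(rayonMetres: Double): Double =
  rayonMetres / 100000.0

val wkt = pointWkt(-7.6198783, 33.5942549)
println(wkt)                      // POINT(-7.6198783 33.5942549)
println(bufferRadiusDegrees(200)) // 0.002
```

Passing the Longitude, Latitude, and rayon columns into the UDFs (instead of the lit("000") placeholder) would then let each condition evaluate the geometry test per row.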