SQL Query and DataFrame using Spark/Java
I am a beginner in Spark and I am stuck on how to run a SQL query against a DataFrame.
I have the following two DataFrames.
df_zones
+-----------------+-----------------+----------------------+---------------------+
|id |geomType |geom |rayon |
+-----------------+-----------------+----------------------+---------------------+
|30 |Polygon |[00 00 00 00 01 0...] |200 |
|32 |Point |[00 00 00 00 01 0.. ] |320179 |
+-----------------+-----------------+----------------------+---------------------+
df_tracking
+-----------------+-----------------+----------------------+
|idZones |Longitude |Latitude |
+-----------------+-----------------+----------------------+
|[30,50,100,] | -7.6198783 |33.5942549 |
|[20,140,39,] |-7.6198783 |33.5942549 |
+-----------------+-----------------+----------------------+
I want to execute the following query.
"SELECT zones.* FROM zones WHERE zones.id IN ("
+ idZones
+ ") AND ((zones.geomType='Polygon' AND (ST_WITHIN(ST_GeomFromText(CONCAT('POINT(',"
+ longitude
+ ",' ',"
+ latitude
+ ",')'),4326),zones.geom))) OR ( (zones.geomType='LineString' OR zones.geomType='Point') AND ST_Intersects(ST_buffer(zones.geom,(zones.rayon/100000)),ST_GeomFromText(CONCAT('POINT(',"
+ longitude
+ ",' ',"
+ latitude
+ ",')'),4326)))) "
I'm really stuck. Should I join the two DataFrames, or do something else? I tried to join them on id and idZones like this:
df_tracking.select(explode(col("idZones")).as("idZones")).join(df_zones, col("idZones").equalTo(df_zones.col("id")));
but it seems to me that a join is not the right choice.
I need your help.
Thank you
You can convert df_tracking.idZones (e.g. [20, 140, 39]) into an Array() type and use array_contains(), which makes things simpler when joining against a range of elements.
val joinDF = df_zones.join(df_tracking, array_contains($"id_Zones",$"id"))
Sample Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JoinExample extends App {

  val spark = SparkSession.builder()
    .master("local[8]")
    .appName("Example")
    .getOrCreate()

  import spark.implicits._

  val df_zones = Seq(
    (30, "Polygon", "[00 00 00 00 01]", 200),
    (32, "Point", "[00 00 00 00 01]", 320179),
    (39, "Point", "[00 00 00 00 01]", 320179)
  ).toDF("id", "geomType", "geom", "rayon")

  val df_tracking = Seq(
    (Array(30, 50, 100), "-7.6198783", "33.5942549"),
    (Array(20, 140, 39), "-7.6198783", "33.5942549")
  ).toDF("id_Zones", "Longitude", "Latitude")

  df_zones.show()
  df_tracking.show()

  // Join each tracking row to every zone whose id appears in its id_Zones array
  val joinDF = df_zones.join(df_tracking, array_contains($"id_Zones", $"id"))
  joinDF.show()
Output:
+---+--------+----------------+------+
| id|geomType| geom| rayon|
+---+--------+----------------+------+
| 30| Polygon|[00 00 00 00 01]| 200|
| 32| Point|[00 00 00 00 01]|320179|
| 39| Point|[00 00 00 00 01]|320179|
+---+--------+----------------+------+
+-------------+----------+----------+
| id_Zones| Longitude| Latitude|
+-------------+----------+----------+
|[30, 50, 100]|-7.6198783|33.5942549|
|[20, 140, 39]|-7.6198783|33.5942549|
+-------------+----------+----------+
+---+--------+----------------+------+-------------+----------+----------+
| id|geomType| geom| rayon| id_Zones| Longitude| Latitude|
+---+--------+----------------+------+-------------+----------+----------+
| 30| Polygon|[00 00 00 00 01]| 200|[30, 50, 100]|-7.6198783|33.5942549|
| 39| Point|[00 00 00 00 01]|320179|[20, 140, 39]|-7.6198783|33.5942549|
+---+--------+----------------+------+-------------+----------+----------+
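To see what this join condition does, here is the same containment logic sketched with plain Scala collections (hypothetical zone/tracking tuples mirroring the DataFrames above; this is only an illustration of the semantics, not Spark code):

```scala
// Zones as (id, geomType); tracking rows as (idZones, longitude, latitude)
val zones = Seq((30, "Polygon"), (32, "Point"), (39, "Point"))
val tracking = Seq(
  (Seq(30, 50, 100), -7.6198783, 33.5942549),
  (Seq(20, 140, 39), -7.6198783, 33.5942549)
)

// array_contains($"id_Zones", $"id") keeps every (zone, tracking) pair
// where the zone id appears in the tracking row's idZones list
val joined = for {
  (id, geomType)      <- zones
  (idZones, lon, lat) <- tracking
  if idZones.contains(id)
} yield (id, geomType, lon, lat)

println(joined) // zones 30 and 39 match a row; zone 32 drops out
```

This mirrors the joined output: zone 30 pairs with the first tracking row and zone 39 with the second, while zone 32 appears in neither array.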
Edit-1: Continuing the above, the query's spatial conditions are best expressed by defining Spark UDFs; the code snippet below gives a brief idea.
  // UDF creation: the bodies below are placeholders that always return 1.
  // condition1 should implement:
  //   ST_WITHIN(ST_GeomFromText(CONCAT('POINT(', longitude, ' ', latitude, ')'), 4326), zones.geom)
  val condition1 = (x: String) => { 1 }

  // condition2 should implement:
  //   ST_Intersects(ST_buffer(zones.geom, (zones.rayon / 100000)),
  //                 ST_GeomFromText(CONCAT('POINT(', longitude, ' ', latitude, ')'), 4326))
  val condition2 = (y: String) => { 1 }

  val condition1UDF = udf(condition1)
  val condition2UDF = udf(condition2)

  val joinDF = df_zones.join(df_tracking, array_contains($"id_Zones", $"id"))

  val finalDF = joinDF
    .withColumn("Condition1DerivedValue", condition1UDF(lit("000")))
    .withColumn("Condition2DerivedValue", condition2UDF(lit("000")))
    .filter(
      (col("geomType") === "Polygon" and col("Condition1DerivedValue") === 1)
        or ((col("geomType") === "LineString" or col("geomType") === "Point")
          and $"Condition2DerivedValue" === 1)
    )
    .select("id", "geomType", "geom", "rayon")

  finalDF.show()
}
Output:
+---+--------+----------------+------+
| id|geomType| geom| rayon|
+---+--------+----------------+------+
| 30| Polygon|[00 00 00 00 01]| 200|
| 39| Point|[00 00 00 00 01]|320179|
+---+--------+----------------+------+
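The placeholder UDFs above always return 1; the real implementations need the WKT point text that the query's CONCAT builds, and the buffer radius converted from rayon to degrees. A minimal plain-Scala sketch of those two pieces (the spatial predicates themselves would still have to come from a geospatial library such as GeoSpark/Sedona, which provides ST_Within and ST_Intersects):

```scala
// Build the WKT literal that CONCAT('POINT(', longitude, ' ', latitude, ')') produces
def pointWkt(longitude: Double, latitude: Double): String =
  s"POINT($longitude $latitude)"

// The query divides rayon (assumed metres) by 100000 to approximate a radius in degrees
def bufferRadiusDegrees(rayonMetres: Double): Double =
  rayonMetres / 100000.0

val wkt = pointWkt(-7.6198783, 33.5942549)
println(wkt)                      // POINT(-7.6198783 33.5942549)
println(bufferRadiusDegrees(200)) // 0.002
```

Passing the Longitude, Latitude, and rayon columns into the UDFs (instead of the lit("000") placeholder) would then let each condition evaluate the geometry test per row.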