Geo Filter with Spark DataFrame

I'm new to DataFrames in Spark, and they can be a bit weird at times. Let's say I have a DataFrame containing logs with latitude and longitude coordinates.

 logsDF.printSchema:
 root
 |-- lat: double (nullable = false)
 |-- lon: double (nullable = false)
 |-- imp: string (nullable = false)
 |-- log_date: string (nullable = true)
 |-- pubuid: string (nullable = true)

On the other hand, I have a simple method

within(lat: Double, long: Double, radius: Double): Boolean

that tells whether a given latitude and longitude fall within a certain radius of a pre-defined location.
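For concreteness, such a method could look something like the following sketch, which assumes a haversine great-circle distance to a fixed reference point (the CENTER_LAT / CENTER_LON constants and the kilometre unit are illustrative assumptions, not part of the original code):

val CENTER_LAT = 48.8566 // hypothetical pre-defined location
val CENTER_LON = 2.3522

def within(lat: Double, long: Double, radius: Double): Boolean = {
  val R = 6371.0 // mean Earth radius in km
  val dLat = math.toRadians(lat - CENTER_LAT)
  val dLon = math.toRadians(long - CENTER_LON)
  // Haversine formula for the great-circle distance between the two points.
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(CENTER_LAT)) * math.cos(math.toRadians(lat)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(a)) <= radius // distance in km vs. radius
}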

Now, how do I filter out log rows that do not satisfy within? I tried

logsDF.filter(within(logsDF("lat"), logsDF("lon"), RADIUS))

But this does not compile: logsDF("lat") has type Column, not Double, so it cannot be passed to within. How can I get this working? The docs on the Spark site are a bit simplistic, and I'm sure I'm missing something.

Thank you for your help.

Generally speaking, you need at least two things to make this work. First, you have to create a UDF wrapping within:

import org.apache.spark.sql.functions.{udf, lit}

val withinUDF = udf(within _)

Next, when the UDF is called, radius should be marked as a literal:

df.where(withinUDF($"lat", $"lon", lit(RADIUS)))
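Putting the pieces together, a self-contained sketch (the SparkSession setup and the sample rows are illustrative; within is the method from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, lit}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val RADIUS = 5.0 // same unit as within expects

// Toy rows matching the schema from the question (values invented).
val df = Seq(
  (48.86, 2.35, "imp-1", "2015-10-22", "pub-1"),
  (40.71, -74.01, "imp-2", "2015-10-22", "pub-2")
).toDF("lat", "lon", "imp", "log_date", "pubuid")

val withinUDF = udf(within _)
df.where(withinUDF($"lat", $"lon", lit(RADIUS))).show()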

Since not every type can be passed this way, and since creating wrappers and calling lit is rather tedious, you may prefer currying:

def withinRadius(radius: Double) =
  udf((lat: Double, long: Double) => within(lat, long, radius)) // delegates to the within from the question

df.where(withinRadius(RADIUS)($"lat", $"lon"))
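With the curried version, the radius is captured in the closure when the UDF is created, so there is no lit call and no restriction to types that can be converted to literal columns; the trade-off is that every distinct radius value creates a new UDF.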
