I'm new to scala / spark and I don't know how to ask this kind of question (technical word ...). I have a large csv file, I want to read it in a dataframe and distribute it in several blocks according to a condition on columns and apply the treatment I want on each of the blocks.
Example of my csv file
VehicleID Longitude Latitude Date
12311 55.55431 25.45631 01/02/2020
12311 55.55432 25.45634 01/02/2020
12311 55.55433 25.45637 02/02/2020
12311 55.55431 25.45621 02/02/2020
12309 55.55427 25.45627 01/02/2020
12309 55.55436 25.45655 02/02/2020
12412 55.55441 25.45657 01/02/2020
12412 55.55442 25.45656 02/02/2020
One of the column heading is VehicleID and Date. I would like to split the large CSV into multiple blocks so that each block will have data that belongs to the one unique VehicleID and Date value.
Like that
Bock 1
VehicleID Longitude Latitude Date
12311 55.55431 25.45631 01/02/2020
12311 55.55432 25.45634 01/02/2020
Block2
VehicleID Longitude Latitude Date
12311 55.55433 25.45637 02/02/2020
12311 55.55431 25.45621 02/02/2020
Block3
VehicleID Longitude Latitude Date
12309 55.55427 25.45627 01/02/2020
Block4
VehicleID Longitude Latitude Date
12309 55.55427 25.45627 02/02/2020
I want also apply this function on each block
def haversine_distance(longitude1 : Double,longitude2 : Double,latitude1 : Double,latitude2 : Double) : Double= {
val R = 6372.8
val dlat = math.toRadians(latitude2 - latitude1)
val dlog = math.toRadians(longitude2 - longitude1)
val a = math.sin(dlat / 2) * math.sin(dlat / 2) + math.cos(math.toRadians(latitude1)) * math.cos(math.toRadians(latitude2)) * math.sin(dlog / 2) * math.sin(dlog / 2)
val c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
val distance = R * c
distance
}
How to do thas with scala ?
Thanks.
Your function is needed to use 4 input variables, so your dataframe also should have those variables to calculate. I think this can be achieved by Window
and lag
functions.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy("VehicleID", "Date").orderBy("id")
val df2 = df.withColumn("id", monotonically_increasing_id)
.withColumn("Longitude2", lag("Longitude", 1).over(w))
.withColumn("Latitude2", lag("Latitude", 1).over(w))
.orderBy("id")
df2.show(false)
The result is:
+---------+---------+--------+----------+---+----------+---------+
|VehicleID|Longitude|Latitude|Date |id |Longitude2|Latitude2|
+---------+---------+--------+----------+---+----------+---------+
|12311 |55.55431 |25.45631|01/02/2020|0 |null |null |
|12311 |55.55432 |25.45634|01/02/2020|1 |55.55431 |25.45631 |
|12311 |55.55433 |25.45637|02/02/2020|2 |null |null |
|12311 |55.55431 |25.45621|02/02/2020|3 |55.55433 |25.45637 |
|12309 |55.55427 |25.45627|01/02/2020|4 |null |null |
|12309 |55.55436 |25.45655|02/02/2020|5 |null |null |
|12412 |55.55441 |25.45657|01/02/2020|6 |null |null |
|12412 |55.55442 |25.45656|02/02/2020|7 |null |null |
+---------+---------+--------+----------+---+----------+---------+
Then, register your function as a user-defined function such as
def haversine_distance(longitude1 : Double,longitude2 : Double,latitude1 : Double,latitude2 : Double) : Double= {
val R = 6372.8
val dlat = math.toRadians(latitude2 - latitude1)
val dlog = math.toRadians(longitude2 - longitude1)
val a = math.sin(dlat / 2) * math.sin(dlat / 2) + math.cos(math.toRadians(latitude1)) * math.cos(math.toRadians(latitude2)) * math.sin(dlog / 2) * math.sin(dlog / 2)
val c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
val distance = R * c
distance
}
spark.udf.register("haversine_distance", haversine_distance(_: Double, _: Double, _: Double, _: Double): Double)
Finally, you can use this function in the spark SQL:
df2.withColumn("haversine_distance", expr("haversine_distance(Longitude, Longitude2, Latitude, Latitude2)"))
.show(false)
that gives the final result:
+---------+---------+--------+----------+---+----------+---------+---------------------+
|VehicleID|Longitude|Latitude|Date |id |Longitude2|Latitude2|haversine_distance |
+---------+---------+--------+----------+---+----------+---------+---------------------+
|12311 |55.55431 |25.45631|01/02/2020|0 |null |null |null |
|12311 |55.55432 |25.45634|01/02/2020|1 |55.55431 |25.45631 |0.0034846437813896825|
|12311 |55.55433 |25.45637|02/02/2020|2 |null |null |null |
|12311 |55.55431 |25.45621|02/02/2020|3 |55.55433 |25.45637 |0.017909203100004076 |
|12309 |55.55427 |25.45627|01/02/2020|4 |null |null |null |
|12309 |55.55436 |25.45655|02/02/2020|5 |null |null |null |
|12412 |55.55441 |25.45657|01/02/2020|6 |null |null |null |
|12412 |55.55442 |25.45656|02/02/2020|7 |null |null |null |
+---------+---------+--------+----------+---+----------+---------+---------------------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.