
Read a large CSV file and split it according to conditions using Scala / Spark

I'm new to Scala/Spark and I'm not sure how to phrase this kind of question (I don't know the technical term for it). I have a large CSV file. I want to read it into a DataFrame, split it into several blocks according to a condition on its columns, and then apply the treatment I want to each block.

Example of my csv file

VehicleID         Longitude    Latitude     Date
 12311            55.55431     25.45631     01/02/2020
 12311            55.55432     25.45634     01/02/2020
 12311            55.55433     25.45637     02/02/2020
 12311            55.55431     25.45621     02/02/2020
 12309            55.55427     25.45627     01/02/2020
 12309            55.55436     25.45655     02/02/2020
 12412            55.55441     25.45657     01/02/2020
 12412            55.55442     25.45656     02/02/2020

Two of the column headings are VehicleID and Date. I would like to split the large CSV into multiple blocks so that each block contains the data belonging to one unique (VehicleID, Date) pair.

Like that

 Block 1
 VehicleID         Longitude    Latitude     Date
  12311            55.55431     25.45631     01/02/2020
  12311            55.55432     25.45634     01/02/2020

 Block 2
 VehicleID        Longitude    Latitude     Date
  12311            55.55433     25.45637     02/02/2020
  12311            55.55431     25.45621     02/02/2020

 Block 3
 VehicleID        Longitude    Latitude     Date
 12309            55.55427     25.45627     01/02/2020

 Block 4
 VehicleID        Longitude    Latitude     Date
 12309            55.55436     25.45655     02/02/2020

I also want to apply this function to each block:

def haversine_distance(longitude1: Double, longitude2: Double, latitude1: Double, latitude2: Double): Double = {
  val R = 6372.8 // Earth's mean radius in kilometres
  val dlat = math.toRadians(latitude2 - latitude1)
  val dlon = math.toRadians(longitude2 - longitude1)
  val a = math.sin(dlat / 2) * math.sin(dlat / 2) +
    math.cos(math.toRadians(latitude1)) * math.cos(math.toRadians(latitude2)) *
      math.sin(dlon / 2) * math.sin(dlon / 2)
  val c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  R * c // great-circle distance in kilometres
}

How can I do this with Scala?

Thanks.

Your function needs four input values, so your DataFrame also needs those four values on each row in order to calculate the distance. This can be achieved with a Window and the lag function.
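For completeness, here is a minimal sketch of how `df` could be created in the first place. The file path `vehicles.csv` and the local master setting are assumptions for testing; in practice you would point the reader at your own large CSV:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("vehicle-blocks")
  .master("local[*]") // assumption: local run; drop this on a real cluster
  .getOrCreate()

// Read the CSV with a header row; inferSchema parses Longitude/Latitude as doubles.
// "vehicles.csv" is a placeholder path.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("vehicles.csv")
```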

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("VehicleID", "Date").orderBy("id")

val df2 = df.withColumn("id", monotonically_increasing_id)
  .withColumn("Longitude2", lag("Longitude", 1).over(w))
  .withColumn("Latitude2", lag("Latitude", 1).over(w))
  .orderBy("id")
df2.show(false)

The result is:

+---------+---------+--------+----------+---+----------+---------+
|VehicleID|Longitude|Latitude|Date      |id |Longitude2|Latitude2|
+---------+---------+--------+----------+---+----------+---------+
|12311    |55.55431 |25.45631|01/02/2020|0  |null      |null     |
|12311    |55.55432 |25.45634|01/02/2020|1  |55.55431  |25.45631 |
|12311    |55.55433 |25.45637|02/02/2020|2  |null      |null     |
|12311    |55.55431 |25.45621|02/02/2020|3  |55.55433  |25.45637 |
|12309    |55.55427 |25.45627|01/02/2020|4  |null      |null     |
|12309    |55.55436 |25.45655|02/02/2020|5  |null      |null     |
|12412    |55.55441 |25.45657|01/02/2020|6  |null      |null     |
|12412    |55.55442 |25.45656|02/02/2020|7  |null      |null     |
+---------+---------+--------+----------+---+----------+---------+

Then, register your function as a user-defined function:

def haversine_distance(longitude1: Double, longitude2: Double, latitude1: Double, latitude2: Double): Double = {
  val R = 6372.8 // Earth's mean radius in kilometres
  val dlat = math.toRadians(latitude2 - latitude1)
  val dlon = math.toRadians(longitude2 - longitude1)
  val a = math.sin(dlat / 2) * math.sin(dlat / 2) +
    math.cos(math.toRadians(latitude1)) * math.cos(math.toRadians(latitude2)) *
      math.sin(dlon / 2) * math.sin(dlon / 2)
  val c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  R * c // great-circle distance in kilometres
}
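As a quick sanity check of the function itself (plain Scala, no Spark needed): two points on the same meridian that are 0.1 degrees of latitude apart should be roughly 11 km apart:

```scala
// 0.1 degree of latitude along a meridian is about 11.12 km.
val d = haversine_distance(55.0, 55.0, 25.0, 25.1)
println(f"$d%.2f km") // ~11.12 km

// Identical points give zero distance.
assert(haversine_distance(55.0, 55.0, 25.0, 25.0) == 0.0)
```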

spark.udf.register("haversine_distance", haversine_distance(_: Double, _: Double, _: Double, _: Double): Double)

Finally, you can use this function in Spark SQL:

df2.withColumn("haversine_distance", expr("haversine_distance(Longitude, Longitude2, Latitude, Latitude2)"))
   .show(false)

which gives the final result:

+---------+---------+--------+----------+---+----------+---------+---------------------+
|VehicleID|Longitude|Latitude|Date      |id |Longitude2|Latitude2|haversine_distance   |
+---------+---------+--------+----------+---+----------+---------+---------------------+
|12311    |55.55431 |25.45631|01/02/2020|0  |null      |null     |null                 |
|12311    |55.55432 |25.45634|01/02/2020|1  |55.55431  |25.45631 |0.0034846437813896825|
|12311    |55.55433 |25.45637|02/02/2020|2  |null      |null     |null                 |
|12311    |55.55431 |25.45621|02/02/2020|3  |55.55433  |25.45637 |0.017909203100004076 |
|12309    |55.55427 |25.45627|01/02/2020|4  |null      |null     |null                 |
|12309    |55.55436 |25.45655|02/02/2020|5  |null      |null     |null                 |
|12412    |55.55441 |25.45657|01/02/2020|6  |null      |null     |null                 |
|12412    |55.55442 |25.45656|02/02/2020|7  |null      |null     |null                 |
+---------+---------+--------+----------+---+----------+---------+---------------------+
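If the per-block treatment you want is, for example, the total distance travelled by each vehicle on each date, you can aggregate `df2` by the same keys. This is a sketch under the assumption that a total per block is what you need; the column names `step_km` and `total_km` are made up for illustration:

```scala
import org.apache.spark.sql.functions._

val totals = df2
  .withColumn("step_km", expr("haversine_distance(Longitude, Longitude2, Latitude, Latitude2)"))
  .groupBy("VehicleID", "Date")
  // sum ignores the null produced by lag on the first row of each block
  .agg(sum("step_km").alias("total_km"))

totals.show(false)
```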
