
What are the proper indexes for this Spark/Scala code?

I'm just getting started with Spark and Scala. I wrote the code below from scratch by hand, but it is pretty close to an example I was working from. When I run it, I keep getting errors that seemingly conflict with each other as I make changes to the code. I'm trying to add up the number of miles driven, grouped by the purpose of the trip. Pretty simple, but no matter what indexes I set the fields to, it never seems happy. If I set it to (fields(6).toString, fields(5).toFloat), I get an out of bounds exception. If I set it to (fields(5).toString, fields(4).toFloat), the index values are very obviously wrong. Here is the schema of the data:

start date: date time
end date: date time
category: string
start: string
stop: string
miles: float
purpose: string

Below is the code:

package net.massstreet

import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.log4j._

object InitializeSparkApp {



  /** Convert input data to (customerID, amountSpent) tuples */
  def extractCustomerPricePairs(line: String) = {
    val fields = line.split(",")
    (fields(5).toString, fields(4).toFloat)
  }


  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.ERROR)

    val sc = new SparkContext("local[*]", "First App")

    val data = sc.textFile("data/uber_data.csv")

    val mappedInput = data.map(extractCustomerPricePairs)

    val totalMilesByPurpose = mappedInput.reduceByKey((x, y) => (x + y))

    totalMilesByPurpose.foreach(println)

  }

}

If your data lacks miles or purpose at the end of a line, as in

start date, end date, category, start, stop, , 
start date, end date, category, start, stop, miles,

then the following code will not capture the trailing empty values:

val fields = line.split(",")

You can pass -1 as the limit so that the empty values through to the end of the line are kept:

val fields = line.split(",", -1)

Looking at your schema (start date: date time, end date: date time, category: string, start: string, stop: string, miles: float, purpose: string):

(fields(6).toString, fields(5).toFloat) seems correct, because splitting a line produces an Array whose indexes start at 0, so purpose is field 6 and miles is field 5. To be safer, you can use Try or Option when returning the tuple:

 (Try(fields(6)) getOrElse("Empty"), Try(fields(5).toFloat) getOrElse(0F))

OR

(Option(fields(6)) getOrElse("Empty"), Option(fields(5).toFloat) getOrElse(0F))
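
Note that Option only guards against null: Option(fields(6)) will still throw if index 6 does not exist, and Option(fields(5).toFloat) will still throw on a malformed number, so Try is the safer of the two here. Putting the pieces together, a minimal sketch of the extractor (the name extractMilesByPurpose is just illustrative, and Try requires import scala.util.Try) could look like:

import scala.util.Try

/** Convert a CSV line to a (purpose, miles) tuple, defaulting missing or bad values. */
def extractMilesByPurpose(line: String) = {
  val fields = line.split(",", -1)                    // keep trailing empty fields
  val purpose = Try(fields(6)).getOrElse("Empty")     // field 6: purpose
  val miles   = Try(fields(5).toFloat).getOrElse(0F)  // field 5: miles
  (purpose, miles)
}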
