

What are the proper indexes for this Spark/Scala code?

I'm just getting started with Spark and Scala. I wrote the code below from scratch by hand, but it is pretty close to an example I was working from. When I run it, I keep getting errors that seemingly conflict with each other as I make changes to the code. I'm looking to add up the number of miles driven, grouped by the purpose of the trip. Pretty simple, but no matter what indexes I set the fields to, it never seems happy. If I set it to (fields(6).toString, fields(5).toFloat), I get an out of bounds exception. If I set it to (fields(5).toString, fields(4).toFloat), those are very obviously the wrong index values. Here is the scheme of the data:

start date: date time
end date: date time
category: string
start: string
stop: string
miles: float
purpose: string

Below is the code:

package net.massstreet

import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.log4j._

object InitializeSparkApp {

  /** Convert input data to (customerID, amountSpent) tuples */
  def extractCustomerPricePairs(line: String) = {
    val fields = line.split(",")
    (fields(5).toString, fields(4).toFloat)
  }

  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.ERROR)

    val sc = new SparkContext("local[*]", "First App")

    val data = sc.textFile("data/uber_data.csv")

    val mappedInput = data.map(extractCustomerPricePairs)

    val totalMilesByPurpose = mappedInput.reduceByKey((x, y) => x + y)

    totalMilesByPurpose.foreach(println)

  }

}

In case your data lacks miles or purpose, as in

start date, end date, category, start, stop, , 
start date, end date, category, start, stop, miles,

the following code will not read the empty values at the end of the lines:

val fields = line.split(",")

You can pass -1 to read the empty values up to the end of the lines, as

val fields = line.split(",", -1)
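To see the difference, here is a minimal sketch in plain Scala (no Spark needed); the sample line is a made-up record matching your scheme, with an empty purpose at the end:

val line = "1/1/2016 21:11,1/1/2016 21:17,Business,Fort Pierce,Fort Pierce,5.1,"

line.split(",").length       // 6 -- the trailing empty purpose field is dropped
line.split(",", -1).length   // 7 -- the trailing empty purpose field is kept as ""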

Looking at your scheme start date: date time, end date: date time, category: string, start: string, stop: string, miles: float, purpose: string,

(fields(6).toString, fields(5).toFloat) seems correct, since when you split a line it is converted to an Array, which starts from index 0. So, to be safer, you can use Try or Option while returning the tuple:

 (Try(fields(6)) getOrElse("Empty"), Try(fields(5).toFloat) getOrElse(0F))

OR

(Option(fields(6)) getOrElse("Empty"), Option(fields(5).toFloat) getOrElse(0F))
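Note that Option(...) will not catch an exception thrown while evaluating its argument, so the Try version is the safer of the two here, and it needs import scala.util.Try. Putting it together, your mapper could look like the sketch below (the "Empty" and 0F defaults are just placeholder values):

import scala.util.Try

def extractCustomerPricePairs(line: String) = {
  // split with -1 so trailing empty fields (e.g. a missing purpose) are kept
  val fields = line.split(",", -1)
  // purpose is fields(6), miles is fields(5); fall back to the defaults on bad rows
  (Try(fields(6)).getOrElse("Empty"), Try(fields(5).toFloat).getOrElse(0F))
}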


 