
Transformation of data into a list of objects of a class in Spark Scala

I am trying to write Spark transformation code to convert the data below into a list of objects of the following class. I am totally new to Scala and Spark; I tried splitting the data and putting the fields into a case class, but I was unable to merge them back together. I would appreciate your help with this.

Data:

FirstName,LastName,Country,match,Goals
Cristiano,Ronaldo,Portugal,Match1,1
Cristiano,Ronaldo,Portugal,Match2,1
Cristiano,Ronaldo,Portugal,Match3,0
Cristiano,Ronaldo,Portugal,Match4,2
Lionel,Messi,Argentina,Match1,1
Lionel,Messi,Argentina,Match2,2
Lionel,Messi,Argentina,Match3,1
Lionel,Messi,Argentina,Match4,2

Desired output:

PlayerStats { String FirstName,
    String LastName,
    String Country,
    Map <String,Int> matchandscore
}

Assuming you have already loaded the data into an RDD[String] named data:

case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

import org.apache.spark.rdd.RDD

val result: RDD[PlayerStats] = data
  .filter(!_.startsWith("FirstName"))     // drop the header line
  .map(_.split(","))                      // split each CSV line
  .map { case Array(fn, ln, cntry, mn, g) =>   // map into case classes
    PlayerStats(fn, ln, cntry, Map(mn -> g.toInt))
  }
  .keyBy(p => (p.FirstName, p.LastName))  // key by player
  .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore))
  .map(_._2)                              // drop the key

First convert each line into a key/value pair, say (Cristiano, rest of data); then apply groupByKey (reduceByKey also works); finally, convert the grouped key/value data into your class by filling in the values. The well-known word-count program is a good template for this.
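The key-by/merge idea above can be sketched with plain Scala collections, no Spark needed, to see the logic before porting it to an RDD (groupBy here plays the role of groupByKey). The sample data and field names below are illustrative:

```scala
// Each player's rows are grouped by (firstName, lastName), then their
// single-entry match maps are merged into one map per player.
case class PlayerStats(firstName: String, lastName: String, country: String, matchAndScore: Map[String, Int])

val lines = List(
  "Cristiano,Ronaldo,Portugal,Match1,1",
  "Cristiano,Ronaldo,Portugal,Match2,1",
  "Lionel,Messi,Argentina,Match1,1"
)

val stats: List[PlayerStats] = lines
  .map(_.split(","))
  .map { case Array(fn, ln, c, m, g) => PlayerStats(fn, ln, c, Map(m -> g.toInt)) }
  .groupBy(p => (p.firstName, p.lastName))   // like groupByKey on an RDD
  .values
  .map(_.reduce((a, b) => a.copy(matchAndScore = a.matchAndScore ++ b.matchAndScore)))
  .toList
// one PlayerStats per player, with all matches merged into one map
```

In Spark the same merge step is the reduceByKey call shown in the first answer; the collection version is just easier to test locally.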

http://spark.apache.org/examples.html

You could try something as follows:

val file = sc.textFile("myfile.csv")

import spark.implicits._   // needed for toDF (spark is the SparkSession)

val df = file.map(line => line.split(",")).                        // split line by comma
              filter(lineSplit => lineSplit(0) != "FirstName").    // filter out header row
              map(lineSplit => {                                   // transform lines
                (lineSplit(0), lineSplit(1), lineSplit(2), Map(lineSplit(3) -> lineSplit(4).toInt))}).
              toDF("FirstName", "LastName", "Country", "MatchAndScore")

df.schema
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true))

df.show

+---------+--------+---------+----------------+
|FirstName|LastName|  Country|   MatchAndScore|
+---------+--------+---------+----------------+
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match1 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match2 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match3 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match4 -> 2)|
+---------+--------+---------+----------------+
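Note that this DataFrame still has one row (and one single-entry map) per match, not one object per player as the question asks. The remaining merge step can be sketched with plain Scala collections; the Row case class and sample values below are illustrative, not Spark's Row type:

```scala
// One record per match, as in the DataFrame above.
case class RowT(firstName: String, lastName: String, country: String, matchAndScore: Map[String, Int])

val rows = List(
  RowT("Cristiano", "Ronaldo", "Portugal", Map("Match1" -> 1)),
  RowT("Cristiano", "Ronaldo", "Portugal", Map("Match2" -> 1)),
  RowT("Lionel", "Messi", "Argentina", Map("Match1" -> 1)),
  RowT("Lionel", "Messi", "Argentina", Map("Match2" -> 2))
)

// Group by player and flatten the per-match maps into one map each.
val combined: List[RowT] = rows
  .groupBy(r => (r.firstName, r.lastName, r.country))
  .map { case ((fn, ln, c), rs) => RowT(fn, ln, c, rs.flatMap(_.matchAndScore).toMap) }
  .toList
```

In Spark this final step could be done by going back to the typed Dataset/RDD API and reducing by player key, as the first answer does.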
