Scala spark reduce by key and find common value
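I have a CSV data file stored in a sequenceFile on HDFS, in the format name, zip, country, fav_food1, fav_food2, fav_food3, fav_colour. There may be many entries with the same name, and I need to find out what their favourite food is (i.e. count all the food entries across all records with that name and return the most popular one). I am new to Scala and Spark and have gone through multiple tutorials and scoured the forums, but I am stuck on how to proceed. So far I have read the sequence file, converted its Text values to String, and filtered the entries.
Here are some sample data entries, one line of the file per row: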
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
So the output should be the tuple (Bob, Soda), since Soda appears the most times in Bob's entries.
import org.apache.hadoop.io._
var lines = sc.sequenceFile("path", classOf[LongWritable], classOf[Text]).values.map(x => x.toString())
// converted to String since I could not get filter to run on Text; .values drops the LongWritable keys
var filtered = lines.filter(_.split(",")(0) == "Bob")
// removed the entries of all other users
var f_tuples = filtered.map(line => line.split(","))
// split all the values
var f_simple = f_tuples.map(line => (line(0), (line(3), line(4), line(5))))
// removed unnecessary fields
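My problem now is that I think I have a [<name,[f,f,f]>] structure and do not know how to flatten it and get the most popular food. I need to combine all the entries so that I have a single entry per name, and then get the most common element in its value. Any help would be appreciated. Thanks.
I tried the following to flatten it, but it seems that the more I try, the more convoluted the data structure becomes.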
var f_trial = f_simple.groupBy(_._1).mapValues(_.map(_._2))
// the resulting structure was of type org.apache.spark.rdd.RDD[(String, Iterable[(String, String, String)])]
This is what a println of one record looks like after f_trial:
("Bob", List((Pizza, Soda,), (Chocolate, Cheese, Soda), (Chocolate, Pizza, Soda)))
Bracket breakdown:
("Bob",
List(
(Pizza, Soda, <missing value>),
(Chocolate, Cheese, Soda),
(Chocolate, Pizza, Soda)
) // ends List paren
) // ends first paren
I found some time. Setup:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)
val data = """
Bob,123,USA,Pizza,Soda,,Blue
Bob,456,UK,Chocolate,Cheese,Soda,Green
Bob,12,USA,Chocolate,Pizza,Soda,Yellow
Mary,68,USA,Chips,Pasta,Chocolate,Blue
""".trim
val records = sc.parallelize(data.split('\n'))
Extract the food choices and, for each one, make a tuple ((name, food), 1):
val r2 = records.flatMap { r =>
val Array(name, id, country, food1, food2, food3, color) = r.split(',');
List(((name, food1), 1), ((name, food2), 1), ((name, food3), 1))
}
Total up each name/food combination:
val r3 = r2.reduceByKey((x, y) => x + y)
Remap so that the name (alone) is the key:
val r4 = r3.map { case ((name, food), total) => (name, (food, total)) }
At each step, pick the food with the higher count:
val res = r4.reduceByKey((x, y) => if (y._2 > x._2) y else x)
And we're done:
println(res.collect().mkString)
//(Mary,(Chips,1))(Bob,(Soda,3))
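Edit: to collect all the food items that share the highest count, we only need to change the last two steps (r4 and res).
Start with a List of items, together with the total: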
val r5 = r3.map { case ((name, food), total) => (name, (List(food), total)) }
In the case of a tie, concatenate the lists of food items that have that score:
val res2 = r5.reduceByKey((x, y) => if (y._2 > x._2) y
else if (y._2 < x._2) x
else (y._1:::x._1, y._2))
//(Mary,(List(Chocolate, Pasta, Chips),1))
//(Bob,(List(Soda),3))
If you want the top 3, say, then use aggregateByKey to assemble a list of each person's favourite foods instead of the second reduceByKey.
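The aggregateByKey suggestion could be sketched roughly as follows. This is only an illustration, not the answerer's code: the object name TopFoods, the helper topN, the cut-off n = 3, and the hard-coded totals (shaped like r3 above) are all assumptions. It also assumes a Spark release where the pair-RDD implicits are picked up automatically; on older releases the `import org.apache.spark.SparkContext._` line from the setup above may still be needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TopFoods {
  /** Keep only the n highest-scoring (food, total) pairs (hypothetical helper). */
  def topN(xs: List[(String, Int)], n: Int): List[(String, Int)] =
    xs.sortBy { case (_, total) => -total }.take(n)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("top-foods-sketch").setMaster("local")
    val sc = new SparkContext(conf)

    // Assumed input: ((name, food), total) pairs, the same shape as r3.
    val totals = sc.parallelize(Seq(
      (("Bob", "Soda"), 3), (("Bob", "Pizza"), 2), (("Bob", "Chocolate"), 2),
      (("Bob", "Cheese"), 1), (("Mary", "Chips"), 1)))

    val n = 3 // assumed number of favourites to keep per person

    val top = totals
      .map { case ((name, food), total) => (name, (food, total)) }
      .aggregateByKey(List.empty[(String, Int)])(
        (acc, item) => topN(item :: acc, n), // fold one item into the per-partition list
        (a, b) => topN(a ::: b, n))          // merge the lists of two partitions

    top.collect().foreach(println)
    sc.stop()
  }
}
```

Trimming to n entries inside both functions keeps the per-key list small, so only short lists are shuffled. How ties at the cut-off are ordered is left unspecified in this sketch.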
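The solutions provided by Paul and mattinbits shuffle your data twice: once to perform the reduce by name and food, and once to reduce by name. It is possible to solve this problem with only one shuffle.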
/** Generate a (key, food-count map) pair from a split line */
def bitsToKeyMapPair(xs: Array[String]): (String, Map[String, Long]) = {
val key = xs(0)
val map = xs
.drop(3) // Drop name..country
.take(3) // Take food
.filter(_.trim.size !=0) // Ignore empty
.map((_, 1L)) // Generate k-v pairs
.toMap // Convert to Map
.withDefaultValue(0L) // Set default
(key, map)
}
/**Combine two count maps**/
def combine(m1: Map[String, Long], m2: Map[String, Long]): Map[String, Long] = {
(m1.keys ++ m2.keys).map(k => (k, m1(k) + m2(k))).toMap.withDefaultValue(0L)
}
val n: Int = ??? // Number of favorite per user
val records = lines.map(line => bitsToKeyMapPair(line.split(",")))
records.reduceByKey(combine).mapValues(_.toSeq.sortBy(-_._2).take(n))
If you are not a purist, you can replace scala.collection.immutable.Map with scala.collection.mutable.Map to further improve performance.
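One way the mutable variant might look is sketched below. The names MutableCombine and combineInPlace are hypothetical. The function updates its first argument in place instead of building a new map for every merge. Spark's documentation explicitly permits modifying and returning the first argument for aggregateByKey; with reduceByKey this is only safe if the input maps are not reused elsewhere (for example in a cached RDD), so treat that as an assumption to verify for your job.

```scala
import scala.collection.mutable

object MutableCombine {
  /** Adds the counts of m2 into m1 in place and returns m1 (m2 is left unchanged). */
  def combineInPlace(m1: mutable.Map[String, Long],
                     m2: mutable.Map[String, Long]): mutable.Map[String, Long] = {
    m2.foreach { case (food, count) =>
      m1(food) = m1.getOrElse(food, 0L) + count
    }
    m1
  }

  def main(args: Array[String]): Unit = {
    val a = mutable.Map("Pizza" -> 1L, "Soda" -> 1L)
    val b = mutable.Map("Chocolate" -> 1L, "Soda" -> 1L)
    println(combineInPlace(a, b)) // a now also holds b's counts
  }
}
```

To use it, bitsToKeyMapPair would have to build a mutable.Map per line as well; that change is not shown here.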
Here is a complete example:
import org.apache.spark.{SparkContext, SparkConf}
object Main extends App {
val data = List(
"Bob,123,USA,Pizza,Soda,,Blue",
"Bob,456,UK,Chocolate,Cheese,Soda,Green",
"Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
"Mary,68,USA,Chips,Pasta,Chocolate,Blue")
val sparkConf = new SparkConf().setMaster("local").setAppName("example")
val sc = new SparkContext(sparkConf)
val lineRDD = sc.parallelize(data)
val pairedRDD = lineRDD.map { line =>
val fields = line.split(",")
(fields(0), List(fields(3), fields(4), fields(5)).filter(_ != ""))
}.filter(_._1 == "Bob")
/*pairedRDD.collect().foreach(println)
(Bob,List(Pizza, Soda))
(Bob,List(Chocolate, Cheese, Soda))
(Bob,List(Chocolate, Pizza, Soda))
*/
val flatPairsRDD = pairedRDD.flatMap {
case (name, foodList) => foodList.map(food => ((name, food), 1))
}
/*flatPairsRDD.collect().foreach(println)
((Bob,Pizza),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Cheese),1)
((Bob,Soda),1)
((Bob,Chocolate),1)
((Bob,Pizza),1)
((Bob,Soda),1)
*/
val nameFoodSumRDD = flatPairsRDD.reduceByKey((a, b) => a + b)
/*nameFoodSumRDD.collect().foreach(println)
((Bob,Cheese),1)
((Bob,Soda),3)
((Bob,Pizza),2)
((Bob,Chocolate),2)
*/
val resultsRDD = nameFoodSumRDD.map{
case ((name, food), count) => (name, (food,count))
}.groupByKey.map{
case (name, foodCountList) => (name, foodCountList.toList.sortBy(_._2).reverse.head)
}
resultsRDD.collect().foreach(println)
/*
(Bob,(Soda,3))
*/
sc.stop()
}
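To sanity-check any of the Spark pipelines in this thread on a small sample, the same counting can be reproduced with plain Scala collections and no cluster. The sketch below is not part of any answer; LocalCheck and favourites are hypothetical names. It skips empty food fields and breaks ties arbitrarily.

```scala
object LocalCheck {
  val data = List(
    "Bob,123,USA,Pizza,Soda,,Blue",
    "Bob,456,UK,Chocolate,Cheese,Soda,Green",
    "Bob,12,USA,Chocolate,Pizza,Soda,Yellow",
    "Mary,68,USA,Chips,Pasta,Chocolate,Blue")

  /** Most frequent food per name; ties are resolved arbitrarily by maxBy. */
  def favourites(lines: List[String]): Map[String, (String, Int)] =
    lines
      .map(_.split(","))
      // one (name, food) pair per non-empty food field (columns 3 to 5)
      .flatMap(f => f.slice(3, 6).filter(_.nonEmpty).map(food => (f(0), food)).toList)
      .groupBy { case (name, _) => name }
      .map { case (name, pairs) =>
        // count how often each food occurs for this name
        val counts = pairs.groupBy(_._2).map { case (food, ps) => (food, ps.size) }
        name -> counts.maxBy(_._2)
      }

  def main(args: Array[String]): Unit =
    favourites(data).foreach(println)
}
```

For the sample rows, Bob's entry is (Soda, 3), matching the Spark results above. Mary's three foods are tied at 1, so which one is returned is not defined.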