
How to reduceByKey?

I'm using the Spark console in the Cloudera QuickStart VM.

An output file is provided below. It displays the first 20 records. Each record is a combination of a TV channel name and its corresponding count of viewers. There are several hundred records.

The goal is to group this RDD (channel_views) by the TV channel name so that each record contains a unique TV channel name together with the sum of its corresponding viewer counts.

channel_views = joined_dataset.map(extract_channel_views)

Below is the code I'm struggling with to produce the desired output described above:

def some_function(a,b):
  some_result = a + b
  return some_result

channel_views.reduceByKey(some_function).collect()

Output of the code below:

channel_views.take(20)

[(1038, u'DEF'),  
 (1038, u'CNO'),  
 (1038, u'CNO'),  
 (1038, u'NOX'),  
 (1038, u'MAN'),  
 (1038, u'MAN'),  
 (1038, u'XYZ'),  
 (1038, u'BAT'),  
 (1038, u'CAB'),  
 (1038, u'DEF'),  
 (415, u'DEF'),  
 (415, u'CNO'),  
 (415, u'CNO'),  
 (415, u'NOX'),  
 (415, u'MAN'),  
 (415, u'MAN'),  
 (415, u'XYZ'),  
 (415, u'BAT'),  
 (415, u'CAB'),  
 (415, u'DEF')]

You are working off of a dataset that is backwards. Use map (or change your extract) to swap the tuples from (count, name) to (name, count).

The byKey methods use the first item of the tuple as the key, so as-is your code will concatenate strings, keying on the count.

I don't know Python, so I did this in Scala; you can convert it to Python (one possible conversion is sketched after the Scala session below). So here you go:

scala> val input = sc.parallelize(Seq((1038, "DEF"),
     | (1038, "CNO"),
     | (1038, "CNO"),
     | (1038, "NOX"),
     | (1038, "MAN"),
     | (1038, "MAN"),
     | (1038, "XYZ"),
     | (1038, "BAT"),
     | (1038, "CAB"),
     | (1038, "DEF"),
     | (415, "DEF"),
     | (415, "CNO"),
     | (415, "CNO"),
     | (415, "NOX"),
     | (415, "MAN"),
     | (415, "MAN"),
     | (415, "XYZ"),
     | (415, "BAT"),
     | (415, "CAB"),
     | (415, "DEF"))
     | )
input: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[12] at parallelize at <console>:22

scala> val data = input.map( v => (v._2,v._1) )
data: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[13] at map at <console>:24

scala> data.foreach(println)
(BAT,1038)
(DEF,415)
(CNO,415)
(BAT,415)
(CAB,415)
(DEF,415)
(MAN,1038)
(XYZ,1038)
(CNO,1038)
(NOX,1038)
(DEF,1038)
(MAN,1038)
(CNO,415)
(MAN,415)
(CAB,1038)
(XYZ,415)
(NOX,415)
(CNO,1038)
(MAN,415)
(DEF,1038)

scala> val result = data.reduceByKey( (x,y) => x+y)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[14] at reduceByKey at <console>:26

scala> result.foreach(println)
(NOX,1453)
(MAN,2906)
(CNO,2906)
(CAB,1453)
(DEF,2906)
(BAT,1453)
(XYZ,1453)

scala>
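
A possible Python conversion of the Scala session above, as a rough sketch (the SparkContext setup here is only illustrative; in the pyspark shell, sc already exists, and the sample data simply mirrors the pairs from the question):

from pyspark import SparkContext

sc = SparkContext(appName="channel-views")  # not needed inside the pyspark shell, where sc is predefined

input_rdd = sc.parallelize([
    (1038, u'DEF'), (1038, u'CNO'), (1038, u'CNO'), (1038, u'NOX'),
    (1038, u'MAN'), (1038, u'MAN'), (1038, u'XYZ'), (1038, u'BAT'),
    (1038, u'CAB'), (1038, u'DEF'),
    (415, u'DEF'), (415, u'CNO'), (415, u'CNO'), (415, u'NOX'),
    (415, u'MAN'), (415, u'MAN'), (415, u'XYZ'), (415, u'BAT'),
    (415, u'CAB'), (415, u'DEF')])

# swap (count, name) -> (name, count) so the channel name becomes the key
data = input_rdd.map(lambda v: (v[1], v[0]))

# sum the viewer counts per channel name
result = data.reduceByKey(lambda x, y: x + y)

for pair in result.collect():
    print(pair)

Expected pairs (order may vary): ('NOX', 1453), ('MAN', 2906), ('CNO', 2906), ('CAB', 1453), ('DEF', 2906), ('BAT', 1453), ('XYZ', 1453).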

Here is the pyspark code:

for i in channel_views.map(lambda rec: (rec[1], rec[0])).reduceByKey(lambda acc, value: acc + value).collect(): print(i)
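
Note that the map swaps each record to (name, count) so that reduceByKey sums the viewer counts per channel, and collect() brings the reduced pairs back to the driver so they can be printed.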
