
PySpark ReduceByKey


I have been trying to make this work for a while, but have failed every time. I have 2 files. One has a list of names:

Name1
Name2
Name3
Name4

The other is a list of values associated with the names for each day of the year, over several years:

['0.1,0.2,0.3,0.4',
 '0.5,0.6,0.7,0.8', 
 '10,1000,0.2,5000'
  ...]

The goal is to have output like:

Name1: [0.1, 0.5, 10]
Name2: [0.2, 0.6, 1000]
Name3: [0.3, 0.7, 0.2]
Name4: [0.4, 0.8, 5000]
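Viewed locally, the desired transformation is just a transpose of the per-day rows into per-name columns. A minimal plain-Python sketch of that idea, using the sample data from the question (no Spark involved):

```python
names = ["Name1", "Name2", "Name3", "Name4"]
rows = ['0.1,0.2,0.3,0.4',
        '0.5,0.6,0.7,0.8',
        '10,1000,0.2,5000']

# Parse each comma-separated row into a list of floats.
parsed = [[float(v) for v in row.split(',')] for row in rows]

# zip(*parsed) transposes rows into columns; pair each column with its name.
by_name = {name: list(col) for name, col in zip(names, zip(*parsed))}
```

This only works when every row has exactly one value per name, in the same order, which matches the structure described in the question.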

And then plot a histogram for each. I wrote a mapper that creates a list of tuples and produces the following output (this is an RDD object):

[[('Name1', [0.1]),('Name2', [0.2]),('Name3', [0.3]),('Name4', [0.4])],
[('Name1', [0.5]),('Name2', [0.6]),('Name3', [0.7]),('Name4', [0.8])],
[('Name1', [10]),('Name2', [1000]),('Name3', [0.2]),('Name4', [5000])]]

Now I need to concatenate all values for each name into a single list, but every map-by-key/value approach I have attempted returns a wrong result.

You can simply loop through each sublist and build a dictionary from it using dict.setdefault() . Example -

>>> ll = [[('Name1', [0.1]),('Name2', [0.2]),('Name3', [0.3]),('Name4', [0.4])],
... [('Name1', [0.5]),('Name2', [0.6]),('Name3', [0.7]),('Name4', [0.8])],
... [('Name1', [10]),('Name2', [1000]),('Name3', [0.2]),('Name4', [5000])]]
>>> d = {}
>>> for i in ll:
...     for tup in i:
...             d.setdefault(tup[0],[]).extend(tup[1])
...
>>> import pprint
>>> pprint.pprint(d)
{'Name1': [0.1, 0.5, 10],
 'Name2': [0.2, 0.6, 1000],
 'Name3': [0.3, 0.7, 0.2],
 'Name4': [0.4, 0.8, 5000]}

For a PySpark RDD object, try a simple reduce function such as -

func = lambda x,y: x+y

Then pass it to the reduceByKey method -

rdd.reduceByKey(func)
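Because reduceByKey operates on an RDD of (key, value) pairs, the nested structure above would first need flattening with flatMap. Spark isn't available in this answer, so the following plain-Python sketch simulates what flatMap followed by reduceByKey(func) computes; the sample data is made up to mirror the question's structure:

```python
# Stand-in for an RDD whose elements are lists of (name, [value]) tuples.
nested = [
    [('Name1', [0.1]), ('Name2', [0.2])],
    [('Name1', [0.5]), ('Name2', [0.6])],
]

func = lambda x, y: x + y  # list concatenation, the same func given to reduceByKey

# Step 1: flatten, as rdd.flatMap(lambda batch: batch) would.
pairs = [tup for batch in nested for tup in batch]

# Step 2: combine values per key, as reduceByKey(func) would.
merged = {}
for key, value in pairs:
    merged[key] = func(merged[key], value) if key in merged else value
```

In actual PySpark this whole pipeline would be `rdd.flatMap(lambda batch: batch).reduceByKey(func)`.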

Per the comments, the OP actually has a list of RDD objects (not a single RDD object). In that case you can convert each RDD to a list by calling .collect() , apply the logic above, and then decide whether you want the result as a Python dictionary or as an RDD object. If you want an RDD, call dict.items() to get the key-value pairs and pass them to sc.parallelize . Example -

d = {}
for i in ll:
    c = i.collect()          # pull this RDD's (name, [values]) pairs to the driver
    for tup in c:
        d.setdefault(tup[0], []).extend(tup[1])

rddobj = sc.parallelize(d.items())
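The collect-then-merge step can be checked without a SparkContext by using plain lists as stand-ins for what each .collect() call would return (the data below is hypothetical):

```python
# Each inner list stands in for one rdd.collect() result.
collected_batches = [
    [('Name1', [0.1]), ('Name2', [0.2])],
    [('Name1', [0.5]), ('Name2', [0.6])],
]

d = {}
for c in collected_batches:
    for name, values in c:
        d.setdefault(name, []).extend(values)

pairs = list(d.items())  # this is what sc.parallelize(d.items()) would distribute
```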
