flatMap on a list of custom objects in PySpark

Question

I'm getting an error when running flatMap() on a list of objects of a class. It works fine for regular Python data types like int, list, etc., but I'm facing an error when the list contains objects of my class. Here's the entire code:

```
from pyspark import SparkContext

sc = SparkContext("local","WordCountBySparkKeyword")

def func(x):
    if x==2:
        return [2, 3, 4]
    return [1]

rdd = sc.parallelize([2])
rdd = rdd.flatMap(func) # rdd.collect() now has [2, 3, 4]
rdd = rdd.flatMap(func) # rdd.collect() now has [2, 3, 4, 1, 1]

print rdd.collect() # gives expected output

# Class I'm defining
class node(object):
    def __init__(self, value):
        self.value = value

    # Representation, for printing node
    def __repr__(self):
        return self.value


def foo(x):
    if x.value==2:
        return [node(2), node(3), node(4)]
    return [node(1)]

rdd = sc.parallelize([node(2)])
rdd = rdd.flatMap(foo)  #marker 2

print rdd.collect() # rdd.collect should contain nodes with values [2, 3, 4, 1, 1]
```

The code works fine up to marker 1 (commented in the code). The problem arises after marker 2. The specific error message I'm getting is AttributeError: 'module' object has no attribute 'node'. How do I resolve this error?

I'm working on Ubuntu, running PySpark 1.4.1.

Answer 1

The error you get is completely unrelated to flatMap. If you define the node class in your main script, it is accessible on the driver but it is not distributed to the workers. To make it work, you should place the node definition inside a separate module and make sure it is distributed to the workers.
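To see why the class has to be importable on the worker side, here is a minimal sketch that uses only the standard pickle module and no Spark at all. It assumes the objects reach the workers through ordinary pickle semantics, where an instance is stored as a reference to its module and class name rather than as a copy of the class body. The module name `driver_script` and the class name `Node` are hypothetical stand-ins for the main script and its class.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the driver's main script, registered as a module.
driver_mod = types.ModuleType("driver_script")
sys.modules["driver_script"] = driver_mod

# Define the class inside that module, as if it lived in the main script.
source = '''
class Node(object):
    def __init__(self, value):
        self.value = value
'''
exec(source, driver_mod.__dict__)

# Pickling an instance records the module and class *names* plus the
# instance attributes; the class body itself is not part of the payload.
payload = pickle.dumps(driver_mod.Node(2))

# Simulate a worker: a module with that name exists, but it has no Node.
del driver_mod.Node

try:
    pickle.loads(payload)
    worker_error = None
except AttributeError as exc:
    # Same family of failure as the AttributeError in the question.
    worker_error = exc
```

Unpickling fails with an AttributeError because the lookup by name finds a module without the class, which is the same kind of failure the question reports. The exact wording of the message depends on the Python version.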

1. Create a separate module with the node definition; let's call it node.py.
2. Import this node class inside your main script:
```
from node import node
```
3. Make sure the module is distributed to the workers:
```
sc.addPyFile("node.py")
```

Now everything should work as expected.

On a side note:

- PEP 8 recommends CapWords for class names. It is not a hard requirement, but it makes life easier.
- The __repr__ method should return a string representation of an object. At least make sure it is a string, but a proper representation is even better:
```
def __repr__(self):
    return "node({0})".format(repr(self.value))
```
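Both side notes can be tried out without Spark. The sketch below is only an illustration: `Node` is the CapWords spelling of the question's class, and `BrokenNode` is a hypothetical name for a copy of the original class whose __repr__ returns the raw value.

```python
class Node(object):
    """CapWords class name, as PEP 8 suggests."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        # Always return a string; repr(self.value) keeps 2 and "2" distinct.
        return "Node({0})".format(repr(self.value))


class BrokenNode(object):
    """Mirrors the question's class: __repr__ returns the value itself."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return self.value  # an int here, not a string


# Printing a list calls __repr__ on every element.
good = repr([Node(2), Node("2")])  # "[Node(2), Node('2')]"

try:
    repr(BrokenNode(2))
    broken_error = None
except TypeError as exc:
    # Python rejects a __repr__ result that is not a string.
    broken_error = exc
```

So even once the class is importable on the workers, printing the collected list would still fail with the original __repr__ whenever value is not a string.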

Answer 1: score 4, accepted, 2015-09-26 00:27:36
