flatMap over list of custom objects in pyspark
I'm getting an error when running flatMap() on a list of objects of a class. It works fine for regular Python data types like int, list etc., but I'm facing an error when the list contains objects of my class. Here's the entire code:
from pyspark import SparkContext

sc = SparkContext("local", "WordCountBySparkKeyword")

def func(x):
    if x == 2:
        return [2, 3, 4]
    return [1]

rdd = sc.parallelize([2])
rdd = rdd.flatMap(func)  # rdd.collect() now has [2, 3, 4]
rdd = rdd.flatMap(func)  # rdd.collect() now has [2, 3, 4, 1, 1]
print rdd.collect()  # gives expected output -- marker 1
# Class I'm defining
class node(object):
    def __init__(self, value):
        self.value = value

    # Representation, for printing node
    def __repr__(self):
        return self.value

def foo(x):
    if x.value == 2:
        return [node(2), node(3), node(4)]
    return [node(1)]

rdd = sc.parallelize([node(2)])
rdd = rdd.flatMap(foo)  # marker 2
print rdd.collect()  # rdd.collect() should contain nodes with values [2, 3, 4, 1, 1]
The code works fine till marker 1 (commented in the code). The problem arises after marker 2. The specific error message I'm getting is:

AttributeError: 'module' object has no attribute 'node'

How do I resolve this error?

I'm working on Ubuntu, running PySpark 1.4.1.
The error you get is completely unrelated to flatMap. If you define the node class in your main script, it is accessible on the driver, but it is not distributed to the workers. To make it work, you should place the node definition inside a separate module and make sure it is distributed to the workers.
Create a separate module with the node definition; let's call it node.py.
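For reference, the module can be as small as this (a sketch; the class body is copied from the question, with __repr__ already returning a string as discussed at the end of this answer):

```python
# node.py -- holds the class definition so Spark can ship it to the workers
class node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        # Return a string; returning self.value directly breaks for non-strings
        return "node({0})".format(repr(self.value))
```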
Import the node class inside your main script:

from node import node
Make sure the module is distributed to the workers:
sc.addPyFile("node.py")
Now everything should work as expected.
On a side note: the __repr__ method should return a string representation of the object. At least make sure it returns a string, but a proper representation is even better:
def __repr__(self):
    return "node({0})".format(repr(self.value))
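A quick illustration of why the original version breaks (a sketch; bad_node mirrors the question's class, good_node uses the fix above):

```python
class bad_node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return self.value  # returns an int here -- not a valid __repr__

class good_node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "node({0})".format(repr(self.value))

try:
    repr(bad_node(2))
except TypeError as e:
    print("bad:", e)  # __repr__ returned non-string (type int)

print(repr(good_node(2)))            # node(2)
print([good_node(2), good_node(3)])  # [node(2), node(3)] -- containers use repr
```

Note that printing a list of objects (as rdd.collect() does) calls __repr__ on each element, which is why the broken version surfaces quickly.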