
How to process RDDs using a Python class?

I'm implementing a model in Spark as a Python class, and any time I try to map a class method to an RDD it fails. My actual code is more complicated, but this simplified version gets at the heart of the problem:

class model(object):
    def __init__(self):
        # sc is assumed to be an existing SparkContext
        # (e.g. the one created by the pyspark shell)
        self.data = sc.textFile('path/to/data.csv')
        # other misc setup
    def run_model(self):
        self.data = self.data.map(self.transformation_function)
    def transformation_function(self, row):
        row = row.split(',')
        return row[0] + row[1]

Now, if I run the model like so (for example):

test = model()
test.run_model()
test.data.take(10)

I get the following error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I've played with this a bit, and it seems to occur reliably any time I try to map a class method to an RDD within the class. I have confirmed that the mapped function works fine if I implement it outside of a class structure, so the problem definitely has to do with the class. Is there a way to resolve this?

The problem here is a little more subtle than using nested RDDs or performing Spark actions inside transformations: Spark doesn't allow access to the SparkContext inside an action or transformation.

Even if you don't access it explicitly, it is referenced inside the closure and has to be serialized and shipped to the workers. This means that your transformation_function method, which references self, captures the SparkContext as well, hence the error.
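
Why does self get captured? In Python, a bound method carries a reference to its instance, so pickling the method means pickling everything the instance holds. A minimal sketch (assuming the test instance from the question):

    f = test.transformation_function  # a bound method
    f.__self__ is test                # True: serializing f means serializing test,
                                      # including test.data (an RDD) and, through it,
                                      # the SparkContext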

One way to handle this is to use a static method:

class model(object):
    @staticmethod
    def transformation_function(row):
        row = row.split(',')
        return row[0] + row[1]

    def __init__(self):
        self.data = sc.textFile('some.csv')

    def run_model(self):
        # referencing the function through the class, not the instance,
        # keeps self out of the serialized closure
        self.data = self.data.map(model.transformation_function)
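
With this version, the driver code from the question should run without the SPARK-5063 error, since the closure shipped to the workers no longer references self:

    test = model()
    test.run_model()
    test.data.take(10)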

Edit:

If you want to be able to access instance variables you can try something like this:

class model(object):
    @staticmethod
    def transformation_function(a_model):
        # copy the needed attribute into a local variable so the
        # returned closure captures a plain string, not the model itself
        delim = a_model.delim
        def _transformation_function(row):
            return row.split(delim)
        return _transformation_function

    def __init__(self):
        self.delim = ','
        self.data = sc.textFile('some.csv')

    def run_model(self):
        self.data = self.data.map(model.transformation_function(self))
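
An equivalent, slightly shorter pattern (a sketch under the same assumptions) is to copy the attribute into a local variable inside run_model, so the lambda closes over a plain value rather than over self:

    def run_model(self):
        delim = self.delim  # local copy: the lambda captures a string, not self
        self.data = self.data.map(lambda row: row.split(delim))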
