How do I process an RDD with a Python class?

Question

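I am implementing a model in Spark as a Python class, and any time I try to map a class method over an RDD it fails. My actual code is more complicated, but this simplified version gets at the heart of the problem: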

class model(object):
    def __init__(self):
        self.data = sc.textFile('path/to/data.csv')
        # other misc setup
    def run_model(self):
        self.data = self.data.map(self.transformation_function)
    def transformation_function(self,row):
        row = row.split(',')
        return row[0]+row[1]

Now, if I run the model like this (for example):

test = model()
test.run_model()
test.data.take(10)

I get the following error:

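Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that runs on workers. For more information, see SPARK-5063.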

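I have played around with this a bit, and it seems to occur reliably whenever I try to map a class method over an RDD from within the class. I have confirmed that the mapped function works fine if I implement it outside of a class structure, so the problem definitely has to do with the class. Is there a way to resolve this?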

Answer 1

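The problem here is a bit more subtle than using nested RDDs or performing Spark actions inside transformations. Spark does not allow access to the SparkContext inside an action or transformation.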

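Even if you never access it explicitly, it is referenced inside the closure, which has to be serialized and shipped to the workers. Your transformation method references self, self holds the RDD in self.data, and that RDD holds the SparkContext, hence the error.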
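
The reference chain can be seen without a cluster. Below is a minimal plain-Python sketch; FakeContext, FakeRDD, and Model are hypothetical stand-ins, not PySpark classes. It only illustrates that a bound method carries its instance in __self__, and that the instance carries whatever it holds.

```python
class FakeContext(object):
    """Hypothetical stand-in for SparkContext (a driver-only object)."""


class FakeRDD(object):
    """Hypothetical stand-in for an RDD; it keeps a reference to its context."""

    def __init__(self, ctx):
        self.ctx = ctx


class Model(object):
    def __init__(self, ctx):
        self.data = FakeRDD(ctx)

    def transformation_function(self, row):
        row = row.split(',')
        return row[0] + row[1]


ctx = FakeContext()
m = Model(ctx)

# A bound method is the function plus the instance it is bound to.
bound = m.transformation_function

# Follow the references: method -> instance -> RDD stand-in -> context.
reaches_context = bound.__self__.data.ctx is ctx
print(reaches_context)  # True
```

Anything that serializes the bound method therefore has to deal with the whole instance, including the context it reaches.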

One way to handle this is to use a static method:

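class model(object):
    @staticmethod
    def transformation_function(row):
        # Static method: no reference to self, so nothing drags the SparkContext along.
        row = row.split(',')
        return row[0]+row[1]

    def __init__(self):
        # sc is the existing SparkContext, as in the question.
        self.data = sc.textFile('some.csv')

    def run_model(self):
        self.data = self.data.map(model.transformation_function)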
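
The question already notes that the same function works when it is defined outside the class. A minimal sketch of that variant follows. The sample rows are made up, and the built-in map stands in for rdd.map so the snippet runs without Spark.

```python
def transformation_function(row):
    """Concatenate the first two comma-separated fields of a line."""
    fields = row.split(',')
    return fields[0] + fields[1]


# Made-up sample rows standing in for the lines of the CSV file.
rows = ['a,b,c', '1,2,3']

# The built-in map stands in for rdd.map here; with a real RDD the call
# would be rdd.map(transformation_function).
result = list(map(transformation_function, rows))
print(result)  # ['ab', '12']
```

Like the static method, a module-level function has no self, so there is no instance for it to carry along.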

Edit:

If you want to be able to access instance variables, you can try something like this:

class model(object):
    @staticmethod
    def transformation_function(a_model):
        # Copy the needed value into a local so the inner function captures
        # only delim, not the model instance.
        delim = a_model.delim
        def _transformation_function(row):
            return row.split(delim)
        return _transformation_function

    def __init__(self):
        self.delim = ','
        # sc is the existing SparkContext, as in the question.
        self.data = sc.textFile('some.csv')

    def run_model(self):
        self.data = self.data.map(model.transformation_function(self))
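
The factory approach relies on the inner function closing over delim only, not over a_model. Here is a small plain-Python check of that claim, with the Spark-specific self.data line left out so it runs anywhere:

```python
class model(object):
    @staticmethod
    def transformation_function(a_model):
        # Copy the needed value into a local before defining the inner function.
        delim = a_model.delim

        def _transformation_function(row):
            return row.split(delim)

        return _transformation_function

    def __init__(self):
        self.delim = ','


fn = model.transformation_function(model())

# Names of the variables the inner function captured from the enclosing scope.
captured = fn.__code__.co_freevars
print(captured)     # ('delim',)
print(fn('a,b,c'))  # ['a', 'b', 'c']
```

Had the inner function used a_model.delim directly, a_model would show up among the captured names, and the original error would likely come back.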

Answer 1 (14, accepted, 2015-09-11 01:30:13)
