
Unable to use filter on an iterable RDD in pyspark

I'm trying to apply a function that filters out certain values in one dataset based on ranges of data in another dataset. I've performed a few groupBys and joins, so the parameter I'm passing into the function is a pair of Iterables, as follows:

g1 = g0.map(lambda x: timefilter(x[0]))

where x[0] is (<pyspark.resultiterable.ResultIterable object at 0x23b6610>, <pyspark.resultiterable.ResultIterable object at 0x23b6310>)
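For context, this is the shape that cogroup produces. A hypothetical minimal reproduction (the sample data and keys below are stand-ins, since the original pipeline isn't shown):

from pyspark import SparkContext

sc = SparkContext("local", "timefilter-demo")

# Stand-in datasets; the real ones come from the question's groupBys/joins
ranges = sc.parallelize([("a", (0, 10)), ("b", (5, 15))])
events = sc.parallelize([("a", (1, "u", 2, 3, "helloworld"))])

# cogroup pairs each key with two ResultIterables, giving the
# (ResultIterable, ResultIterable) structure described above
g0 = ranges.cogroup(events)
print(g0.mapValues(lambda v: (list(v[0]), list(v[1]))).collect())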

When I enter the function timefilter, I need to filter out the values in x[1] based on the values in x[0]. But when I try the following (on both twoList and twoRDD, although I only show twoList here):

def timefilter(RDDList):
    oneList = list(RDDList[0])  # materialize the first ResultIterable
    twoList = list(RDDList[1])  # materialize the second ResultIterable
    twoRDD = RDDList[1]         # the raw ResultIterable itself
    # fails: neither a Python list nor a ResultIterable has a .filter method
    test = twoList.filter(lambda x: x[4]=='helloworld')
    return test

It gives me the following error: AttributeError: 'ResultIterable' object has no attribute 'filter', followed by a long traceback.

It seems as though I can't use filter on any form of these iterables, but I feel like I'm missing something very simple. Is there a transformation I'm missing in the function?

Turns out that you can't call filter on a grouped iterable: a ResultIterable isn't an RDD, so RDD transformations such as .filter() aren't available on it inside a map closure. I just used Python's built-in filter function instead, something along the lines of the following: filter(lambda x: x[1] in oneList, twoList).
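For completeness, a minimal sketch of the corrected function under that approach (the RDDList parameter, field indexes, and oneList/twoList names are carried over from the question; the list() wrapper is an addition for Python 3, where filter returns a lazy iterator):

def timefilter(RDDList):
    oneList = list(RDDList[0])  # values to test membership against
    twoList = list(RDDList[1])  # records to be filtered
    # built-in filter works on any Python iterable; list() is needed on
    # Python 3, where filter returns a lazy iterator
    return list(filter(lambda x: x[1] in oneList, twoList))

With this, g1 = g0.map(lambda x: timefilter(x[0])) from the question should work unchanged, returning a plain list per record.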
