
Unable to use filter on an iterable RDD in pyspark

I'm trying to apply a function that filters out certain values in one dataset based on ranges of data in another dataset. I've performed a few groupBys and joins, so the parameter I'm passing into the function is a pair of Iterables, as follows:

g1 = g0.map(lambda x: timefilter(x[0]))

where x[0] is (<pyspark.resultiterable.ResultIterable object at 0x23b6610>, <pyspark.resultiterable.ResultIterable object at 0x23b6310>)
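For context, this is the shape that cogroup produces. A hypothetical minimal reproduction (the sample data and keys below are stand-ins, since the original pipeline isn't shown):

from pyspark import SparkContext

sc = SparkContext("local", "timefilter-demo")

# Stand-in datasets; the real ones come from the question's groupBys/joins
ranges = sc.parallelize([("a", (0, 10)), ("b", (5, 15))])
events = sc.parallelize([("a", (1, "u", 2, 3, "helloworld"))])

# cogroup pairs each key with two ResultIterables, giving the
# (ResultIterable, ResultIterable) structure described above
g0 = ranges.cogroup(events)
print(g0.mapValues(lambda v: (list(v[0]), list(v[1]))).collect())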

When I enter the function timefilter, I need to filter out the values in x[1] based on the values in x[0]. But when I try the following (on both twoList and twoRDD, although I only show twoList here):

def timefilter(RDDList):
    oneList = list(RDDList[0])  # materialize the first ResultIterable
    twoList = list(RDDList[1])  # materialize the second ResultIterable
    twoRDD = RDDList[1]         # the raw ResultIterable itself
    # fails: neither a Python list nor a ResultIterable has a .filter method
    test = twoList.filter(lambda x: x[4]=='helloworld')
    return test

It gives me the following error: AttributeError: 'ResultIterable' object has no attribute 'filter', followed by a long traceback.

It seems as though I can't use filter on any form of these iterables, but I feel like I'm missing something very simple. Is there a transformation I'm missing in the function?

Turns out that you can't call filter on a grouped iterable: a ResultIterable isn't an RDD, so RDD transformations such as .filter() aren't available on it inside a map closure. I just used Python's built-in filter function instead, something along the lines of the following: filter(lambda x: x[1] in oneList, twoList).
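For completeness, a minimal sketch of the corrected function under that approach (the RDDList parameter, field indexes, and oneList/twoList names are carried over from the question; the list() wrapper is an addition for Python 3, where filter returns a lazy iterator):

def timefilter(RDDList):
    oneList = list(RDDList[0])  # values to test membership against
    twoList = list(RDDList[1])  # records to be filtered
    # built-in filter works on any Python iterable; list() is needed on
    # Python 3, where filter returns a lazy iterator
    return list(filter(lambda x: x[1] in oneList, twoList))

With this, g1 = g0.map(lambda x: timefilter(x[0])) from the question should work unchanged, returning a plain list per record.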
