I'm trying to apply a function that filters out certain values in a dataset based on ranges of data in another dataset. I've performed a few groupBys and joins, so the format of the parameter I'm passing into the function has two Iterables, and goes as follows:
g1 = g0.map(lambda x: timefilter(x[0]))
where x[0] is (<pyspark.resultiterable.ResultIterable object at 0x23b6610>, <pyspark.resultiterable.ResultIterable object at 0x23b6310>)
Inside the function timefilter, I need to filter the values in x[1] based on the values in x[0]. But when I try the following (on both twoList and twoRDD, although I just show twoList here):
def timefilter(RDDList):
    oneList = list(RDDList[0])
    twoList = list(RDDList[1])
    twoRDD = RDDList[1]
    test = twoList.filter(lambda x: x[4]=='helloworld')
    return test
It gives me the following error: AttributeError: 'ResultIterable' object has no attribute 'filter', followed by a bunch of other errors.
It seems as though I can't use filter on any form of the iterables, but I feel like I'm missing something very simple. Is there a transform I'm missing in the function?
It turns out that a ResultIterable is a plain Python iterable, not an RDD, so RDD methods such as filter are not available on it. I just used Python's built-in filter function instead, something along the lines of the following: filter(lambda x: x[1] in oneList, twoList)
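For anyone landing here, this is a minimal sketch of that approach in plain Python, so it runs without a Spark cluster. The sample data, the name grouped_pair, and the choice of field index 1 are only illustrative assumptions; it also assumes the values in the first iterable are hashable so they can go into a set. A list comprehension is used instead of filter because in Python 3 the built-in filter returns a lazy iterator rather than a list.

```python
def timefilter(grouped_pair):
    """Keep records from the second iterable whose field 1 appears in the first.

    grouped_pair is assumed to be a pair of plain iterables, which is how
    ResultIterable objects behave: they can be looped over or passed to
    list()/set(), but they have no RDD methods such as .filter().
    """
    # Materialise the first iterable once; a set gives fast membership tests.
    allowed = set(grouped_pair[0])
    # Equivalent to list(filter(lambda x: x[1] in allowed, grouped_pair[1])).
    return [record for record in grouped_pair[1] if record[1] in allowed]


# Hypothetical stand-in for one element of the grouped RDD.
sample = (iter([10, 20]), iter([("a", 10), ("b", 99), ("c", 20)]))
print(timefilter(sample))  # [('a', 10), ('c', 20)]
```

The function can then be passed to map exactly as in the question, e.g. g0.map(lambda x: timefilter(x[0])), since it only relies on the inputs being iterable.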