
PySpark function behaving differently between mapValues and filter

I'm working in Spark using PySpark. I have an RDD of the format [(key, (num, (min, max, count))), ...]. When I use the lambda below

t = fullBids.filter(lambda (value, stats): (stats[2] > 10 and stats[0] < value and value < stats[1]))

it errors out with

tuple index out of range

But when I use the same lambda in a mapValues call, it runs successfully, returning True or False correctly.

ti = fullBids.mapValues(lambda (value, stats): (stats[2] > 10 and stats[0] < value and value < stats[1]))

I would expect the filter to work, but it doesn't. Can someone explain what I'm missing here?

If you decompose your RDD format as the filter lambda unpacks it:

(key, (num, (min, max, count)))

key = value
(num, (min, max, count)) = stats
num = stats[0]
(min, max, count) = stats[1]
min = stats[1][0]
max = stats[1][1]
count = stats[1][2]

So stats has only two elements, and stats[2] is out of range.
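
Given that decomposition, one way to make the filter work is to unpack the whole (key, (num, (min, max, count))) element that filter receives. This sketch assumes Python 2, since tuple parameters in lambdas (as in your original code) are a SyntaxError in Python 3:

# Unpack the full element that filter receives: (key, (num, (min, max, count)))
# Python 2 only: tuple-unpacking lambda parameters were removed in Python 3
t = fullBids.filter(lambda (key, (value, stats)):
                    stats[2] > 10 and stats[0] < value < stats[1])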

When you call filter, value is bound to the key of the key-value pair while stats is bound to the whole value, (num, (min, max, count)); that's why you get a tuple index out of range.

When you call mapValues, value is num while stats is (min, max, count). In fact, the mapValues transformation passes only the value of each key-value pair to the function, leaving the key untouched.
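
To make the difference concrete, here is a minimal, self-contained sketch; the sample data and the SparkContext setup are assumptions for illustration, and both expressions index explicitly so they work on Python 2 and 3:

from pyspark import SparkContext

sc = SparkContext("local", "filter-vs-mapValues")

# Elements have the shape (key, (num, (min, max, count)))
fullBids = sc.parallelize([
    ("bidA", (50, (10, 100, 25))),
    ("bidB", (5, (10, 100, 3))),
])

# filter passes the whole (key, value) pair, so index through kv[1]
t = fullBids.filter(lambda kv: kv[1][1][2] > 10 and kv[1][1][0] < kv[1][0] < kv[1][1][1])

# mapValues passes only the value, i.e. (num, (min, max, count))
ti = fullBids.mapValues(lambda v: v[1][2] > 10 and v[1][0] < v[0] < v[1][1])

print(t.collect())   # [('bidA', (50, (10, 100, 25)))]
print(ti.collect())  # [('bidA', True), ('bidB', False)]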
