
PySpark function behaving differently between mapValues and filter

I'm working in Spark using PySpark. I have an RDD of the format [(key, (num, (min, max, count))), ...]. When I use the lambda below

t = fullBids.filter(lambda (value, stats): (stats[2] > 10 and stats[0] < value and value < stats[1]))

it errors out with

tuple index out of range

But when I use the same lambda in a mapValues call, it runs successfully, returning True or False correctly.

ti = fullBids.mapValues(lambda (value, stats): (stats[2] > 10 and stats[0] < value and value < stats[1]))

I would expect the filter to work, but it doesn't. Can someone explain what I'm missing here?

If you decompose your RDD format as the filter lambda unpacks it:

(key, (num, (min, max, count)))

key = value
(num, (min, max, count)) = stats
num = stats[0]
(min, max, count) = stats[1]
min = stats[1][0]
max = stats[1][1]
count = stats[1][2]

So stats has only two elements, and stats[2] is out of range.
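
Given that decomposition, one way to make the filter work is to unpack the whole (key, (num, (min, max, count))) element that filter receives. This sketch assumes Python 2, since tuple parameters in lambdas (as in your original code) are a SyntaxError in Python 3:

# Unpack the full element that filter receives: (key, (num, (min, max, count)))
# Python 2 only: tuple-unpacking lambda parameters were removed in Python 3
t = fullBids.filter(lambda (key, (value, stats)):
                    stats[2] > 10 and stats[0] < value < stats[1])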

When you call filter, value is bound to the key of the key-value pair while stats is bound to the whole value, (num, (min, max, count)); that's why you get a tuple index out of range.

When you call mapValues, value is num while stats is (min, max, count). In fact, the mapValues transformation passes only the value of each key-value pair to the function, leaving the key untouched.
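
To make the difference concrete, here is a minimal, self-contained sketch; the sample data and the SparkContext setup are assumptions for illustration, and both expressions index explicitly so they work on Python 2 and 3:

from pyspark import SparkContext

sc = SparkContext("local", "filter-vs-mapValues")

# Elements have the shape (key, (num, (min, max, count)))
fullBids = sc.parallelize([
    ("bidA", (50, (10, 100, 25))),
    ("bidB", (5, (10, 100, 3))),
])

# filter passes the whole (key, value) pair, so index through kv[1]
t = fullBids.filter(lambda kv: kv[1][1][2] > 10 and kv[1][1][0] < kv[1][0] < kv[1][1][1])

# mapValues passes only the value, i.e. (num, (min, max, count))
ti = fullBids.mapValues(lambda v: v[1][2] > 10 and v[1][0] < v[0] < v[1][1])

print(t.collect())   # [('bidA', (50, (10, 100, 25)))]
print(ti.collect())  # [('bidA', True), ('bidB', False)]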
