How can I use reduceByKey to count occurrences of a column value in a column list?

Question

I have an RDD userMovies

 userID, movieID, list of movieIDs
[(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))],...

I wish to count the number of times the 2nd column value appears in the list in the 3rd column.

userMovies.reduceByKey(lambda v : 1 if v[1][0] in v[1][1] else 0).take(1)

I tried to build an RDD of 1s and 0s with reduceByKey and then sum that RDD to get the total number of 1s. But reduceByKey returns the same RDD back and doesn't give a 1 or 0.

EDIT:

userMovies.map(lambda tup: (tup[0], (tup[1][0], 1 if tup[1][0] in tup[1][1] else 0))).reduceByKey(lambda a,b: a[1][1]+b[1][1]).take(2)

RETURNS

[(43450, (84152, 0)), (60830, (345, 0))]

I need only a single value, [(totalsum)], not one row per key.

Answer 1

Have you tried using just Counter() ?

from collections import Counter

a = [(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))]
k = a[0][1][0]
i = a[0][1][1]
r = Counter(i)[k]  # occurrences of k in i; i.count(k) would also work

>>> print(k, r)
296 1
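
The same check extends to more than one row. As a minimal sketch in plain Python (the second sample row and the helper name `occurrences` are made up for illustration), count the movie ID in each row's list and sum the per-row counts to get the single total the question asks for. On an RDD, the same function could presumably be passed to `map` and followed by `sum()`, e.g. `userMovies.map(occurrences).sum()`.

```python
# Sample rows in the question's layout: (userID, (movieID, [movieIDs])).
# The second row is invented for illustration.
rows = [
    (69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587])),
    (11111, (50, [296, 356, 2858])),
]


def occurrences(row):
    """Return how often the row's movieID appears in the row's movie list."""
    _user_id, (movie_id, movie_ids) = row
    return movie_ids.count(movie_id)


# One count per row, then a single grand total across all rows.
per_row = [occurrences(row) for row in rows]
total = sum(per_row)
print(per_row, total)
```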

Answer 2

If you map your RDD to the second tuple element, you should be able to reduce by key:

rdd = sc.parallelize([(69120, (296, \
        [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))])

rdd = rdd.map(lambda tup: (tup[0], (tup[1][0], 1 if tup[1][0] in tup[1][1] else 0)))

This keeps the same tuple structure, but replaces the movie list with a flag: 1 if the movie ID is in the list, 0 otherwise:

rdd.collect()

That code outputs:

[(69120, (296, 1))]

That is (userID, (movieID, 1 if found else 0))

If you need to compute the total number of views per movie (for all users):

rdd.map(lambda l: l[1])\
   .reduceByKey(lambda a,b: a+b)\
   .collect()

For this single-row sample, the result contains just the one movie:

[(296, 1)]
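
The attempt in the question's edit returns its input unchanged because the function given to `reduceByKey` receives two values that share a key, not whole `(key, value)` pairs, and it is never called for a key that has only one value. The sketch below imitates that behaviour in plain Python. `reduce_by_key` is a hypothetical local stand-in, not the Spark implementation, and the second and third sample rows are invented.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey on a list of (key, value) pairs."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # functools.reduce never calls func when a key has a single value,
    # so that value comes back unchanged.
    return [(key, reduce(func, values)) for key, values in grouped.items()]


# (userID, (movieID, flag)) rows, as produced by the map step.
flagged = [(69120, (296, 1)), (43450, (84152, 0)), (60830, (296, 1))]

# Every userID is unique here, so the function is never invoked
# and the rows pass through as they are.
by_user = reduce_by_key(flagged, lambda a, b: a[1] + b[1])

# Dropping the userID makes movieID the key and the flag the value,
# so a + b adds plain integers.
by_movie = reduce_by_key([value for _key, value in flagged], lambda a, b: a + b)

# A single grand total needs no key at all: just sum the flags.
total = sum(flag for _user, (_movie, flag) in flagged)

print(by_user)
print(by_movie)
print(total)
```

On the RDD itself, the corresponding grand total would likely be `rdd.map(lambda tup: tup[1][1]).sum()`, applied after the map step that produces the flags.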

solution1
0 2018-04-04 10:34:49

solution2
0 ACCEPTED 2018-04-04 10:34:54