I have an RDD userMovies
userID, movieID, list of movieIDs
[(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))],...
I want to count the number of times the value in the 2nd column appears in the list in the 3rd column.
userMovies.reduceByKey(lambda v : 1 if v[1][0] in v[1][1] else 0).take(1)
I tried to build an RDD of 1s and 0s with reduceByKey and then sum all the 1s to get the total. But reduceByKey returns the same RDD back and doesn't give a 1 or 0.
EDIT:
userMovies.map(lambda tup: (tup[0], (tup[1][0], 1 if tup[1][0] in tup[1][1] else 0))).reduceByKey(lambda a,b: a[1][1]+b[1][1]).take(2)
RETURNS
[(43450, (84152, 0)), (60830, (345, 0))]
I need only a single total, [(totalsum)], not one result per key.
Have you tried just using Counter()?
from collections import Counter
a = [(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))]
k = a[0][1][0]
i = a[0][1][1]
r = Counter(i)[k]  # occurrences of k in i; i.count(k) would also work
print(k, r)
296 1
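The snippet above handles one record. To cover every record and end up with a single number, the same per-record count can be summed. This is only a minimal sketch in plain Python, assuming the record shape from the question; the sample records and the helper names `occurrences` and `present` are made up for illustration. It also shows the difference between counting occurrences and counting presence (1 or 0).

```python
# Hypothetical sample data in the question's (userID, (movieID, [movieIDs])) shape.
records = [
    (69120, (296, [296, 356, 2858, 608, 588])),
    (43450, (84152, [1, 2, 3])),
    (60830, (345, [345, 345, 7])),
]

def occurrences(record):
    """How many times the record's movieID appears in its list."""
    _user_id, (movie_id, movie_ids) = record
    return movie_ids.count(movie_id)

def present(record):
    """1 if the record's movieID appears in its list at least once, else 0."""
    _user_id, (movie_id, movie_ids) = record
    return 1 if movie_id in movie_ids else 0

total_occurrences = sum(occurrences(r) for r in records)
total_present = sum(present(r) for r in records)
print(total_occurrences, total_present)  # 3 2
```

On an RDD, something like userMovies.map(present).sum() should express the same idea, since map and sum are both RDD methods; I have not run that against the asker's data.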
If you first map each record so that the movie list is replaced by a 1/0 flag, you should then be able to reduce by key:
rdd = sc.parallelize([(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))])
rdd = rdd.map(lambda tup: (tup[0], (tup[1][0], 1 if tup[1][0] in tup[1][1] else 0)))
This keeps the same tuple structure, but replaces the movie list with 1 if the movie ID occurs in it and 0 otherwise:
rdd.collect()
That code outputs:
[(69120, (296, 1))]
That is (userID, (movieID, 1 if found else 0))
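This structure also hints at why the reduceByKey attempts in the question seemed to hand the input back unchanged. As far as I understand it, reduceByKey only calls the function to merge two values that share a key, so when every userID is unique the function never runs and each value passes through as is. The sketch below is a plain-Python stand-in for that behaviour, not Spark itself; `reduce_by_key` and `add_flags` are hypothetical helpers written for this illustration.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge the values of each key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # reduce() on a one-element list returns that element without calling func.
    return [(key, reduce(func, values)) for key, values in grouped.items()]

calls = []

def add_flags(a, b):
    calls.append((a, b))  # record every time the merge function is actually used
    return a + b

# Keyed by userID: every key is unique, so the merge function is never called.
by_user = [(43450, (84152, 0)), (60830, (345, 0))]
unchanged = reduce_by_key(by_user, add_flags)
calls_for_unique_keys = len(calls)
print(unchanged == by_user, calls_for_unique_keys)  # True 0

# Keyed by movieID: repeated keys are merged into one value per movie.
by_movie = [(296, 1), (296, 1), (345, 0)]
merged = reduce_by_key(by_movie, add_flags)
print(merged)  # [(296, 2), (345, 0)]
```

That would also explain why the lambda in the question's edit did not raise an error even though it indexes into an integer: with unique keys it was never invoked.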
If you need to compute the total number of views per movie (for all users):
rdd.map(lambda l: l[1])\
.reduceByKey(lambda a,b: a+b)\
.collect()
With the single-record sample above, the result contains just that one movie:
[(296, 1)]
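The edit in the question asks for one overall total rather than a value per key. Reducing by key cannot give that, because it always keeps one entry per key. A minimal sketch of the alternative, assuming the (userID, (movieID, flag)) records produced by the map step above: drop the keys, keep only the flag, and add the flags up. The sample values in `flagged` are invented, and plain Python is used so the sketch runs without a cluster.

```python
# Hypothetical output of the map step: (userID, (movieID, flag)).
flagged = [
    (69120, (296, 1)),
    (43450, (84152, 0)),
    (60830, (345, 1)),
]

# Keep only the flag from each record, then sum the flags.
# On an RDD the same idea could be written as
#   rdd.map(lambda tup: tup[1][1]).sum()
# since map and sum are both RDD methods.
flags = [tup[1][1] for tup in flagged]
total = sum(flags)
print([total])  # [2]
```

If the flag is not needed for anything else, filtering the original records on the membership test and calling count() on the result should give the same number.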