
How can I use reduceByKey to count occurrences of a column value in a list column?

I have an RDD, userMovies, whose rows have this layout:

 (userID, (movieID, list of movieIDs))
[(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))],...

I want to count how many times the value in the second column appears in the list in the third column.

userMovies.reduceByKey(lambda v : 1 if v[1][0] in v[1][1] else 0).take(1)

My idea was to produce an RDD of 1s and 0s with reduceByKey and then sum the whole RDD, that is, add up all the 1s. But reduceByKey returns the same RDD unchanged instead of giving a 1 or 0.

EDIT:

userMovies.map(lambda tup: (tup[0], (tup[1][0], 1 if tup[1][0] in tup[1][1] else 0))).reduceByKey(lambda a,b: a[1][1]+b[1][1]).take(2)

This returns:

[(43450, (84152, 0)), (60830, (345, 0))]

I need a single total, [(totalsum)], not one value per key.

Have you tried just using Counter()?

from collections import Counter

a = [(69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))]
k = a[0][1][0]
i = a[0][1][1]
r = Counter(i)[k]  # occurrences of k in i; i.count(k) would also work

>>> print(k, r)
296 1
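
The same counting idea can be extended from one row to many rows and a single grand total. This is a minimal plain-Python sketch, not Spark code: only the first row comes from the question, the other two rows are made up for illustration, and the names `rows`, `per_row` and `total` are assumptions.

```python
# Rows in the question's (userID, (movieID, [movieIDs])) shape.
# Only the first row is from the question; the other two are hypothetical.
rows = [
    (69120, (296, [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587])),
    (1, (50, [10, 20, 30])),   # movieID absent from its list
    (2, (7, [7, 7, 9])),       # movieID present twice
]

# Occurrences of each row's movieID in that row's own list.
per_row = [movies.count(movie_id) for _, (movie_id, movies) in rows]

# One grand total across all rows, not one value per key.
total = sum(per_row)

print(per_row)  # [1, 0, 2]
print(total)    # 3
```

Note that `list.count` counts repeats, whereas an `in` test would only contribute 0 or 1 per row; which one is wanted depends on whether a movie can appear in the list more than once.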

If you map your RDD to the second tuple element, you should be able to reduce by key:

rdd = sc.parallelize([(69120, (296,
        [296, 356, 2858, 608, 588, 1580, 597, 153, 4306, 587]))])

rdd = rdd.map(lambda tup: (tup[0], (tup[1][0], 1 if tup[1][0] in tup[1][1] else 0)))

This keeps the same tuple structure, but replaces the movie list with a flag: 1 if the movie ID appears in the list, otherwise 0:

rdd.collect()

That code outputs:

[(69120, (296, 1))]

That is, each row is now (userID, (movieID, 1 if found else 0)).

If you need to compute the total number of views per movie (for all users):

rdd.map(lambda l: l[1])\
   .reduceByKey(lambda a,b: a+b)\
   .collect()

With the single-row sample above, the result contains just that one movie:

[(296, 1)]
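
It may help to see why the reduceByKey attempt in the question's EDIT handed the rows back unchanged. reduceByKey merges the values that share a key, so when every key occurs only once the supplied function is never called. The sketch below imitates that behaviour in plain Python; `reduce_by_key` is a hypothetical local helper, not a Spark API, and the movie-keyed flags are made up.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey (not a Spark API)."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce(func, [value for _, value in group]))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Keyed by userID, every key occurs once, so the function is never called
# and each (movieID, flag) value comes back untouched.
by_user = [(43450, (84152, 0)), (60830, (345, 0))]
unchanged = reduce_by_key(by_user, lambda a, b: a[1][1] + b[1][1])

# Keyed by movieID (made-up flags), repeated keys are actually combined.
by_movie = [(296, 1), (345, 0), (296, 1)]
per_movie = reduce_by_key(by_movie, lambda a, b: a + b)

# A single grand total needs no key at all: just add up the flags.
total = sum(flag for _, flag in by_movie)

print(unchanged)  # [(43450, (84152, 0)), (60830, (345, 0))]
print(per_movie)  # [(296, 2), (345, 0)]
print(total)      # 2
```

In PySpark itself, one way to get the single total asked for in the question would presumably be to map each row to its 0/1 flag and call the RDD's `sum()`, skipping reduceByKey altogether.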
