
More efficient method in Spark than filter.count?

I have an assignment in which I have an RDD in Spark whose records look like the following:

[(id, group), {'token1', 'token2'...}]

As an example, `tokenizedTweetsByUser.take(5)` returns:

[(('470520068', 3),  {'#berniesanders',   '#goldmansachs',   '$',   '.',   '/',   '4',   'a',   'adorned',   'bc',   'capitalist',   'class',   "doesn't",   'he',   "i'm",   'pig',   'ride',   'rigged',   'system',   'voting',   'w',   'war'}), (('2176120173', 6),  {'!',   '#america',   '#trump',   '#votetrump',   '&',   '.',   ':',   ';',   '@realdonaldtrump',   '@trumpnewmedia',   'amp',   'change',   "don't",   'get',   'htt',   'if',   "it's",   'nothing',   'out',   'rt',   'simple',   'that',   'will',   'you',   '…'}), (('145087572', 3),  {'!',   '#colorado',   '#denver',   '%',   ',',   '-',   '.',   '1',   '11am',   '1pm',   ':',   '@allonmedicare',   '@berniesanders',   '@libertea2012',   '@rockportbasset',   'america',   'and',   'capitol',   'co',   'endorse',   'for',   'herself',   'hillary',   'http',   'icymi',   'in',   'is',   'leading',   'liar',   'mst',   'only',   'out',   'positive',   'progressive',   'proof',   'rt',   's',   'state',   'that',   'the',   'to',   'today',   'voices',   'wake-up',   'weasel',   '’',   '…'}), (('23047147', 6),  {'@madworldnews',   '[',   ']',   'after',   'bernie',   'deal',   'fans',   'had',   'liberal',   'pour',   'supporter',   'tears',   'to',   'trump',   'via',   'vid',   'with'}), (('526506000', 4),  {'.',   ':',   '@justinamash',   '@tedcruz',   'calls',   'candidate',   'cartel',   'correctly',   'he',   'i',   'is',   'on',   'only',   'remaining',   'rt',   'take',   'the',   'to',   'trust',   'washington',   'what',   '…'})]

The tokens come from tweets. Given a list of the top 100 tokens, I need to count how many times each of those tokens is found in each group. There are 8 groups.

My implementation is pretty simple:

    tokenizedTweetsByUser.cache()
    groupCounts = []
    for i in range(8):
        groupCounts.append([])
        for token in tokensList:
            # the following statement takes too long!
            item_count = tokenizedTweetsByUser.filter(lambda x: (x[0][1] == i) and (token in x[1])).count()
            if item_count > 0:
                groupCounts[i].append((token, item_count))

But this takes too long. I understand that filter.count is going to run 800 times (8 groups × 100 tokens), but since it's just a filter and a count, and we're looking up the token in a set, I expected it to be fairly performant.

Can someone suggest another method that would be more performant?

It would appear that reduceByKey is far more efficient than looping over many instances of filter.count, since each filter.count is a separate pass over the RDD.

In this case, flatten each record so that every token becomes its own row, combine the group and the token into a string that serves as the key, and then perform reduceByKey:

    tokenizedTweetsByUser.flatMapValues(lambda x: x).map(lambda x: (str(x[0][1])+" "+str(x[1]), 1)).reduceByKey(lambda x, y: x + y)

This ran orders of magnitude faster.
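The one-liner above counts every token and builds a string key. As a rough sketch of a variant that only counts the top-100 tokens from the question and keeps `(group, token)` as a tuple key, something like the following might work. The helper names `token_pairs`, `count_tokens_by_group` and `to_group_lists` are hypothetical, and the code assumes group ids are the integers 0 to 7, as implied by `range(8)` in the question.

```python
from operator import add


def token_pairs(record, top_tokens):
    """Turn one record into ((group, token), 1) pairs, keeping only top tokens."""
    (_user_id, group), tokens = record
    return [((group, token), 1) for token in tokens if token in top_tokens]


def count_tokens_by_group(rdd, top_tokens):
    """Count records per (group, token) in a single pass over the RDD."""
    top = frozenset(top_tokens)  # small set, shipped to executors in the closure
    return rdd.flatMap(lambda rec: token_pairs(rec, top)).reduceByKey(add)


def to_group_lists(counts, n_groups=8):
    """Reshape {(group, token): count} into the groupCounts list of lists."""
    # Assumption: group ids are integers in range(n_groups).
    group_counts = [[] for _ in range(n_groups)]
    for (group, token), n in sorted(counts.items()):
        group_counts[group].append((token, n))
    return group_counts


# Hypothetical usage with the names from the question:
# counts = count_tokens_by_group(tokenizedTweetsByUser, tokensList).collectAsMap()
# groupCounts = to_group_lists(counts)
```

Since there are at most 8 × 100 distinct keys, collecting the result to the driver with `collectAsMap()` should be cheap, and tokens with a zero count simply never appear, which matches the `item_count > 0` check in the original loop.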
