
more efficient method in Spark than filter.count?

I have an assignment where I have an RDD in Spark with records that look like the following:

    [(id, group), {'token1', 'token2'...}]

As an example, tokenizedTweetsByUser.take(5) provides:

    [(('470520068', 3), {'#berniesanders', '#goldmansachs', '$', '.', '/', '4', 'a', 'adorned', 'bc', 'capitalist', 'class', "doesn't", 'he', "i'm", 'pig', 'ride', 'rigged', 'system', 'voting', 'w', 'war'}),
     (('2176120173', 6), {'!', '#america', '#trump', '#votetrump', '&', '.', ':', ';', '@realdonaldtrump', '@trumpnewmedia', 'amp', 'change', "don't", 'get', 'htt', 'if', "it's", 'nothing', 'out', 'rt', 'simple', 'that', 'will', 'you', '…'}),
     (('145087572', 3), {'!', '#colorado', '#denver', '%', ',', '-', '.', '1', '11am', '1pm', ':', '@allonmedicare', '@berniesanders', '@libertea2012', '@rockportbasset', 'america', 'and', 'capitol', 'co', 'endorse', 'for', 'herself', 'hillary', 'http', 'icymi', 'in', 'is', 'leading', 'liar', 'mst', 'only', 'out', 'positive', 'progressive', 'proof', 'rt', 's', 'state', 'that', 'the', 'to', 'today', 'voices', 'wake-up', 'weasel', '’', '…'}),
     (('23047147', 6), {'@madworldnews', '[', ']', 'after', 'bernie', 'deal', 'fans', 'had', 'liberal', 'pour', 'supporter', 'tears', 'to', 'trump', 'via', 'vid', 'with'}),
     (('526506000', 4), {'.', ':', '@justinamash', '@tedcruz', 'calls', 'candidate', 'cartel', 'correctly', 'he', 'i', 'is', 'on', 'only', 'remaining', 'rt', 'take', 'the', 'to', 'trust', 'washington', 'what', '…'})]

The tokens are from tweets. From a list of the top 100 tokens, I need to count how many times each token is found in each group. There are 8 groups.

My implementation is pretty simple:

    tokenizedTweetsByUser.cache()
    groupCounts = []
    for i in range(8):
        groupCounts.append([])
        for token in tokensList:
            # the following statement takes too long!
            item_count = tokenizedTweetsByUser.filter(lambda x: (x[0][1] == i) and (token in x[1])).count()
            if item_count > 0:
                groupCounts[i].append((token, item_count))

But this takes too long. I understand that the filter.count is going to run 800 times, but since it's just a filter and a count, and we're looking for the token in a set, I expected it to be fairly performant.

Can someone suggest another method to do this that would be more performant?

It would appear that reduceByKey is far more efficient than looping over many instances of filter.count.

In this case, combine the group and each set item into a string that serves as the key, so that each set item becomes a separate row, then perform reduceByKey:

    tokenizedTweetsByUser.flatMapValues(lambda x: x) \
        .map(lambda x: (str(x[0][1]) + " " + str(x[1]), 1)) \
        .reduceByKey(lambda x, y: x + y)

Orders of magnitude faster.
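For completeness, here is a fuller sketch of the same idea (untested; it reuses tokenizedTweetsByUser and tokensList from the question and assumes a SparkContext named sc — the broadcast variable topTokens is added here). It keys on a (group, token) tuple instead of a concatenated string, and it restricts the count to the top-100 token list, which the one-liner above does not do:

    from collections import defaultdict

    # Broadcast the top-100 token list once instead of capturing it in every closure.
    topTokens = sc.broadcast(set(tokensList))

    pairs = (
        tokenizedTweetsByUser
        .flatMapValues(lambda tokens: tokens)          # one row per ((id, group), token)
        .map(lambda x: ((x[0][1], x[1]), 1))           # key on (group, token)
        .filter(lambda x: x[0][1] in topTokens.value)  # keep only the top-100 tokens
        .reduceByKey(lambda a, b: a + b)               # one shuffle instead of 800 jobs
    )

    # Reshape into the groupCounts structure from the question:
    # one list of (token, count) pairs per group.
    groupCounts = defaultdict(list)
    for (group, token), count in pairs.collect():
        groupCounts[group].append((token, count))

Using a tuple key avoids having to split the concatenated "group token" string back apart when collecting the per-group counts.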
