
more efficient method in Spark than filter.count?

I have an assignment where I have an RDD in Spark with records that look like the following:

    [(id, group), {'token1', 'token2'...}]

As an example, tokenizedTweetsByUser.take(5) provides:

    [(('470520068', 3), {'#berniesanders', '#goldmansachs', '$', '.', '/', '4', 'a', 'adorned', 'bc', 'capitalist', 'class', "doesn't", 'he', "i'm", 'pig', 'ride', 'rigged', 'system', 'voting', 'w', 'war'}),
     (('2176120173', 6), {'!', '#america', '#trump', '#votetrump', '&', '.', ':', ';', '@realdonaldtrump', '@trumpnewmedia', 'amp', 'change', "don't", 'get', 'htt', 'if', "it's", 'nothing', 'out', 'rt', 'simple', 'that', 'will', 'you', '…'}),
     (('145087572', 3), {'!', '#colorado', '#denver', '%', ',', '-', '.', '1', '11am', '1pm', ':', '@allonmedicare', '@berniesanders', '@libertea2012', '@rockportbasset', 'america', 'and', 'capitol', 'co', 'endorse', 'for', 'herself', 'hillary', 'http', 'icymi', 'in', 'is', 'leading', 'liar', 'mst', 'only', 'out', 'positive', 'progressive', 'proof', 'rt', 's', 'state', 'that', 'the', 'to', 'today', 'voices', 'wake-up', 'weasel', '’', '…'}),
     (('23047147', 6), {'@madworldnews', '[', ']', 'after', 'bernie', 'deal', 'fans', 'had', 'liberal', 'pour', 'supporter', 'tears', 'to', 'trump', 'via', 'vid', 'with'}),
     (('526506000', 4), {'.', ':', '@justinamash', '@tedcruz', 'calls', 'candidate', 'cartel', 'correctly', 'he', 'i', 'is', 'on', 'only', 'remaining', 'rt', 'take', 'the', 'to', 'trust', 'washington', 'what', '…'})]

The tokens are from tweets. From a list of the top 100 tokens, I need to count how many times each token is found in each group. There are 8 groups.

My implementation is pretty simple:

    tokenizedTweetsByUser.cache()
    groupCounts = []
    for i in range(8):
        groupCounts.append([])
        for token in tokensList:
            # the following statement takes too long!
            item_count = tokenizedTweetsByUser.filter(lambda x: (x[0][1] == i) and (token in x[1])).count()
            if item_count > 0:
                groupCounts[i].append((token, item_count))

But this takes too long. I understand that the filter.count is going to run 800 times, but since it's just a filter and a count, and we're looking for the token in a set, I expected it to be fairly performant.

Can someone suggest another method to do this that would be more performant?

It would appear that reduceByKey is far more efficient than looping over many instances of filter.count.

In this case, combine the group and each set item into a string that serves as the key, so that each set item becomes a separate row, then perform reduceByKey:

    tokenizedTweetsByUser.flatMapValues(lambda x: x) \
        .map(lambda x: (str(x[0][1]) + " " + str(x[1]), 1)) \
        .reduceByKey(lambda x, y: x + y)

Orders of magnitude faster.
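For completeness, here is a fuller sketch of the same idea (untested; it reuses tokenizedTweetsByUser and tokensList from the question and assumes a SparkContext named sc — the broadcast variable topTokens is added here). It keys on a (group, token) tuple instead of a concatenated string, and it restricts the count to the top-100 token list, which the one-liner above does not do:

    from collections import defaultdict

    # Broadcast the top-100 token list once instead of capturing it in every closure.
    topTokens = sc.broadcast(set(tokensList))

    pairs = (
        tokenizedTweetsByUser
        .flatMapValues(lambda tokens: tokens)          # one row per ((id, group), token)
        .map(lambda x: ((x[0][1], x[1]), 1))           # key on (group, token)
        .filter(lambda x: x[0][1] in topTokens.value)  # keep only the top-100 tokens
        .reduceByKey(lambda a, b: a + b)               # one shuffle instead of 800 jobs
    )

    # Reshape into the groupCounts structure from the question:
    # one list of (token, count) pairs per group.
    groupCounts = defaultdict(list)
    for (group, token), count in pairs.collect():
        groupCounts[group].append((token, count))

Using a tuple key avoids having to split the concatenated "group token" string back apart when collecting the per-group counts.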
