% wordcount in Spark Streaming (Python)
In the next example I'm receiving a sequence of words from Kafka:
('cat')
('dog')
('rat')
('dog')
My objective is to calculate the historical percentage of each word. I will have two RDDs, one with the historical word count and another with the total of all words:
values = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})

def updatefunc(new_value, last_value):
    if last_value is None:
        last_value = 0
    return sum(new_value, last_value)

words = values.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

historic = words.updateStateByKey(updatefunc) \
    .transform(lambda rdd: rdd.sortBy(lambda kv: kv[0]))

totalNo = words.map(lambda x: x[1]) \
    .reduce(lambda a, b: a + b) \
    .map(lambda x: ('totalsum', x)) \
    .updateStateByKey(updatefunc) \
    .map(lambda x: x[1])
Now I'm trying to compute ((historic value of each key) / totalNo) * 100 to get the percentage of each word:
solution = historic.map(lambda x: (x[0], x[1] * 100 / totalNo))
But I get the error: 但是我得到了错误:
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063
How can I fix the value of totalNo so that I can use it in operations on other RDDs?
Finally, this approach can work as well:
from operator import add

words = KafkaUtils.createDirectStream(ssc, topics=['test'], kafkaParams={'bootstrap.servers': 'localhost:9092'}) \
    .map(lambda x: x[1]).flatMap(lambda x: list(x))

historic = words.map(lambda x: (x, 1)).updateStateByKey(lambda x, y: sum(x) + (y or 0))

def func(rdd):
    if not rdd.isEmpty():
        totalNo = rdd.map(lambda x: x[1]).reduce(add)
        rdd = rdd.map(lambda x: (x[0], x[1] / totalNo))
    return rdd

solution = historic.transform(func)
solution.pprint()
Is this what you want?
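As a quick sanity check of the division step inside `func`, the same computation on a plain Python list behaves as expected; the running counts below are made up to match the example words from the question. Multiply by 100 if you want a percentage rather than a fraction:

```python
from operator import add
from functools import reduce

# hypothetical historical state: (word, running count)
historic = [('cat', 1), ('dog', 2), ('rat', 1)]

# same reduce(add) over the counts that func() performs on the RDD
totalNo = reduce(add, [v for _, v in historic])

# fraction of the total per word (Python 3 true division)
solution = [(w, v / totalNo) for w, v in historic]
print(solution)  # [('cat', 0.25), ('dog', 0.5), ('rat', 0.25)]
```

Because the total is computed on the driver inside a single `transform` call, no second DStream is referenced from worker code, which is what avoids the SPARK-5063 error.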