How to find occurrences of an event per minute within a time range using PySpark
| tweet id | tweet created minute | Game start minute | Game end minute |
|---|---|---|---|
| 1001 | 145678 | 145600 | 145730 |
| 1002 | 145678 | 145600 | 145730 |
| 1005 | 145680 | 145600 | 145730 |
| 12278 | 145687 | 145600 | 145730 |
| 765558 | 145688 | 145600 | 145730 |
| 724323 | 145689 | 145600 | 145730 |
| 875857 | 145688 | 145600 | 145730 |
| 79375 | 145685 | 145600 | 145730 |
| 84666 | 145686 | 145600 | 145730 |
| 335556 | 145687 | 145600 | 145730 |
| 29990 | 145688 | 145600 | 145730 |
| 56 | 145689 | 145600 | 145730 |
| 968867 | 145690 | 145600 | 145730 |
| 8452 | 145691 | 145600 | 145730 |
| 1334 | 145679 | 145600 | 145730 |
There are 130 minutes in this match. How do I calculate the count of tweets per minute? "tweet id" represents a unique tweet.

Expected result format:
| minutes | count of tweets |
|---|---|
| 1 | 2 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 1 |
| 6 | 0 |
| 7 | 0 |
| 8 | 2 |
| 9 | 1 |
| 10 | 0 |
Assuming tweet id is unique, using PySpark and a raw RDD:
rdd = sc.parallelize([(1001 ,145678, 145600, 145730),
(1002 ,145678, 145600, 145730),
(1005 ,145680, 145600, 145730),
(12278 ,145687, 145600, 145730),
(765558 ,145688, 145600, 145730),
(724323 ,145689, 145600, 145730),
(875857 ,145688, 145600, 145730),
(79375 ,145685, 145600, 145730),
(84666 ,145686, 145600, 145730),
(335556 ,145687, 145600, 145730),
(29990 ,145688, 145600, 145730),
(56 ,145689, 145600, 145730),
(968867 ,145690, 145600, 145730),
(8452 ,145691, 145600, 145730),
(1334 ,145679, 145600, 145730) ])
# Keep tweets created within the game window, map each to its minute
# offset from the game start, and count occurrences per minute.
result_dict = rdd.filter(lambda x: x[2] <= x[1] <= x[3]) \
                 .map(lambda x: (x[1] - x[2], 0)) \
                 .countByKey()

print("minutes\tcount of tweets")
for minute, count in sorted(result_dict.items()):
    print("{0}\t{1}".format(minute, count))
Result:
minutes count of tweets
78 2
79 1
80 1
85 1
86 1
87 2
88 3
89 2
90 1
91 1