How to find occurrences of an event per minute within a time range using PySpark
| tweet id | tweet created minute | Game start minute | Game end minute |
|---|---|---|---|
| 1001 | 145678 | 145600 | 145730 |
| 1002 | 145678 | 145600 | 145730 |
| 1005 | 145680 | 145600 | 145730 |
| 12278 | 145687 | 145600 | 145730 |
| 765558 | 145688 | 145600 | 145730 |
| 724323 | 145689 | 145600 | 145730 |
| 875857 | 145688 | 145600 | 145730 |
| 79375 | 145685 | 145600 | 145730 |
| 84666 | 145686 | 145600 | 145730 |
| 335556 | 145687 | 145600 | 145730 |
| 29990 | 145688 | 145600 | 145730 |
| 56 | 145689 | 145600 | 145730 |
| 968867 | 145690 | 145600 | 145730 |
| 8452 | 145691 | 145600 | 145730 |
| 1334 | 145679 | 145600 | 145730 |
There are 130 minutes in this match. How do I calculate the count of tweets per minute? "tweet id" represents a unique tweet.

Expected result format:
| minutes | count of tweets |
|---|---|
| 1 | 2 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 1 |
| 6 | 0 |
| 7 | 0 |
| 8 | 2 |
| 9 | 1 |
| 10 | 0 |
Assuming tweet id is unique, using PySpark and a raw RDD:
rdd = sc.parallelize([(1001 ,145678, 145600, 145730),
(1002 ,145678, 145600, 145730),
(1005 ,145680, 145600, 145730),
(12278 ,145687, 145600, 145730),
(765558 ,145688, 145600, 145730),
(724323 ,145689, 145600, 145730),
(875857 ,145688, 145600, 145730),
(79375 ,145685, 145600, 145730),
(84666 ,145686, 145600, 145730),
(335556 ,145687, 145600, 145730),
(29990 ,145688, 145600, 145730),
(56 ,145689, 145600, 145730),
(968867 ,145690, 145600, 145730),
(8452 ,145691, 145600, 145730),
(1334 ,145679, 145600, 145730) ])
# Keep tweets created within the game window, map each to its minute
# offset from the game start, and count occurrences per minute.
result_dict = rdd.filter(lambda x: x[2] <= x[1] <= x[3]) \
                 .map(lambda x: (x[1] - x[2], 0)) \
                 .countByKey()

print("minutes\tcount of tweets")
for minute, count in sorted(result_dict.items()):
    print("{0}\t{1}".format(minute, count))
Result:
minutes count of tweets
78 2
79 1
80 1
85 1
86 1
87 2
88 3
89 2
90 1
91 1