How to find occurrences of an event per minute within a time range using PySpark
tweet id | tweet created minute | game start minute | game end minute |
---|---|---|---|
1001 | 145678 | 145600 | 145730 |
1002 | 145678 | 145600 | 145730 |
1005 | 145680 | 145600 | 145730 |
12278 | 145687 | 145600 | 145730 |
765558 | 145688 | 145600 | 145730 |
724323 | 145689 | 145600 | 145730 |
875857 | 145688 | 145600 | 145730 |
79375 | 145685 | 145600 | 145730 |
84666 | 145686 | 145600 | 145730 |
335556 | 145687 | 145600 | 145730 |
29990 | 145688 | 145600 | 145730 |
56 | 145689 | 145600 | 145730 |
968867 | 145690 | 145600 | 145730 |
8452 | 145691 | 145600 | 145730 |
1334 | 145679 | 145600 | 145730 |
The game lasts 130 minutes. How can I count the number of tweets for each minute of the game? Each "tweet id" identifies one unique tweet.
Expected result format:
minute | tweet count |
---|---|
1 | 2 |
2 | 1 |
3 | 2 |
4 | 3 |
5 | 1 |
6 | 0 |
7 | 0 |
8 | 2 |
9 | 1 |
10 | 0 |
Assuming the tweet id is unique, and using PySpark with a raw RDD:
from pyspark import SparkContext

# Reuse the running context (e.g. in the pyspark shell) or create one.
sc = SparkContext.getOrCreate()

# Each record: (tweet id, tweet created minute, game start minute, game end minute)
rdd = sc.parallelize([(1001, 145678, 145600, 145730),
                      (1002, 145678, 145600, 145730),
                      (1005, 145680, 145600, 145730),
                      (12278, 145687, 145600, 145730),
                      (765558, 145688, 145600, 145730),
                      (724323, 145689, 145600, 145730),
                      (875857, 145688, 145600, 145730),
                      (79375, 145685, 145600, 145730),
                      (84666, 145686, 145600, 145730),
                      (335556, 145687, 145600, 145730),
                      (29990, 145688, 145600, 145730),
                      (56, 145689, 145600, 145730),
                      (968867, 145690, 145600, 145730),
                      (8452, 145691, 145600, 145730),
                      (1334, 145679, 145600, 145730)])

# Keep tweets created during the game, key each one by its minute offset
# from the game start, then count the records per key.
result_dict = rdd.filter(lambda x: x[2] <= x[1] <= x[3]) \
                 .map(lambda x: (x[1] - x[2], 0)) \
                 .countByKey()

print("minutes count of tweets")
for minute, count in sorted(result_dict.items()):
    print("{0}\t{1}".format(minute, count))
Result:
minutes count of tweets
78 2
79 1
80 1
85 1
86 1
87 2
88 3
89 2
90 1
91 1
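Note that the result above only lists minutes in which at least one tweet was created, while the expected format in the question also shows minutes with a count of 0. A minimal sketch of one way to fill those gaps on the driver, once the counts have been collected, could look like this. The helper name `fill_missing_minutes` is hypothetical and not part of the original answer, and the sketch assumes the counts fit comfortably in driver memory (true here, since a game has only 131 possible minute offsets).

```python
def fill_missing_minutes(counts, game_start, game_end):
    """Return a (minute_offset, tweet_count) pair for every minute of the game.

    Hypothetical helper: offsets with no tweets get a count of 0.
    The end minute is included, matching the inclusive filter in the answer.
    """
    duration = game_end - game_start  # 130 for the sample data
    return [(minute, counts.get(minute, 0)) for minute in range(duration + 1)]


# Counts copied from the result above (what countByKey() returned).
result_dict = {78: 2, 79: 1, 80: 1, 85: 1, 86: 1, 87: 2, 88: 3, 89: 2, 90: 1, 91: 1}

filled = fill_missing_minutes(result_dict, 145600, 145730)

# Print only the window around the sample tweets to keep the output short.
print("minutes\tcount of tweets")
for minute, count in filled[78:92]:
    print("{0}\t{1}".format(minute, count))
```

With the sample data, minutes 81 through 84 now appear with a count of 0 instead of being skipped. For very long time ranges it may be preferable to build the full list of minutes as an RDD and join it against the counts, rather than filling the gaps on the driver.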