tweet id | tweet created minute | game start minute | game end minute |
---|---|---|---|
1001 | 145678 | 145600 | 145730 |
1002 | 145678 | 145600 | 145730 |
1005 | 145680 | 145600 | 145730 |
12278 | 145687 | 145600 | 145730 |
765558 | 145688 | 145600 | 145730 |
724323 | 145689 | 145600 | 145730 |
875857 | 145688 | 145600 | 145730 |
79375 | 145685 | 145600 | 145730 |
84666 | 145686 | 145600 | 145730 |
335556 | 145687 | 145600 | 145730 |
29990 | 145688 | 145600 | 145730 |
56 | 145689 | 145600 | 145730 |
968867 | 145690 | 145600 | 145730 |
8452 | 145691 | 145600 | 145730 |
1334 | 145679 | 145600 | 145730 |
The match lasts 130 minutes (from minute 145600 to minute 145730). How do I calculate the number of tweets per minute? Each "tweet id" identifies a unique tweet.
Expected result format:
minutes | count of tweets |
---|---|
1 | 2 |
2 | 1 |
3 | 2 |
4 | 3 |
5 | 1 |
6 | 0 |
7 | 0 |
8 | 2 |
9 | 1 |
10 | 0 |
Assuming each tweet id is unique, and using PySpark with a raw RDD:
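from pyspark import SparkContext

# Reuse the running context (e.g. in the pyspark shell) or create one.
sc = SparkContext.getOrCreate()

# Each record: (tweet id, tweet created minute, game start minute, game end minute)
rdd = sc.parallelize([(1001, 145678, 145600, 145730),
                      (1002, 145678, 145600, 145730),
                      (1005, 145680, 145600, 145730),
                      (12278, 145687, 145600, 145730),
                      (765558, 145688, 145600, 145730),
                      (724323, 145689, 145600, 145730),
                      (875857, 145688, 145600, 145730),
                      (79375, 145685, 145600, 145730),
                      (84666, 145686, 145600, 145730),
                      (335556, 145687, 145600, 145730),
                      (29990, 145688, 145600, 145730),
                      (56, 145689, 145600, 145730),
                      (968867, 145690, 145600, 145730),
                      (8452, 145691, 145600, 145730),
                      (1334, 145679, 145600, 145730)])

# Keep only tweets created during the game, key each one by its minute
# offset from the game start, then count the records per key.
result_dict = (rdd.filter(lambda x: x[2] <= x[1] <= x[3])
                  .map(lambda x: (x[1] - x[2], 0))
                  .countByKey())

print("minutes count of tweets")
for minute, count in sorted(result_dict.items()):
    print("{0}\t{1}".format(minute, count))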
Result:
minutes count of tweets
78 2
79 1
80 1
85 1
86 1
87 2
88 3
89 2
90 1
91 1
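Note that the result above lists only the minutes that have at least one tweet, whereas the expected format in the question also shows minutes with a count of 0. A minimal sketch of filling in those empty minutes is given here in plain Python (no Spark), assuming all rows belong to a single game and that minutes are numbered as 0-based offsets from the game start. The function name `tweets_per_minute` and the variable names are hypothetical and are not part of the original answer.

```python
from collections import Counter


def tweets_per_minute(rows):
    """Return [(minute_offset, tweet_count), ...] for every minute of the game.

    rows: iterable of (tweet_id, created_minute, start_minute, end_minute).
    Assumption: every row carries the same start and end minute (one game).
    """
    rows = list(rows)
    if not rows:
        return []
    start, end = rows[0][2], rows[0][3]
    # Count tweets per offset from the game start, ignoring out-of-game tweets.
    counts = Counter(
        created - start
        for _, created, s, e in rows
        if s <= created <= e
    )
    # Emit every minute of the game, using 0 where no tweet was recorded.
    return [(minute, counts.get(minute, 0)) for minute in range(end - start + 1)]


# Sample data from the question.
rows = [
    (1001, 145678, 145600, 145730), (1002, 145678, 145600, 145730),
    (1005, 145680, 145600, 145730), (12278, 145687, 145600, 145730),
    (765558, 145688, 145600, 145730), (724323, 145689, 145600, 145730),
    (875857, 145688, 145600, 145730), (79375, 145685, 145600, 145730),
    (84666, 145686, 145600, 145730), (335556, 145687, 145600, 145730),
    (29990, 145688, 145600, 145730), (56, 145689, 145600, 145730),
    (968867, 145690, 145600, 145730), (8452, 145691, 145600, 145730),
    (1334, 145679, 145600, 145730),
]

table = tweets_per_minute(rows)
for minute, count in table[76:83]:
    print("{0}\t{1}".format(minute, count))
```

The same zero-filling step could be applied to the dictionary returned by `countByKey()`, for example by looking up each minute with `result_dict.get(minute, 0)` over the range of game minutes.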