
How to find the occurrence of an event per minute within a time range using pyspark

tweet id    tweet created minute    Game start minute    Game end minute
1001        145678                  145600               145730
1002        145678                  145600               145730
1005        145680                  145600               145730
12278       145687                  145600               145730
765558      145688                  145600               145730
724323      145689                  145600               145730
875857      145688                  145600               145730
79375       145685                  145600               145730
84666       145686                  145600               145730
335556      145687                  145600               145730
29990       145688                  145600               145730
56          145689                  145600               145730
968867      145690                  145600               145730
8452        145691                  145600               145730
1334        145679                  145600               145730

There are 130 minutes in this match. How do I calculate the number of tweets per minute? "tweet id" represents a unique tweet.

Expected result format:

minutes    count of tweets
1          2
2          1
3          2
4          3
5          1
6          0
7          0
8          2
9          1
10         0

Assuming tweet id is unique, using PySpark and a raw RDD:

rdd = sc.parallelize([(1001 ,145678, 145600, 145730),
(1002 ,145678, 145600, 145730),
(1005 ,145680, 145600, 145730), 
(12278 ,145687, 145600, 145730), 
(765558 ,145688, 145600, 145730), 
(724323 ,145689, 145600, 145730), 
(875857 ,145688, 145600, 145730), 
(79375 ,145685, 145600, 145730), 
(84666 ,145686, 145600, 145730), 
(335556 ,145687, 145600, 145730), 
(29990 ,145688, 145600, 145730), 
(56 ,145689, 145600, 145730), 
(968867 ,145690, 145600, 145730), 
(8452 ,145691, 145600, 145730), 
(1334 ,145679, 145600, 145730) ])

# Keep tweets created between the game start and end, key each one by its
# minute offset from the game start, and count occurrences per key
# (countByKey ignores the dummy value 0).
result_dict = rdd.filter(lambda x: x[2] <= x[1] <= x[3]) \
    .map(lambda x: (x[1] - x[2], 0)) \
    .countByKey()

print("minutes count of tweets")
for minute, count in sorted(result_dict.items()):
    print("{0}\t{1}".format(minute, count))

Result:

minutes count of tweets
78  2
79  1
80  1
85  1
86  1
87  2
88  3
89  2
90  1
91  1
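
Note that this output only lists minutes that contain at least one tweet; quiet minutes are simply absent. If you want every minute of the 130-minute match reported with an explicit 0, as in the expected format in the question, one option is to left-join the per-minute counts against a full range of minutes. Below is a minimal DataFrame sketch reusing the rdd defined above; the column names and the spark session variable are assumptions for illustration, not part of the original answer.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Same sample data as the RDD above; column names are assumed here.
df = spark.createDataFrame(
    rdd, ["tweet_id", "tweet_minute", "start_minute", "end_minute"]
)

# Count tweets per minute offset from the game start.
counts = (
    df.filter(F.col("tweet_minute").between(F.col("start_minute"),
                                             F.col("end_minute")))
      .withColumn("minute", F.col("tweet_minute") - F.col("start_minute"))
      .groupBy("minute")
      .count()
)

# Build every minute of the match and left-join, so minutes with no
# tweets show a count of 0.
all_minutes = spark.range(0, 131).withColumnRenamed("id", "minute")
result = (
    all_minutes.join(counts, "minute", "left")
               .fillna(0, subset=["count"])
               .orderBy("minute")
)
result.show(131)

spark.range(0, 131) covers offsets 0 through 130, matching the filter condition start <= created <= end, which is inclusive at both ends.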

