
Using Apache-Spark to analyze time series

I have very large time series data; the data format is (arrival_time, key, value) and the unit of time is seconds, for example:

0.01, k, v
0.03, k, v
....
1.00, k, v
1.10, k, v
1.20, k, v
1.99, k, v
2.00, k, v
...

What I need to do is get the number of lines per second over the whole data set. So far I have been using pySpark, and my code looks like this:

linePerSec = []
lo = rdd.first()[0]         # arrival time of the first record
hi = lo + 1.0
end = rdd.collect()[-1][0]  # arrival time of the last record
while hi < end:
    # one full pass over the RDD for every one-second window
    number = rdd.filter(lambda x: lo <= x[0] < hi).count()
    linePerSec.append(number)
    lo = hi
    hi = lo + 1.0

But it's very slow, even slower than just going through the data line by line in a for loop. I guess that's because rdd.filter() goes through the whole RDD to find the lines that satisfy the filter condition. But for a time series there is no need to look at any data past the hi boundary in my code. Is there any solution to make Spark stop going through the RDD in my situation? Thanks!

First let's create some dummy data:

rdd = sc.parallelize(
    [(0.01, "k", "v"),
    (0.03, "k", "v"),
    (1.00, "k", "v"),
    (1.10, "k", "v"),
    (1.20, "k", "v"),
    (1.99, "k", "v"),
    (2.00, "k", "v"),
    (3.10, "k", "v"),
    (4.50, "k", "v")])

extract the time field from the RDD:

def get_time(x):
    (start, _, _) = x
    return start

times = rdd.map(get_time)

Next we'll need a function mapping from a time to a key:

def get_key_(start):
    # fractional part of the first arrival time, e.g. 0.01 for the dummy data
    offset = start - int(start)
    def get_key(x):
        # candidate bucket start within x's own integer second
        w = int(x) + offset
        # if x falls before that candidate, it belongs to the previous bucket
        return w if x >= w else int(x - 1) + offset
    return get_key

find the minimum and maximum time:

start = times.takeOrdered(1)[0]
end = times.top(1)[0]

generate an actual key function:

get_key = get_key_(start)
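
As a quick sanity check (not part of the original answer), each arrival time from the dummy data should map to the lower edge of its one-second bucket:

get_key(0.01)  # 0.01 -> bucket [0.01, 1.01)
get_key(1.00)  # 0.01 -> still the first bucket, since 1.00 < 1.01
get_key(1.99)  # 1.01 -> bucket [1.01, 2.01)
get_key(2.00)  # 1.01 -> same bucket, since 2.00 < 2.01
get_key(4.50)  # 4.01 -> bucket [4.01, 5.01)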

and compute the mean:

from operator import add

total = (times
  .map(lambda x: (get_key(x), 1))
  .reduceByKey(add)
  .values()
  .sum())

time_range = get_key(end) - get_key(start) + 1.0

mean = total / time_range

mean
## 1.8

Quick check:

  • [0.01, 1.01): 3
  • [1.01, 2.01): 4
  • [2.01, 3.01): 0
  • [3.01, 4.01): 1
  • [4.01, 5.01): 1

It gives 9 / 5 = 1.8.
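
The DataFrame version below starts from an existing df with the same three columns. If one is needed, it can be built from the dummy RDD; a minimal sketch, assuming a SparkSession (or SQLContext) has already been created and using assumed column names:

# assumes an active SparkSession so that toDF is available on the RDD
df = rdd.toDF(["arrival_time", "key", "value"])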

A DataFrame equivalent can look like this:

from pyspark.sql.functions import count, col, sum, lit, min, max

# Select only arrival times
arrivals = df.select("arrival_time")

# This is almost identical as before
start = df.agg(min("arrival_time")).first()[0]
end = df.agg(max("arrival_time")).first()[0]

get_key = get_key_(start)
time_range = get_key(end) - get_key(start) + 1.0

# But we'll need offset as well
offset = start - int(start)

# and define a bucket column
bucket = (col("arrival_time") - offset).cast("integer") + offset

line_per_sec = (df
    .groupBy(bucket)
    .agg(count("*").alias("cnt"))
    .agg((sum("cnt") / lit(time_range)).alias("mean")))

line_per_sec.show()

 ## +----+
 ## |mean|
 ## +----+
 ## | 1.8|
 ## +----+

Please note that this is very similar to the solution provided by Nhor, with two main differences:

  • uses the same start logic as your code
  • correctly handles empty intervals (illustrated in the sketch after the second answer below)

What I would do first is floor the time values:

from pyspark.sql.functions import *
df = df.select(floor(col('arrival_time')).alias('arrival_time'))

Now that arrival_time is floored, you're ready to count the number of lines in each second:

df = df.groupBy(col('arrival_time')).count()

Now that you have counted the lines in each second, you can sum those counts and divide by the number of seconds to get the average number of lines per second:

lines_sum = df.select(sum(col('count')).alias('lines_sum')).first().lines_sum
seconds_sum = df.select(count(col('arrival_time')).alias('seconds_sum')).first().seconds_sum
result = lines_sum / seconds_sum
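
One caveat, which is the "correctly handles empty intervals" point from the comparison above: groupBy produces no row for a second that contains no data, so seconds_sum undercounts the elapsed seconds and the average is overstated. A small sketch with hypothetical data (the dummy set from the first answer minus the single row in second 2; rdd2 and df2 are made-up names):

from pyspark.sql.functions import floor, col, count, sum

rdd2 = sc.parallelize(
    [(0.01, "k", "v"), (0.03, "k", "v"),
     (1.00, "k", "v"), (1.10, "k", "v"), (1.20, "k", "v"), (1.99, "k", "v"),
     (3.10, "k", "v"), (4.50, "k", "v")])  # nothing arrives in [2.0, 3.0)
df2 = rdd2.toDF(["arrival_time", "key", "value"])

counts = (df2
    .select(floor(col("arrival_time")).alias("arrival_time"))
    .groupBy("arrival_time")
    .count())

counts.select(sum(col("count"))).first()[0]           # 8 lines in total
counts.select(count(col("arrival_time"))).first()[0]  # 4 -- the empty second 2 is missing
# 8 / 4 = 2.0, although the data spans 5 seconds, so the mean should be 8 / 5 = 1.6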
