
Using Apache-Spark to analyze time series

I have very large time series data in the format (arrival_time, key, value), where the time unit is seconds, for example:

0.01, k, v
0.03, k, v
....
1.00, k, v
1.10, k, v
1.20, k, v
1.99, k, v
2.00, k, v
...

What I need to do is get the number of lines per second over the whole data set. So far I have been using PySpark, and my code looks like this:

linePerSec = []
lo = rdd.take(1)[0][0]           # arrival time of the first record
hi = lo + 1.0
end = rdd.collect()[-1][0]       # arrival time of the last record
while hi < end:
    number = rdd.filter(lambda r: lo <= r[0] < hi).count()
    linePerSec.append(number)
    lo = hi
    hi = lo + 1.0

But it's very slow, even slower than just going through the data line by line in a for loop. I guess this is because rdd.filter() scans the whole RDD to find the lines that satisfy the filter condition, but for a time series there is no need to look at any data after the hi boundary. Is there any way to make Spark stop scanning the RDD in my situation? Thanks!
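For reference, the line-by-line loop mentioned above is a single pass over the data. A minimal sketch in plain Python (records is an illustrative in-memory list of (arrival_time, key, value) tuples, sorted by time):

from collections import defaultdict

def lines_per_sec(records):
    # One pass: bucket each record relative to the first arrival time.
    t0 = records[0][0]
    counts = defaultdict(int)
    for t, _, _ in records:
        counts[int(t - t0)] += 1
    # Fill in empty seconds so the result is a dense list.
    return [counts[i] for i in range(max(counts) + 1)]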

First, let's create some dummy data:

rdd = sc.parallelize(
    [(0.01, "k", "v"),
    (0.03, "k", "v"),
    (1.00, "k", "v"),
    (1.10, "k", "v"),
    (1.20, "k", "v"),
    (1.99, "k", "v"),
    (2.00, "k", "v"),
    (3.10, "k", "v"),
    (4.50, "k", "v")])

Extract the time field from the RDD:

def get_time(x):
    (start, _, _) = x
    return start

times = rdd.map(get_time)

Next we'll need a function that maps a time to a bucket key:

def get_key_(start):
    # Buckets are [start + n, start + n + 1); offset is the fractional
    # part of the first arrival time.
    offset = start - int(start)
    def get_key(x):
        w = int(x) + offset
        # If x falls before its own second's bucket boundary, it belongs
        # to the previous bucket.
        return w if x >= w else int(x - 1) + offset
    return get_key
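
As a quick sanity check (with the dummy data, start = 0.01 so offset = 0.01; these throwaway calls just illustrate the bucketing, the real key function is generated below):

get_key = get_key_(0.01)
get_key(0.03)  # 0.01 -> bucket [0.01, 1.01)
get_key(1.00)  # 0.01 -> still [0.01, 1.01)
get_key(1.99)  # 1.01 -> bucket [1.01, 2.01)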

Find the minimum and maximum times:

start = times.takeOrdered(1)[0]
end = times.top(1)[0]

Generate the actual key function:

get_key = get_key_(start)

and compute the mean:

from operator import add

total = (times
  .map(lambda x: (get_key(x), 1))
  .reduceByKey(add)
  .values()
  .sum())

time_range = get_key(end) - get_key(start) + 1.0

mean = total / time_range

mean
## 1.8

Quick check:

  • [0.01, 1.01): 3
  • [1.01, 2.01): 4
  • [2.01, 3.01): 0
  • [3.01, 4.01): 1
  • [4.01, 5.01): 1

It gives 9 / 5 = 1.8
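
If you also need the actual per-second counts (the linePerSec list from the question) rather than just the mean, the same keyed pipeline yields them in one pass. A minimal sketch; note that empty buckets simply don't appear, so fill in zeros afterwards if you need them:

counts = (times
    .map(lambda x: (get_key(x), 1))
    .reduceByKey(add)
    .sortByKey()
    .collect())
## [(0.01, 3), (1.01, 4), (3.01, 1), (4.01, 1)]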

A DataFrame equivalent can look like this:

from pyspark.sql.functions import count, col, sum, lit, min, max

# Select only arrival times
arrivals = df.select("arrival_time")

# This is almost identical to before
start = df.agg(min("arrival_time")).first()[0]
end = df.agg(max("arrival_time")).first()[0]

get_key = get_key_(start)
time_range = get_key(end) - get_key(start) + 1.0

# But we'll need offset as well
offset = start - int(start)

# and define a bucket column
bucket = (col("arrival_time") - offset).cast("integer") + offset

line_per_sec = (df
    .groupBy(bucket)
    .agg(count("*").alias("cnt"))
    .agg((sum("cnt") / lit(time_range)).alias("mean")))

line_per_sec.show()

## +----+
## |mean|
## +----+
## | 1.8|
## +----+
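
The intermediate per-bucket counts are available before the final aggregation as well. A short sketch; with the dummy data it should show buckets 0.01, 1.01, 3.01 and 4.01 with counts 3, 4, 1 and 1:

per_bucket = (df
    .groupBy(bucket.alias("bucket"))
    .agg(count("*").alias("cnt"))
    .orderBy("bucket"))

per_bucket.show()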

Please note that this is very similar to the solution provided by Nhor below, with two main differences:

  • it uses the same start logic as your code
  • it correctly handles empty intervals

What I would do first is floor the time values:

from pyspark.sql.functions import *
df = df.select(floor(col('arrival_time')).alias('arrival_time'))

Now you have arrival_time floored and you're ready to count the number of lines in each second:

df = df.groupBy(col('arrival_time')).count()

Now that you have counted the lines in each second, you can sum all the counts and divide by the number of seconds to get the average lines per second:

lines_sum = df.select(sum(col('count')).alias('lines_sum')).first().lines_sum
seconds_sum = df.select(count(col('arrival_time')).alias('seconds_sum')).first().seconds_sum
result = lines_sum / seconds_sum
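
As noted in the previous answer, count(col('arrival_time')) only counts seconds that actually contain data, so empty seconds inflate the average. A minimal sketch of one way to include them, using the inclusive span between the floored minimum and maximum (min and max here come from the wildcard pyspark.sql.functions import above):

bounds = df.agg(min(col('arrival_time')).alias('lo'),
                max(col('arrival_time')).alias('hi')).first()
seconds_span = bounds.hi - bounds.lo + 1  # inclusive, counts empty seconds too
result = lines_sum / seconds_span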
