I have very large time series data in the format (arrival_time, key, value); the unit of time is seconds, for example:
0.01, k, v
0.03, k, v
....
1.00, k, v
1.10, k, v
1.20, k, v
1.99, k, v
2.00, k, v
...
What I need to do is get the number of lines per second over the whole data. So far I use PySpark, and my code looks like:

linePerSec = []
lo = rdd.first()[0]           # first arrival time
hi = lo + 1.0
end = rdd.collect()[-1][0]    # last arrival time
while hi < end:
    number = rdd.filter(lambda r: lo <= r[0] < hi).count()
    linePerSec.append(number)
    lo = hi
    hi = lo + 1.0
But it's very slow, even slower than just going through the data line by line in a for loop. I guess that's because rdd.filter() scans the whole RDD to find the lines that meet the filter's condition, but for time series data there is no need to look at anything past the hi boundary. Is there any way to make Spark stop scanning the RDD early in this situation? Thanks!
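For scale, the "line by line in a for loop" baseline mentioned above amounts to a single pass that buckets each record by its integer second and counts with a dict, whereas the repeated-filter loop rescans the data once per bucket. A minimal sketch, using a hypothetical in-memory sample (the real data would be far larger):

```python
from collections import Counter

# Hypothetical in-memory sample in (arrival_time, key, value) form
data = [(0.01, "k", "v"), (0.03, "k", "v"), (1.00, "k", "v"),
        (1.10, "k", "v"), (1.99, "k", "v"), (2.00, "k", "v")]

# One pass: bucket each record by its whole second and count
counts = Counter(int(t) for t, _, _ in data)
print(dict(sorted(counts.items())))  # {0: 2, 1: 3, 2: 1}
```

The answers below do essentially this, but distributed with a single groupBy/reduceByKey pass instead of one filter-and-count per second.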
First, let's create some dummy data:
rdd = sc.parallelize(
    [(0.01, "k", "v"),
     (0.03, "k", "v"),
     (1.00, "k", "v"),
     (1.10, "k", "v"),
     (1.20, "k", "v"),
     (1.99, "k", "v"),
     (2.00, "k", "v"),
     (3.10, "k", "v"),
     (4.50, "k", "v")])
Extract the time field from the RDD:

def get_time(x):
    (start, _, _) = x
    return start

times = rdd.map(get_time)
Next we'll need a function mapping from a time to a key:
def get_key_(start):
    offset = start - int(start)
    def get_key(x):
        w = int(x) + offset
        return w if x >= w else int(x - 1) + offset
    return get_key
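To see how this buckets values, here is a quick check with plain floats, using the dummy data's start time of 0.01 (the definition is repeated so the snippet runs stand-alone):

```python
def get_key_(start):
    offset = start - int(start)
    def get_key(x):
        w = int(x) + offset
        return w if x >= w else int(x - 1) + offset
    return get_key

# Buckets are aligned to the start time: [0.01, 1.01), [1.01, 2.01), ...
get_key = get_key_(0.01)
print(round(get_key(0.03), 2))  # 0.01 -> first bucket
print(round(get_key(1.10), 2))  # 1.01
print(round(get_key(2.00), 2))  # 1.01 -> still in the [1.01, 2.01) bucket
```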
Find the minimum and maximum times:

start = times.takeOrdered(1)[0]
end = times.top(1)[0]
Generate the actual key function:

get_key = get_key_(start)
and compute the mean:

from operator import add

total = (times
    .map(lambda x: (get_key(x), 1))
    .reduceByKey(add)
    .values()
    .sum())

time_range = get_key(end) - get_key(start) + 1.0
mean = total / time_range

mean
## 1.8
Quick check: 9 records over 5 one-second buckets gives 9 / 5 = 1.8.
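The same figure can be verified without Spark on the dummy data's arrival times, repeating the key function so the snippet is self-contained:

```python
times = [0.01, 0.03, 1.00, 1.10, 1.20, 1.99, 2.00, 3.10, 4.50]

def get_key_(start):
    offset = start - int(start)
    def get_key(x):
        w = int(x) + offset
        return w if x >= w else int(x - 1) + offset
    return get_key

get_key = get_key_(min(times))
# get_key(4.50) = 4.01, get_key(0.01) = 0.01 -> 5 one-second buckets
time_range = get_key(max(times)) - get_key(min(times)) + 1.0
mean = len(times) / time_range
print(mean)  # 1.8
```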
A DataFrame equivalent can look like this:
from pyspark.sql.functions import count, col, sum, lit, min, max

# Select only arrival times
arrivals = df.select("arrival_time")

# This is almost identical to before
start = df.agg(min("arrival_time")).first()[0]
end = df.agg(max("arrival_time")).first()[0]

get_key = get_key_(start)
time_range = get_key(end) - get_key(start) + 1.0

# But we'll need the offset as well
offset = start - int(start)

# and a bucket column
bucket = (col("arrival_time") - offset).cast("integer") + offset

line_per_sec = (df
    .groupBy(bucket)
    .agg(count("*").alias("cnt"))
    .agg((sum("cnt") / lit(time_range)).alias("mean")))
line_per_sec.show()
## +----+
## |mean|
## +----+
## | 1.8|
## +----+
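The integer cast in the bucket expression truncates toward zero, which coincides with get_key for the nonnegative times used here. A plain-Python sanity check of that equivalence (offset hard-coded to 0.01, the value for the dummy data):

```python
offset = 0.01  # start - int(start) for the dummy data

def bucket(t):
    # Mirrors (col("arrival_time") - offset).cast("integer") + offset
    return int(t - offset) + offset

for t in [0.03, 1.10, 2.00, 4.50]:
    print(t, round(bucket(t), 2))
```

For example, bucket(2.00) is int(1.99) + 0.01 = 1.01, the same key get_key assigns. Note that cast("integer") and int() truncate rather than floor, so this equivalence would break for times earlier than the offset (negative differences); all times here are nonnegative and at or after the start.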
Please note that this is very similar to the solution provided by Nhor below, with two main differences: the buckets here are aligned to the fractional offset of the start time rather than to whole seconds, and the mean is computed over the full time range rather than only over the seconds that actually contain data.
What I would do first is floor the time values:
from pyspark.sql.functions import *
df = df.select(floor(col('arrival_time')).alias('arrival_time'))
Now you have your arrival_time floored and you're ready to count the number of lines in each second:
df = df.groupBy(col('arrival_time')).count()
Now that you have the line count for each second, you can sum the counts and divide by the number of seconds to get the average lines per second:
lines_sum = df.select(sum(col('count')).alias('lines_sum')).first().lines_sum
seconds_sum = df.select(count(col('arrival_time')).alias('seconds_sum')).first().seconds_sum
result = lines_sum / seconds_sum
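On the dummy data from the other answer, this floor-based version also comes out to 1.8; a plain-Python check, assuming the same nine sample times:

```python
import math
from collections import Counter

times = [0.01, 0.03, 1.00, 1.10, 1.20, 1.99, 2.00, 3.10, 4.50]

# Equivalent of groupBy(floor(arrival_time)).count()
counts = Counter(int(math.floor(t)) for t in times)

lines_sum = sum(counts.values())  # 9 lines in total
seconds_sum = len(counts)         # 5 distinct whole seconds contain data
result = lines_sum / seconds_sum
print(result)  # 1.8
```

The two approaches agree here only because every whole second in the range happens to contain data; with gaps, this version divides by fewer seconds and reports a higher average.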