I have a file in the format below:
0, Alpha,-3.9, 4, 2001-02-01 08:00:00, 5, 20
0, Beta, -3.8, 3, 2001-02-01 08:15:00, 6, 21
1, Gamma,-3.7, 8, 2001-02-01 08:30:00, 7, 22
0, Alpha,-3.5, 4, 2001-02-01 08:45:00, 8, 23
0, Alpha,-3.9, 4, 2001-02-01 09:00:00, 8, 27
0, Gamma,-3.5, 5, 2001-02-01 09:15:00, 6, 21
and so forth. I am interested in the sum of the element at index 5 (zero-based) in each row, grouped by Alpha/Beta/Gamma, for a given time interval, for example 08:00:00 to 09:00:00. Using only RDD-based operations, I would like to get the following result for 08:00:00 to 09:00:00:
Alpha 21
Beta 6
Gamma 7
This is what I have so far:
rdd = sc.textFile(myDataset)
newrdd = rdd.map(myFun)  # myFun processes each line
filterrdd = newrdd.filter(lambda e : e[4].startswith('2001-02-01') )
But I don't know how to proceed, or at least I could not see a simple way to solve it using only RDD-based operations.
To filter by time between 08:00:00 and 09:00:00 (inclusive), you just need to make sure the time part of the string starts with either 08: or 09:00:00, so your filter function can be e[4].split()[1].startswith(('08:', '09:00:00')). Then you can apply the regular RDD operations such as reduceByKey():
newrdd.filter(lambda e: e[4].split()[1].startswith(('08:', '09:00:00'))) \
.map(lambda e: (e[1], int(e[5]))) \
.reduceByKey(lambda x,y: x+y) \
.collect()
#[(' Alpha', 21), (' Beta', 6), (' Gamma', 7)]
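The startswith() trick is tied to this particular window, and the keys in the output above keep a leading space, which suggests the fields were split on commas without stripping. As a rough sketch for other intervals, the timestamp can be parsed and compared directly. The names parse_line (a hypothetical stand-in for myFun, whose body is not shown in the question), in_window, sum_by_name and SAMPLE are assumptions made for illustration. The aggregation is emulated here with a plain dict so the logic can be checked without a Spark context.

```python
from datetime import datetime

# Sample rows copied from the question.
SAMPLE = [
    "0, Alpha,-3.9, 4, 2001-02-01 08:00:00, 5, 20",
    "0, Beta, -3.8, 3, 2001-02-01 08:15:00, 6, 21",
    "1, Gamma,-3.7, 8, 2001-02-01 08:30:00, 7, 22",
    "0, Alpha,-3.5, 4, 2001-02-01 08:45:00, 8, 23",
    "0, Alpha,-3.9, 4, 2001-02-01 09:00:00, 8, 27",
    "0, Gamma,-3.5, 5, 2001-02-01 09:15:00, 6, 21",
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_line(line):
    """Hypothetical stand-in for myFun: split on commas, strip whitespace."""
    return [field.strip() for field in line.split(",")]


def in_window(timestamp, start, end):
    """Return True if timestamp lies in the inclusive interval [start, end]."""
    t = datetime.strptime(timestamp.strip(), TIME_FORMAT)
    lower = datetime.strptime(start, TIME_FORMAT)
    upper = datetime.strptime(end, TIME_FORMAT)
    return lower <= t <= upper


def sum_by_name(lines, start, end):
    """Local emulation of filter -> map -> reduceByKey on parsed rows."""
    totals = {}
    for row in map(parse_line, lines):
        if in_window(row[4], start, end):
            # row[1] is the name, row[5] is the value being summed.
            totals[row[1]] = totals.get(row[1], 0) + int(row[5])
    return totals


if __name__ == "__main__":
    print(sum_by_name(SAMPLE, "2001-02-01 08:00:00", "2001-02-01 09:00:00"))
```

With an RDD, the same two helpers should slot into the existing chain, for example rdd.map(parse_line).filter(lambda e: in_window(e[4], start, end)).map(lambda e: (e[1], int(e[5]))).reduceByKey(lambda x, y: x + y), where start and end are timestamp strings in the same format. Parsing every timestamp costs more than a string prefix check, so the startswith() version may still be preferable when the window happens to align with a prefix.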