
pyspark: rdd operation for timesteps

I have a file with the following format:

0, Alpha,-3.9, 4, 2001-02-01 08:00:00, 5, 20
0, Beta, -3.8, 3, 2001-02-01 08:15:00, 6, 21
1, Gamma,-3.7, 8, 2001-02-01 08:30:00, 7, 22
0, Alpha,-3.5, 4, 2001-02-01 08:45:00, 8, 23
0, Alpha,-3.9, 4, 2001-02-01 09:00:00, 8, 27
0, Gamma,-3.5, 5, 2001-02-01 09:15:00, 6, 21

and so forth. I am interested in the sum of the 5th element (0-indexed) in each row for a given Alpha/Beta/Gamma over a time interval, for example between 08:00:00 and 09:00:00. For that interval I would like to get the following result, using only RDD-based operations:

Alpha 21
Beta 6
Gamma 7

This is what I have done so far:

rdd = sc.textFile(myDataset)
newrdd = rdd.map(myFun) # myFun processes each line
filterrdd = newrdd.filter(lambda e : e[4].startswith('2001-02-01') )
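For reference, myFun is not shown in the question; a minimal sketch, assuming it simply splits each CSV line into its fields, might look like this (with a plain split(',') the fields keep their leading spaces, e.g. ' Alpha'):

def myFun(line):
    # Split one CSV line into a list of string fields.
    # No stripping is done here, so fields like ' Alpha' and
    # ' 2001-02-01 08:00:00' keep their leading space.
    return line.split(',')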

But I don't know how to proceed from here, or at least I can't see a simple way to solve it using only RDD-based operations.

To filter by time between 08:00:00 and 09:00:00 (inclusive), you just need to make sure the time part of the string starts with either 08: or 09:00:00, so your filter function can be e[4].split()[1].startswith(('08:', '09:00:00')). Then you can do a regular RDD reduceByKey() etc.

newrdd.filter(lambda e: e[4].split()[1].startswith(('08:', '09:00:00'))) \
      .map(lambda e: (e[1], int(e[5]))) \
      .reduceByKey(lambda x,y: x+y) \
      .collect()
#[(' Alpha', 21), (' Beta', 6), (' Gamma', 7)]
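If you prefer clean keys without the leading space (i.e. 'Alpha' instead of ' Alpha'), a small variation on the same pipeline is to strip the name in the map step, assuming the fields still carry the space from a plain split(','):

# strip() removes the leading space from the name so the keys come out clean
newrdd.filter(lambda e: e[4].split()[1].startswith(('08:', '09:00:00'))) \
      .map(lambda e: (e[1].strip(), int(e[5]))) \
      .reduceByKey(lambda x, y: x + y) \
      .collect()
# [('Alpha', 21), ('Beta', 6), ('Gamma', 7)]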
