Iterate through list and compare values

I'm trying to iterate through a column filled with lists of dates and flag each date: if the second date is 10 minutes or more after the first date then '1', else '0'; if the third date is 10 minutes or more after the second date then '1', else '0'; and so on.

I'm sorry if this has already been answered; I just can't seem to find any help with this.

The lists are all different sizes. Does anyone know how I should go about doing something like this?

df = df_data_collective.groupBy("customer_id").agg(
    F.expr("collect_list(start_dt)").alias("start_times")
)

This outputs the customer ID and a list of datetimes like this:

['2020-04-02T08:15:50+01:00', '2020-04-02T08:15:53+01:00', '2020-04-02T08:15:56+01:00', '2020-04-02T08:16:01+01:00', '2020-04-02T08:16:07+01:00', '2020-04-02T08:21:05+01:00', '2020-04-02T08:21:17+01:00', '2020-04-02T08:21:30+01:00', '2020-04-02T08:21:43+01:00', '2020-04-02T08:21:49+01:00', '2020-04-02T08:22:11+01:00', '2020-04-02T08:22:16+01:00', '2020-04-02T08:24:02+01:00', '2020-04-02T08:24:09+01:00', '2020-04-02T08:24:37+01:00', '2020-04-02T08:36:26+01:00', '2020-04-02T08:39:25+01:00', '2020-04-02T08:39:41+01:00', '2020-04-02T08:39:52+01:00', '2020-04-02T08:40:18+01:00', '2020-04-02T08:40:27+01:00', '2020-04-02T08:40:33+01:00', '2020-04-02T08:40:49+01:00', '2020-04-02T08:41:03+01:00', '2020-04-02T08:41:29+01:00', '2020-04-02T08:42:00+01:00', '2020-04-02T08:42:23+01:00', '2020-04-02T08:42:57+01:00', '2020-04-02T08:44:43+01:00', '2020-04-02T08:44:49+01:00']

I have only a very basic knowledge of for loops and am still in training, so I'm looking to see if anyone can offer any advice.

from datetime import datetime, timedelta

dt_str_list = ['2020-04-02T08:15:50+01:00', '2020-04-02T08:15:53+01:00',
               '2020-04-02T08:15:56+01:00', '2020-04-02T08:16:01+01:00',
               '2020-04-02T08:16:07+01:00', '2020-04-02T08:21:05+01:00',
               '2020-04-02T08:21:17+01:00', '2020-04-02T08:21:30+01:00',
               '2020-04-02T08:21:43+01:00', '2020-04-02T08:21:49+01:00',
               '2020-04-02T08:22:11+01:00', '2020-04-02T08:22:16+01:00',
               '2020-04-02T08:24:02+01:00', '2020-04-02T08:24:09+01:00',
               '2020-04-02T08:24:37+01:00', '2020-04-02T08:36:26+01:00',
               '2020-04-02T08:39:25+01:00', '2020-04-02T08:39:41+01:00',
               '2020-04-02T08:39:52+01:00', '2020-04-02T08:40:18+01:00',
               '2020-04-02T08:40:27+01:00', '2020-04-02T08:40:33+01:00',
               '2020-04-02T08:40:49+01:00', '2020-04-02T08:41:03+01:00',
               '2020-04-02T08:41:29+01:00', '2020-04-02T08:42:00+01:00',
               '2020-04-02T08:42:23+01:00', '2020-04-02T08:42:57+01:00',
               '2020-04-02T08:44:43+01:00', '2020-04-02T08:44:49+01:00']


dt_list = [datetime.strptime(dt_str, '%Y-%m-%dT%H:%M:%S%z')
           for dt_str in dt_str_list]

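minute_10 = timedelta(minutes=10)
# flags[k] compares dt_list[k+1] with dt_list[k], so there is one flag fewer than there are dates;
# >= matches "10 mins or more" in the question
flags = [1 if dt_list[i] - dt_list[i-1] >= minute_10 else 0
         for i in range(1, len(dt_list))]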
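
Since the question mentions lists of different sizes, the comparison above can be wrapped in a small helper and applied to each list in turn. The following is a minimal sketch under a few assumptions: the name gap_flags and the sample dictionary are hypothetical, each list is taken to be already in chronological order, and a leading 0 is added so the result has the same length as the input (the first date has nothing to compare against).

```python
from datetime import datetime, timedelta

def gap_flags(dt_strs, threshold=timedelta(minutes=10)):
    """Return one flag per date: 1 if it is at least `threshold` after the previous date."""
    dts = [datetime.fromisoformat(s) for s in dt_strs]
    if not dts:
        return []
    # the first date has no predecessor, so it always gets 0
    return [0] + [1 if cur - prev >= threshold else 0
                  for prev, cur in zip(dts, dts[1:])]

# hypothetical per-customer lists of different sizes
start_times = {
    "a": ['2020-04-02T08:24:37+01:00', '2020-04-02T08:36:26+01:00',
          '2020-04-02T08:39:25+01:00'],
    "b": ['2020-04-02T08:15:50+01:00'],
    "c": [],
}
flags_by_customer = {cid: gap_flags(lst) for cid, lst in start_times.items()}
print(flags_by_customer)  # {'a': [0, 1, 0], 'b': [0], 'c': []}
```

Empty and single-element lists need no special handling by the caller, which is convenient when the list lengths vary per customer.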

You can use the str.split() method:

from datetime import datetime, timedelta

lst = ['2020-04-02T08:15:50+01:00', '2020-04-02T08:15:53+01:00', '2020-04-02T08:15:56+01:00', '2020-04-02T08:16:01+01:00', '2020-04-02T08:16:07+01:00', '2020-04-02T08:21:05+01:00', '2020-04-02T08:21:17+01:00', '2020-04-02T08:21:30+01:00', '2020-04-02T08:21:43+01:00', '2020-04-02T08:21:49+01:00', '2020-04-02T08:22:11+01:00', '2020-04-02T08:22:16+01:00', '2020-04-02T08:24:02+01:00', '2020-04-02T08:24:09+01:00', '2020-04-02T08:24:37+01:00', '2020-04-02T08:36:26+01:00', '2020-04-02T08:39:25+01:00', '2020-04-02T08:39:41+01:00', '2020-04-02T08:39:52+01:00', '2020-04-02T08:40:18+01:00', '2020-04-02T08:40:27+01:00', '2020-04-02T08:40:33+01:00', '2020-04-02T08:40:49+01:00', '2020-04-02T08:41:03+01:00', '2020-04-02T08:41:29+01:00', '2020-04-02T08:42:00+01:00', '2020-04-02T08:42:23+01:00', '2020-04-02T08:42:57+01:00', '2020-04-02T08:44:43+01:00', '2020-04-02T08:44:49+01:00']

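def s(d):
    # seconds since midnight, read from the HH:MM:SS part of the string
    # (the date and the UTC offset are ignored)
    h, m, sec = d.split(':', 2)
    h = int(h[-2:])*60*60
    m = int(m)*60
    sec = int(sec[:2])
    return h+m+sec

# current minus previous; the first element has no predecessor, so it gets 0
c = [1 if i and s(d)-s(lst[i-1]) >= 600 else 0 for i,d in enumerate(lst)]

print(c)

Output:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]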
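
One caveat with the split-based approach: it only reads the hour, minute and second fields, so the date and the UTC offset never enter the comparison. Here is a small sketch of where that matters, using two made-up timestamps that straddle midnight (the helper names are hypothetical, not from the answer):

```python
from datetime import datetime

def seconds_of_day(d):
    # same idea as the split-based helper: HH:MM:SS only, date and offset ignored
    h, m, rest = d.split(':', 2)
    return int(h[-2:]) * 3600 + int(m) * 60 + int(rest[:2])

def full_gap_seconds(prev, cur):
    # parses the whole ISO 8601 string, including the date and the UTC offset
    return (datetime.fromisoformat(cur) - datetime.fromisoformat(prev)).total_seconds()

prev, cur = '2020-04-02T23:55:00+01:00', '2020-04-03T00:10:00+01:00'

time_only_gap = seconds_of_day(cur) - seconds_of_day(prev)
real_gap = full_gap_seconds(prev, cur)
print(time_only_gap, real_gap)  # -85500 900.0
```

The real gap is 15 minutes, but the time-of-day difference is negative and would be flagged 0. If a customer's list can span more than one day, parsing the full timestamp is the safer choice.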

First, convert start_dt to a timestamp. Then, after collecting the list, apply the transform function (with the index as i) together with unix_timestamp to get the desired output. (transform is available as of Spark 2.4.)

from pyspark.sql import functions as F

df.show() #sample dataframe

#+-----------+-------------------------+
#|customer_id|start_dt                 |
#+-----------+-------------------------+
#|1          |2020-04-02T08:15:50+01:00|
#|1          |2020-04-02T08:15:53+01:00|
#|1          |2020-04-02T08:15:56+01:00|
#|1          |2020-04-02T08:16:01+01:00|
#|1          |2020-04-02T08:16:07+01:00|
#|1          |2020-04-02T08:21:05+01:00|
#|1          |2020-04-02T08:21:17+01:00|
#|1          |2020-04-02T08:21:30+01:00|
#|1          |2020-04-02T08:21:43+01:00|
#|1          |2020-04-02T08:21:49+01:00|
#+-----------+-------------------------+
#only showing top 10 rows


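# Note: this pattern does not interpret the trailing '+01:00' as a UTC offset.
# That is fine here because only differences between timestamps are compared
# and all sample rows share the same offset.
df.withColumn("start_dt", F.to_timestamp('start_dt',"yyyy-MM-dd'T'HH:mm:ss'+'SS:SS"))\
  .groupBy("customer_id").agg(F.collect_list('start_dt').alias('start_times'))\
  .withColumn("start_times", F.expr("""transform(start_times, (x, i) ->
      IF(i > 0 and (unix_timestamp(x) - unix_timestamp(start_times[i-1]) >= 600), 1, 0))"""))\
  .show(truncate=False)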

#+-----------+------------------------------------------------------------------------------------------+
#|customer_id|start_times                                                                               |
#+-----------+------------------------------------------------------------------------------------------+
#|1          |[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
#+-----------+------------------------------------------------------------------------------------------+
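
For readers who want to check the logic outside Spark, the same pipeline (group by customer, collect the timestamps, flag each element against its predecessor) can be mirrored in plain Python. This is only a sketch: the rows list and the function name flag_by_customer are made up for illustration. It also sorts each group explicitly, since collect_list does not guarantee the order of the collected elements after a shuffle.

```python
from collections import defaultdict
from datetime import datetime

def flag_by_customer(rows, threshold_seconds=600):
    """rows: iterable of (customer_id, ISO 8601 string) pairs."""
    groups = defaultdict(list)
    for customer_id, start_dt in rows:
        # epoch seconds, comparable to what unix_timestamp returns
        groups[customer_id].append(int(datetime.fromisoformat(start_dt).timestamp()))
    result = {}
    for customer_id, times in groups.items():
        times.sort()  # make the order explicit instead of relying on arrival order
        # i > 0 guards the first element, as in the transform lambda
        result[customer_id] = [
            1 if i > 0 and x - times[i - 1] >= threshold_seconds else 0
            for i, x in enumerate(times)
        ]
    return result

# hypothetical rows, deliberately out of order for customer 1
rows = [
    (1, '2020-04-02T08:36:26+01:00'),
    (1, '2020-04-02T08:24:09+01:00'),
    (1, '2020-04-02T08:24:37+01:00'),
    (2, '2020-04-02T08:15:50+01:00'),
]
print(flag_by_customer(rows))  # {1: [0, 0, 1], 2: [0]}
```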
