Iterate over rows, compare dates and append to list

Here is an example of my csv file:

EMPL_NO,ADRESSE,DIST_KM,DIST_MIN,DATE_INTERNE_INSPECTION
5,H4N 1P9,541,60,2023-06-03
5,H4N 1P9,541,60,2024-06-03
5,H4N 1P9,541,60,2023-05-29
5,H4N 1P9,541,60,2024-05-29
5,H4N 1P9,541,60,2023-06-05
5,H4N 1P9,541,60,2024-06-05
5,H4N 1P9,541,60,2026-06-05
12,H4N 1G4,503,40,2021-06-05
12,H4N 1G4,503,40,2023-06-05

EMPL_NO is my 'primary key'.

For every EMPL_NO, I need to compare the corresponding dates with each other and split them into groups. Dates that are at most 90 days apart should end up in the same group, and any remaining dates should be shown in groups of their own.

For example, the expected output for the data shown above should be:

5, 2023-06-03, 2023-05-29, 2023-06-05
5, 2024-06-03, 2024-05-29, 2024-06-05
5, 2026-06-05
12, 2021-06-05
12, 2023-06-05

Can I get a little help if possible?

Here is a non-pandas solution:

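import csv
import itertools as it
import datetime as dt

fn = 'inspections.csv'  # placeholder: path to your CSV file
day_range = 90

with open(fn, newline='') as f_in:
    r = csv.reader(f_in)
    header = next(r)  # skip the header row
    # Sort by employee number, then by date (ISO dates sort correctly as text)
    data = sorted(r, key=lambda x: (int(x[0]), x[-1]))
    for k, v in it.groupby(data, key=lambda x: x[0]):
        grp = list(v)
        # Convert the date column from 'YYYY-MM-DD' text to datetime.date
        for sl in grp:
            sl[-1] = dt.date.fromisoformat(sl[-1])
        while grp:
            # Start a new range with the earliest remaining date
            rng = [grp.pop(0)]
            # Extend it while the next date is within day_range of the previous one
            while grp and (grp[0][-1] - rng[-1][-1]).days <= day_range:
                rng.append(grp.pop(0))
            print('{}, {}'.format(k, ', '.join(str(e[-1]) for e in rng)))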

With a file containing the example data, this prints:

5, 2023-05-29, 2023-06-03, 2023-06-05
5, 2024-05-29, 2024-06-03, 2024-06-05
5, 2026-06-05
12, 2021-06-05
12, 2023-06-05
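
Note that the loop above compares each date with the previous date in its group, so a chain of closely spaced dates can span more than 90 days in total. If the intent is instead that every date must be within 90 days of the first date of its group, the comparison has to be anchored. The following is a minimal sketch of both readings; the function name split_by_gap and its anchor_first parameter are made up for illustration and are not part of the question.

```python
import datetime as dt

def split_by_gap(dates, day_range=90, anchor_first=False):
    """Split dates into consecutive groups.

    anchor_first=False: compare with the previous date in the group (chained).
    anchor_first=True:  compare with the first date in the group (anchored).
    """
    groups = []
    for d in sorted(dates):
        if groups:
            ref = groups[-1][0] if anchor_first else groups[-1][-1]
            if (d - ref).days <= day_range:
                groups[-1].append(d)
                continue
        groups.append([d])
    return groups

# Three dates, each 60 days after the previous one (120 days end to end).
dates = [dt.date(2023, 1, 1), dt.date(2023, 3, 2), dt.date(2023, 5, 1)]
chained = split_by_gap(dates)
anchored = split_by_gap(dates, anchor_first=True)
print(len(chained), len(anchored))  # prints: 1 2
```

For the example data in the question both readings give the same groups, because the dates inside each group are only a few days apart.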

Here is the same thing in Pandas:

import pandas as pd
import itertools as it

fn = 'inspections.csv'  # placeholder: path to your CSV file
day_range = 90

data = pd.read_csv(fn, parse_dates=['DATE_INTERNE_INSPECTION'])

data.sort_values(by=['EMPL_NO', 'DATE_INTERNE_INSPECTION'], inplace=True)

# Start a new group number whenever the gap to the previous row exceeds day_range
data['group'] = (data['DATE_INTERNE_INSPECTION'].diff()
                 > pd.Timedelta(days=day_range)).cumsum()

# Keying on (EMPL_NO, group) also starts a new line when the employee changes
for k, v in it.groupby(data.iterrows(),
                       key=lambda row: (row[1]['EMPL_NO'], row[1]['group'])):
    row = ', '.join(str(row[1]['DATE_INTERNE_INSPECTION'].date()) for row in v)
    print('{}, {}'.format(k[0], row))
# same output
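
The same grouping can probably be done without itertools by letting pandas compute the gaps per employee. This is only a sketch: it reads the example data from an in-memory string (csv_text is an illustrative name) instead of a file, and it uses the same chained comparison with the previous date.

```python
import io
import pandas as pd

csv_text = """EMPL_NO,ADRESSE,DIST_KM,DIST_MIN,DATE_INTERNE_INSPECTION
5,H4N 1P9,541,60,2023-06-03
5,H4N 1P9,541,60,2024-06-03
5,H4N 1P9,541,60,2023-05-29
5,H4N 1P9,541,60,2024-05-29
5,H4N 1P9,541,60,2023-06-05
5,H4N 1P9,541,60,2024-06-05
5,H4N 1P9,541,60,2026-06-05
12,H4N 1G4,503,40,2021-06-05
12,H4N 1G4,503,40,2023-06-05
"""

day_range = 90
col = 'DATE_INTERNE_INSPECTION'

df = pd.read_csv(io.StringIO(csv_text), parse_dates=[col])
df = df.sort_values(['EMPL_NO', col])

# Gap to the previous inspection of the same employee (NaT for the first one).
gap = df.groupby('EMPL_NO')[col].diff()
# A new group starts wherever the gap exceeds the limit; NaT compares as False.
df['group'] = (gap > pd.Timedelta(days=day_range)).cumsum()

lines = [
    '{}, {}'.format(empl, ', '.join(g[col].dt.strftime('%Y-%m-%d')))
    for (empl, _), g in df.groupby(['EMPL_NO', 'group'], sort=False)
]
print('\n'.join(lines))
```

Taking the diff inside each EMPL_NO group keeps the last date of one employee from being compared with the first date of the next.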

If you want to add more fields to the output, change the final print line in either version.
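
As an illustration of adding a field, here is a sketch of the csv-based approach that also prints ADRESSE. It assumes the address is the same for every row of an employee (as in the example data) and takes it from the first row of each group. The helper format_run and the shortened in-memory sample are made up for this example.

```python
import csv
import datetime as dt
import io
import itertools as it

# Shortened sample in the question's format (assumption: read from a string).
csv_text = """EMPL_NO,ADRESSE,DIST_KM,DIST_MIN,DATE_INTERNE_INSPECTION
5,H4N 1P9,541,60,2023-06-03
5,H4N 1P9,541,60,2023-05-29
5,H4N 1P9,541,60,2026-06-05
12,H4N 1G4,503,40,2021-06-05
"""

day_range = 90

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row['date'] = dt.date.fromisoformat(row['DATE_INTERNE_INSPECTION'])
rows.sort(key=lambda r: (int(r['EMPL_NO']), r['date']))

def format_run(run):
    # Extra fields go here: this version adds ADRESSE after EMPL_NO.
    first = run[0]
    dates = ', '.join(str(r['date']) for r in run)
    return '{}, {}, {}'.format(first['EMPL_NO'], first['ADRESSE'], dates)

lines = []
for _, grp in it.groupby(rows, key=lambda r: r['EMPL_NO']):
    run = []
    for row in grp:
        # Close the current run when the gap to the previous date is too large.
        if run and (row['date'] - run[-1]['date']).days > day_range:
            lines.append(format_run(run))
            run = []
        run.append(row)
    lines.append(format_run(run))

print('\n'.join(lines))
```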
