Here is an example of my csv file:
EMPL_NO,ADRESSE,DIST_KM,DIST_MIN,DATE_INTERNE_INSPECTION
5,H4N 1P9,541,60,2023-06-03
5,H4N 1P9,541,60,2024-06-03
5,H4N 1P9,541,60,2023-05-29
5,H4N 1P9,541,60,2024-05-29
5,H4N 1P9,541,60,2023-06-05
5,H4N 1P9,541,60,2024-06-05
5,H4N 1P9,541,60,2026-06-05
12,H4N 1G4,503,40,2021-06-05
12,H4N 1G4,503,40,2023-06-05
EMPL_NO is my 'primary key'. So, for every EMPL_NO, I need to compare the corresponding dates with each other and sort them into groups. Dates that are at most 90 days apart should go in the same group, and the remaining dates should be displayed in groups of their own.
For example, the expected output for the df seen above should be:
5, 2023-06-03, 2023-05-29, 2023-06-05
5, 2024-06-03, 2024-05-29, 2024-06-05
5, 2026-06-05
12, 2021-06-05
12, 2023-06-05
Can I get a little help if possible?
Here is a non-pandas solution:
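import csv
import itertools as it
import datetime as dt

fn = 'data.csv'  # placeholder: path to your CSV file
day_range = 90

with open(fn, newline='') as f_in:
    r = csv.reader(f_in)
    header = next(r)  # skip the header row
    # Sort by employee number, then by date (ISO dates sort correctly as strings)
    data = sorted(r, key=lambda x: (int(x[0]), x[-1]))

for k, v in it.groupby(data, key=lambda x: x[0]):
    grp = list(v)
    # Convert the date strings to datetime.date objects
    for sl in grp:
        sl[-1] = dt.date(*map(int, sl[-1].split('-')))
    # Start a new group whenever the gap to the previous date exceeds day_range
    while grp:
        rng = [grp.pop(0)]
        while grp and (grp[0][-1] - rng[-1][-1]).days <= day_range:
            rng.append(grp.pop(0))
        print('{}, {}'.format(k, ', '.join(str(e[-1]) for e in rng)))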
With a file containing the example given, this prints:
5, 2023-05-29, 2023-06-03, 2023-06-05
5, 2024-05-29, 2024-06-03, 2024-06-05
5, 2026-06-05
12, 2021-06-05
12, 2023-06-05
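Note that the loop above compares each date with the previous date in the current group, so a chain of closely spaced dates can end up spanning more than 90 days in total. If you would rather require every date to lie within 90 days of the first date of its group, something like the following sketch might work. The function name `group_dates` and its `anchor_first` parameter are hypothetical, introduced here only for illustration.

```python
import datetime as dt

def group_dates(dates, day_range=90, anchor_first=False):
    """Split dates into groups of nearby dates.

    anchor_first=False: compare with the previous date (chained, as in the loop above).
    anchor_first=True:  compare with the first date of the current group.
    """
    groups = []
    for d in sorted(dates):
        if groups:
            ref = groups[-1][0] if anchor_first else groups[-1][-1]
            if (d - ref).days <= day_range:
                groups[-1].append(d)
                continue
        groups.append([d])
    return groups

# Three dates roughly 60 days apart: the chain spans 120 days overall
dates = [dt.date(2023, 1, 1), dt.date(2023, 3, 1), dt.date(2023, 5, 1)]
chained = group_dates(dates)
anchored = group_dates(dates, anchor_first=True)
print(len(chained), len(anchored))  # prints: 1 2
```

For the sample data in the question both rules give the same groups, so the difference only matters when dates are spread out in a chain.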
Here is the same thing in pandas:
import pandas as pd
import itertools as it

day_range = 90
# fn is the path to the CSV file, as in the previous example
data = pd.read_csv(fn, parse_dates=['DATE_INTERNE_INSPECTION'])
data.sort_values(by=['EMPL_NO', 'DATE_INTERNE_INSPECTION'], inplace=True)
# Increment the group counter whenever the gap to the previous row exceeds day_range
data['group'] = (data['DATE_INTERNE_INSPECTION'].diff()
                 > pd.Timedelta(days=day_range)).cumsum()

# Rows are sorted, so consecutive rows sharing (EMPL_NO, group) form one output line
for k, v in it.groupby(data.iterrows(),
                       key=lambda row: (row[1]['EMPL_NO'], row[1]['group'])):
    row = ', '.join(str(row[1]['DATE_INTERNE_INSPECTION'].date()) for row in v)
    print('{}, {}'.format(k[0], row))
# same output
If you want to add fields to the output, you would do it in the final print call of each of these.
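As a rough illustration of adding a field, here is a sketch of the non-pandas version that also prints ADRESSE. It assumes the address is the same for every row of a given employee (as it is in the sample), reads a small subset of the question's data from an in-memory string instead of a file, and uses `csv.DictReader` so the columns can be referred to by name.

```python
import csv
import datetime as dt
import io
import itertools as it

# Subset of the sample rows from the question, kept in memory for the demo
sample = """EMPL_NO,ADRESSE,DIST_KM,DIST_MIN,DATE_INTERNE_INSPECTION
5,H4N 1P9,541,60,2023-06-03
5,H4N 1P9,541,60,2023-05-29
5,H4N 1P9,541,60,2026-06-05
12,H4N 1G4,503,40,2021-06-05
"""

day_range = 90
date_col = 'DATE_INTERNE_INSPECTION'

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    row[date_col] = dt.date.fromisoformat(row[date_col])
rows.sort(key=lambda r: (int(r['EMPL_NO']), r[date_col]))

lines = []
for k, v in it.groupby(rows, key=lambda r: r['EMPL_NO']):
    grp = list(v)
    while grp:
        rng = [grp.pop(0)]
        while grp and (grp[0][date_col] - rng[-1][date_col]).days <= day_range:
            rng.append(grp.pop(0))
        # The extra field (ADRESSE) is added here, in the final formatting step
        dates = ', '.join(str(r[date_col]) for r in rng)
        lines.append('{}, {}, {}'.format(k, rng[0]['ADRESSE'], dates))

print('\n'.join(lines))
# 5, H4N 1P9, 2023-05-29, 2023-06-03
# 5, H4N 1P9, 2026-06-05
# 12, H4N 1G4, 2021-06-05
```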