I have a csv I've imported as a pandas dataframe which looks like this:
TripId, DeviceId, StartDate, EndDate
817d0e7, dbf69e23, 2015-04-18T13:54:27.000Z, 2015-04-18T14:59:06.000Z
817d0f5, fkri449g, 2015-04-18T13:59:21.000Z, 2015-04-18T14:50:56.000Z
8145g5g, dbf69e23, 2015-04-18T15:12:26.000Z, 2015-04-18T16:21:04.000Z
4jhbfu4, fkigit95, 2015-04-18T14:23:40.000Z, 2015-04-18T14:59:38.000Z
8145g66, dbf69e23, 2015-04-20T11:20:24.000Z, 2015-04-20T16:22:41.000Z
...
I want to add a new column with an indicator value based on whether the DeviceId reappears in my dataframe with a StartDate within 1 hour after the current EndDate. So my new dataframe should look like:
TripId, DeviceId, StartDate, EndDate, newcol
817d0e7, dbf69e23, 2015-04-18T13:54:27.000Z, 2015-04-18T14:59:06.000Z, 1
817d0f5, fkri449g, 2015-04-18T13:59:21.000Z, 2015-04-18T14:50:56.000Z, 0
8145g5g, dbf69e23, 2015-04-18T15:12:26.000Z, 2015-04-18T16:21:04.000Z, 0
4jhbfu4, fkigit95, 2015-04-18T14:23:40.000Z, 2015-04-18T14:59:38.000Z, 0
8145g66, dbf69e23, 2015-04-20T11:20:24.000Z, 2015-04-20T16:22:41.000Z, 0
...
I've started to write some code, but I'm unsure how to proceed.
df['newcol'] = np.where(df['DeviceId'].isin(df['DeviceId']) and , 1, 0)
One problem is that I'm not sure how to find device id in dataframe excluding current row, and another is that I don't know how to tackle the time issue.
EDIT: I've been working on it a bit, and my new code is now:
df['UniqueId'] = range(0, 14571, 1)
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
df2 = df.loc[df.duplicated(subset=['DeviceId'],keep=False)]
#Returns list of trips with repeated deviceid
DeviceIds = df2['DeviceId'].tolist()
DeviceIds = list(set(DeviceIds))
for ID in DeviceIds:
    temp = df2.loc[df2['DeviceId'] == ID]
    temp.sort_values(by='StartDate')
    temp['PreviousEnd'] = temp['EndDate'].shift(periods=1)
    temp['Difference'] = temp['StartDate'] - temp['PreviousEnd']
    temp['Difference'] = [1 if x < pd.Timedelta('1H')
                          else 0 for x in temp['Difference']]
    temp = temp[['UniqueId','Difference']]
df.join(temp, on='UniqueId', how='left',rsuffix='2')
This creates the right temp dataframe, but I can't seem to join the values in Difference back to the original dataframe.
You can groupby and compare the EndDate column with the max value of startDate plus 1H:
def f(x):
    #print (x)
    #not sure if 1 Hour as added to startDate and if is necessary compare
    #with ==, <, >
    return x.EndDate > (x.startDate + pd.Timedelta('1H')).max()
mask = df.groupby('DeviceId').apply(f).reset_index(level=0, drop=True).reindex(df.index)
print (mask)
0 False
1 False
2 False
3 False
4 True
Name: EndDate, dtype: bool
Finally, convert the boolean mask to int:
df['new_col'] = mask.astype(int)
print (df)
TripId DeviceId startDate EndDate new_col
0 817d0e7 dbf69e23 2015-04-18 13:54:27 2015-04-18 14:59:06 0
1 817d0f5 fkri449g 2015-04-18 13:59:21 2015-04-18 14:50:56 0
2 8145g5g dbf69e23 2015-04-18 15:12:26 2015-04-18 16:21:04 0
3 4jhbfu4 fkigit95 2015-04-18 14:23:40 2015-04-18 14:59:38 0
4 8145g66 dbf69e23 2015-04-20 11:20:24 2015-04-20 16:22:41 1
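For comparison, here is a minimal sketch of another way to read the question: flag a trip when the same device's next trip starts within 1 hour after the current EndDate. It rebuilds the five sample rows from the question, and it assumes that only the immediately following trip of each device matters and that trips of one device do not overlap. The names ordered, next_start and gap are illustrative, not from the original posts.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "TripId": ["817d0e7", "817d0f5", "8145g5g", "4jhbfu4", "8145g66"],
    "DeviceId": ["dbf69e23", "fkri449g", "dbf69e23", "fkigit95", "dbf69e23"],
    "StartDate": ["2015-04-18T13:54:27.000Z", "2015-04-18T13:59:21.000Z",
                  "2015-04-18T15:12:26.000Z", "2015-04-18T14:23:40.000Z",
                  "2015-04-20T11:20:24.000Z"],
    "EndDate": ["2015-04-18T14:59:06.000Z", "2015-04-18T14:50:56.000Z",
                "2015-04-18T16:21:04.000Z", "2015-04-18T14:59:38.000Z",
                "2015-04-20T16:22:41.000Z"],
})
df["StartDate"] = pd.to_datetime(df["StartDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])

# Put each device's trips in chronological order
ordered = df.sort_values(["DeviceId", "StartDate"])

# StartDate of the same device's next trip (NaT for a device's last trip)
next_start = ordered.groupby("DeviceId")["StartDate"].shift(-1)

# Gap between the end of this trip and the start of the next one
gap = next_start - ordered["EndDate"]

# Comparisons against NaT are False, so last trips get 0
flag = (gap >= pd.Timedelta(0)) & (gap <= pd.Timedelta(hours=1))

# Assignment aligns on the original index, so row order in df is unchanged
df["newcol"] = flag.astype(int)
print(df[["TripId", "DeviceId", "newcol"]])
```

On the sample rows this marks only the first trip of dbf69e23, which is the newcol column the question asks for.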
I managed to get it working. The code I used was:
df['UniqueId'] = range(0, 14571, 1)
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
#converts dates to dateTime
df2 = df.loc[df.duplicated(subset=['DeviceId'],keep=False)]
#Returns list of trips with repeated deviceid
DeviceIds = df2['DeviceId'].tolist()
DeviceIds = list(set(DeviceIds))
df3 = pd.DataFrame(columns = ['UniqueId','Difference'])
for ID in DeviceIds:  # creates a mini dataframe for every DeviceId
    temp = df2.loc[df2['DeviceId'] == ID].copy()
    temp = temp.sort_values(by='StartDate')  # sort_values returns a new dataframe
    temp['PreviousEnd'] = temp['EndDate'].shift(periods=1)
    temp['Difference'] = temp['StartDate'] - temp['PreviousEnd']
    temp['Difference'] = [1 if x < pd.Timedelta('24H')
                          else 0 for x in temp['Difference']]
    temp = temp[['UniqueId','Difference']]
    df3 = pd.concat([df3,temp])
# join also returns a new dataframe, so keep the result
df = df.set_index('UniqueId').join(df3.set_index('UniqueId'),how='left')
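The per-device loop can probably be avoided. Below is a rough sketch of the same previous-trip comparison using groupby with shift, run on the five sample rows from the question. It keeps the 24-hour threshold from the code above and marks a trip when it starts less than 24 hours after the same device's previous EndDate. One difference to be aware of: devices that appear only once get 0 here, whereas the left join above would leave them as NaN. The names ordered and previous_end are illustrative.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "TripId": ["817d0e7", "817d0f5", "8145g5g", "4jhbfu4", "8145g66"],
    "DeviceId": ["dbf69e23", "fkri449g", "dbf69e23", "fkigit95", "dbf69e23"],
    "StartDate": ["2015-04-18T13:54:27.000Z", "2015-04-18T13:59:21.000Z",
                  "2015-04-18T15:12:26.000Z", "2015-04-18T14:23:40.000Z",
                  "2015-04-20T11:20:24.000Z"],
    "EndDate": ["2015-04-18T14:59:06.000Z", "2015-04-18T14:50:56.000Z",
                "2015-04-18T16:21:04.000Z", "2015-04-18T14:59:38.000Z",
                "2015-04-20T16:22:41.000Z"],
})
df["StartDate"] = pd.to_datetime(df["StartDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])

# Chronological order within each device
ordered = df.sort_values(["DeviceId", "StartDate"])

# EndDate of the same device's previous trip (NaT for a device's first trip)
previous_end = ordered.groupby("DeviceId")["EndDate"].shift(1)

# Time between the previous trip's end and this trip's start
difference = ordered["StartDate"] - previous_end

# NaT compares as False, so first trips and single-trip devices get 0
df["Difference"] = (difference < pd.Timedelta(hours=24)).astype(int)
print(df[["TripId", "DeviceId", "Difference"]])
```

Because the result is assigned by index, no UniqueId column or join is needed in this version.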