I am trying to output a table, where I can see the km run per day per vehicle, but the calculations I am getting is wrong.
to give a small snippet of the data I have. each vehicle sends its current odometer readout to a central server a few hundred times per day.
TS DATE VEH ODOMETER
2021-03-12 12:58:15.500 2021-03-12 008 2932
2021-03-12 00:00:21.700 2021-03-12 002 64253
2021-03-12 00:02:21.500 2021-03-12 002 64254
2021-03-12 00:03:41.400 2021-03-12 002 64255
2021-03-12 00:05:38.200 2021-03-12 002 64256
... ... ... ...
2021-03-12 23:55:88.100 2021-03-12 002 64953
2021-03-12 00:05:38.200 2021-03-13 002 64954
as it should be clear in above example, the first and last odometer readout for vehicle 2 on 2021-03-12 is 64953-64253 = 700 km but the first readout on the next day is 64954, so with grouping the diff() between the first and last odometer value for the day, some km is disappearing, as shown below
def grp_odo(dfObj):
dfObj['ODOMETER'] = dfObj['ODOMETER'].astype(int)
dfObj["km"] = dfObj.groupby(["DATE","VEH"])["ODOMETER"].diff()
sum_km = dfObj.groupby(["DATE","VEH"], as_index=False)["km"].sum()
return sum_km
dfodo = grp_odo(df[['DATE','VEH', 'ODOMETER']].loc[(~pd.isna(df['ODOMETER']))])
Print the groupby odometer difference per day:
print(dfodo)
DATE WSTRPVID km
0 2021-01-01 001 523.0
1 2021-01-01 002 700.0
2 2021-01-01 003 781.0
3 2021-01-01 004 2.0
4 2021-01-01 005 553.0
... ... ... ...
3375 2021-04-09 034 802.0
3376 2021-04-09 035 615.0
3377 2021-04-09 036 778.0
3378 2021-04-09 038 425.0
3379 2021-04-09 039 386.0
print the sum the total km run by vehicle to from start to finish of the grouped data.
print(dfodo[dfodo.VEH== "002"].sum())
km: 36796
Print the sum of highest and lowest odometer value in dataframe
print(df[df.VEH== "002"].groupby('VEH')['ODOMETER'].agg(np.ptp))
VEH
002 36898
Name: ODOMETER, dtype: int64
I would like to have an output, where 2021-03-12 is 701km, so difference between first value of 2021-03-12 and first value of 2021-03-13, is this possible?
Input:
>>> df
TS DATE VEH ODOMETER
0 2021-03-12 12:58:15.500 2021-03-12 008 2932.0
1 2021-03-12 00:00:21.700 2021-03-12 002 64253.0
2 2021-03-12 00:02:21.500 2021-03-12 002 64254.0
3 2021-03-12 00:03:41.400 2021-03-12 002 64255.0
4 2021-03-12 00:05:38.200 2021-03-12 002 64256.0
5 2021-03-12 23:55:48.100 2021-03-12 002 64953.0
6 2021-03-12 00:05:38.200 2021-03-13 002 64954.0
Output:
>>> df.assign(TOTAL=df.groupby("VEH")["ODOMETER"].shift(-1) - df["ODOMETER"]) \
.groupby(["DATE", "VEH"]).sum()["TOTAL"]
DATE VEH
2021-03-12 002 701.0
008 0.0
2021-03-13 002 0.0
Name: TOTAL, dtype: float64
df = pd.DataFrame({'DATE': ['2021-03-12', '2021-03-12', '2021-03-12', '2021-03-12', '2021-03-12', '2021-03-12', '2021-03-13'],
'VEH': ['008', '002', '002', '002', '002', '002', '002',],
'ODOMETER': [2932, 64253, 64254, 64255, 64256, 64953, 64954]})
df.sort_values(['VEH', 'DATE'], inplace=True)
DATE VEH ODOMETER
1 2021-03-12 002 64253
2 2021-03-12 002 64254
3 2021-03-12 002 64255
4 2021-03-12 002 64256
5 2021-03-12 002 64953
6 2021-03-13 002 64954
0 2021-03-12 008 2932
Create a column that takes minimum ODOMETER value for each vehicle, date group
dff = df.groupby(['VEH', 'DATE'], as_index=False).agg({'ODOMETER': 'min'})
VEH DATE ODOMETER
0 002 2021-03-12 64253
1 002 2021-03-13 64954
2 008 2021-03-12 2932
The take difference between each date for a vehicle group.
dff['TOTAL_DIST'] = dff.groupby('VEH')['ODOMETER'].shift(-1)-dff.groupby('VEH')['ODOMETER'].shift(0)
VEH DATE ODOMETER TOTAL_DIST
0 002 2021-03-12 64253 701.0
1 002 2021-03-13 64954 NaN
2 008 2021-03-12 2932 NaN
You can make use of 2 consecutive df.groupby()
together with GroupBy.first()
and shift()
as follows:
df_daily = df.groupby(['DATE', 'VEH'], as_index=False)['ODOMETER'].first()
df_daily['km_diff'] = df_daily.groupby('VEH')['ODOMETER'].shift(-1) - df_daily.groupby('VEH')['ODOMETER'].shift(0)
Test data construction
cols= ['TS', 'DATE', 'VEH', 'ODOMETER']
data = [
['2021-03-12 12:58:15.500', '2021-03-12', '008' , 2932],
['2021-03-13 12:58:15.500', '2021-03-13', '008' , 3032],
['2021-03-12 00:00:21.700', '2021-03-12', '002', 64253],
['2021-03-12 00:02:21.500', '2021-03-12', '002', 64254],
['2021-03-12 00:03:41.400', '2021-03-12', '002', 64255],
['2021-03-12 00:05:38.200', '2021-03-12', '002', 64256],
['2021-03-12 23:55:88.100', '2021-03-12', '002', 64953],
['2021-03-12 00:05:38.200', '2021-03-13', '002', 64954]
]
df = pd.DataFrame(data, columns=cols)
print(df)
TS DATE VEH ODOMETER
0 2021-03-12 12:58:15.500 2021-03-12 008 2932
1 2021-03-13 12:58:15.500 2021-03-13 008 3032 <=== Added this test data
2 2021-03-12 00:00:21.700 2021-03-12 002 64253
3 2021-03-12 00:02:21.500 2021-03-12 002 64254
4 2021-03-12 00:03:41.400 2021-03-12 002 64255
5 2021-03-12 00:05:38.200 2021-03-12 002 64256
6 2021-03-12 23:55:88.100 2021-03-12 002 64953
7 2021-03-12 00:05:38.200 2021-03-13 002 64954
Run new codes
df_daily = df.groupby(['DATE', 'VEH'], as_index=False)['ODOMETER'].first()
df_daily['km_diff'] = df_daily.groupby('VEH')['ODOMETER'].shift(-1) - df_daily.groupby('VEH')['ODOMETER'].shift(0)
Result:
print(df_daily)
DATE VEH ODOMETER km_diff
0 2021-03-12 002 64253 701.0
1 2021-03-12 008 2932 100.0
2 2021-03-13 002 64954 NaN
3 2021-03-13 008 3032 NaN
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.