Merge multiple rows of a single pandas DataFrame into one
I have a time-series DataFrame of train traffic data.
import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
},
    columns=['train', 'station', 'time', 'mvt'])
At the stations, trains either pass through or have some coaches attached or detached. Since this is time-series data, every event is on a separate row.
I need to merge the rows for the same train at the same station where two movements (mvt) happen one after the other (second timestamp > first timestamp), put the movements in two separate columns (mvt_x and mvt_y), and keep the timestamp of the last operation. For a single-row passage, mvt_y will always be NaN.
Here is the expected result:
   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00     10    NaN
1      1     1001 2020-05-25 13:50:00     -1    2.0
2      2     1000 2020-05-25 13:35:00     20    NaN
3      1     1002 2020-05-25 14:10:00      0    NaN
4      2     1003 2020-05-25 14:00:00      0    NaN
Create the data frame:
import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
},
    columns=['train', 'station', 'time', 'mvt'])
Compute a rank to distinguish (train, station) pairs that have one movement from those that have two. Then reshape the data frame using the rank:
df['rank'] = df.groupby(['train', 'station'])['time'].rank().astype(int)
# re-shape the data frame - 'rank' is part of column label
x = (df.set_index(['train', 'station', 'rank'])
       .unstack(level='rank')
       .reset_index())
# find rows with a time with rank=2 ...
mask = x.loc[:, ('time', 2)].notna()
# ... and replace time-1 with time-2 (keep later time only)
x.loc[mask, ('time', 1)] = x.loc[mask, ('time', 2)]
# drop time-2
x = x.drop(columns=('time', 2))
# re-name columns
x.columns = ['train', 'station', 'time', 'mvt_x', 'mvt_y']
print(x)
   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
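To make the rank step easier to follow, here is a small sketch that prints only the intermediate rank for the question's data. The `method='first'` argument is an assumption added here, not part of the answer above: it breaks ties by row order, so two events with identical timestamps would still get distinct integer ranks instead of a fractional average.

```python
import pandas as pd

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00',
                            '20200525 13:45:00',
                            '20200525 13:50:00',
                            '20200525 13:35:00',
                            '20200525 14:10:00',
                            '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
})

# Rank each event by time within its (train, station) pair.
# method='first' (an assumption of this sketch) keeps ranks integral on ties.
rank = (df.groupby(['train', 'station'])['time']
          .rank(method='first')
          .astype(int))

# Only the second event of train 1 at station 1001 gets rank 2
print(rank.tolist())  # [1, 1, 2, 1, 1, 1]
```

Rank 1 then becomes mvt_x and rank 2 becomes mvt_y after the unstack.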
Beat me to the punch... but here's code for cases with multiple visits to the same station:
# sort by time to account for multiple visits to a station
df = df.sort_values(['train', 'time', 'station'])
# label each run of consecutive rows at the same station as one stop
# (diff().cumsum() alone would only reproduce the station offset)
stopid = df.station.diff().ne(0).cumsum().rename('stopid')
# change df.time to the last time of each stop
df.time = df.groupby(['train', 'station', stopid]).time.transform('last')
# create index for mvt on train_station groups
df = df.assign(mvt_id=df.groupby(['train', 'station', 'time']).cumcount())
# reshape df, similar to pivot
df = (
    df.set_index(['train', 'station', 'time', 'mvt_id'])
    .unstack('mvt_id').droplevel(0, axis=1)
)
df.columns = ['mvt_x', 'mvt_y']  # hardcoded for only 2 movements per station
# might need a generator if expecting more than 2 mvts
df = df.reset_index()
print(df)
Output:
   train  station                time  mvt_x  mvt_y
0      1     1000 2020-05-25 13:30:00   10.0    NaN
1      1     1001 2020-05-25 13:50:00   -1.0    2.0
2      1     1002 2020-05-25 14:10:00    0.0    NaN
3      2     1000 2020-05-25 13:35:00   20.0    NaN
4      2     1003 2020-05-25 14:00:00    0.0    NaN
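The comment in the code above notes that the column names are hardcoded for two movements. As a rough sketch of how that could be generalized, the helper below (`merge_movements`, a hypothetical name) numbers the movements within each visit and generates the column names from those numbers. The names `visit`, `mvt_id` and the `mvt_0`, `mvt_1`, ... labels are assumptions of this sketch, not part of the question, which asks for `mvt_x` and `mvt_y`.

```python
import pandas as pd

def merge_movements(df):
    """Hypothetical helper: one row per visit, one column per movement."""
    df = df.sort_values(['train', 'time']).copy()
    # A new visit starts whenever the station changes within a train
    changed = df.groupby('train')['station'].diff().ne(0).astype(int)
    df['visit'] = changed.groupby(df['train']).cumsum()
    keys = ['train', 'visit', 'station']
    # Number the movements inside each visit: 0, 1, 2, ...
    df['mvt_id'] = df.groupby(keys).cumcount()
    # Keep the timestamp of the last operation of each visit
    last_time = df.groupby(keys)['time'].max()
    wide = df.set_index(keys + ['mvt_id'])['mvt'].unstack('mvt_id')
    # Generate as many column names as there are movements
    wide.columns = [f'mvt_{i}' for i in wide.columns]
    out = wide.join(last_time).reset_index().drop(columns='visit')
    return out[['train', 'station', 'time'] + list(wide.columns)]

df = pd.DataFrame({
    'train': [1, 1, 1, 2, 1, 2],
    'station': [1000, 1001, 1001, 1000, 1002, 1003],
    'time': pd.to_datetime(['20200525 13:30:00', '20200525 13:45:00',
                            '20200525 13:50:00', '20200525 13:35:00',
                            '20200525 14:10:00', '20200525 14:00:00']),
    'mvt': [10, -1, 2, 20, 0, 0],
})
print(merge_movements(df))
```

For the question's data this yields the columns mvt_0 and mvt_1, which can be renamed to mvt_x and mvt_y if only two movements per stop are expected.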