How to group ID based on 3 min intervals in pandas?
I have a dataframe that looks like this:
ID time city transport
0 1 10:20:00 London car
1 20 08:50:20 Berlin air plane
2 44 21:10:00 Paris train
3 32 10:24:00 Rome car
4 56 08:53:10 Berlin air plane
5 90 21:8:00 Paris train
.
.
.
1009 446 10:21:24 London car
I want to group these data so that rows with the same value in 'city' and 'transport' and a time difference within +3 min or -3 min have the same 'ID'.
I already tried pd.Grouper() like this, but it didn't work:
df['time'] = pd.to_datetime(df['time'])
df['ID'] = df.groupby([pd.Grouper(key= 'time',freq ='3min'),'city','transport'])['ID'].transform('first')
The output is the first dataframe I had, without any changes. One reason could be that pd.to_datetime also adds a date to "time", and because my data is very big the dates will differ and groupby doesn't work.
I couldn't figure out how to add a time interval (+3 min or -3 min) while using groupby and without adding a date to the 'time' column.
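A small check of why the attempt above can leave the IDs unchanged, offered as a sketch on the sample rows rather than a diagnosis of the full dataset. Fixed 3-minute bins aligned to midnight can separate two times that are less than 3 minutes apart. The snippet uses `pd.to_timedelta` so that no date is attached, and flooring to 3 minutes is used here as a stand-in for midnight-aligned bins; the variable names are illustrative.

```python
import pandas as pd

# Times from the sample rows that were expected to share an ID.
times = pd.Series(["08:50:20", "08:53:10", "21:10:00", "21:08:00"])

# to_timedelta keeps a pure time-of-day offset; no calendar date is attached.
td = pd.to_timedelta(times)

# Flooring to 3 minutes mimics fixed 3-minute bins counted from midnight.
bins = td.dt.floor("3min")

check = pd.DataFrame({"time": times, "bin": bins})
print(check)
```

Both Berlin times (2 min 50 s apart) and both Paris times (2 min apart) land in different bins, so a `transform('first')` per bin has nothing to merge.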
What I'm expecting is this:
ID time city transport
0 1 10:20:00 London car
1 20 08:50:20 Berlin air plane
2 44 21:10:00 Paris train
3 32 10:24:00 Rome car
4 20 08:53:10 Berlin air plane
5 44 21:8:00 Paris train
.
.
.
1009 1 10:21:24 London car
I have been struggling with this question for a while and would really appreciate any help. Thanks in advance.
Exploring pd.Grouper()
import io
import numpy as np
import pandas as pd

# columns are separated by two or more spaces so that "air plane" stays one field
df = pd.read_csv(io.StringIO("""    ID      time    city  transport
0    1  10:20:00  London        car
1   20  08:50:20  Berlin  air plane
2   44  21:10:00   Paris      train
3   32  10:24:00    Rome        car
4   56  08:53:10  Berlin  air plane
5   90  21:08:00   Paris      train
6   33  05:08:22   Paris      train"""), sep=r"\s\s+", engine="python")
# force in origin so grouper generates bucket every Xmins from midnight with no seconds...
df = pd.concat([pd.DataFrame({"time":[pd.Timedelta(0)],"dummy":[True]}), df]).assign(dummy=lambda dfa: dfa.dummy.fillna(False))
df = df.assign(td=pd.to_timedelta(df.time))
### DEBUGGER ### - see what's being grouped...
df.groupby([pd.Grouper(key="td", freq="6min"), "city","transport"]).agg(lambda x: list(x) if len(x)>0 else np.nan).dropna()
(td, city, transport) | time | dummy | ID
---|---|---|---
(Timedelta('0 days 05:06:00'), 'Paris', 'train') | ['05:08:22'] | [False] | [33.0]
(Timedelta('0 days 08:48:00'), 'Berlin', 'air plane') | ['08:50:20', '08:53:10'] | [False, False] | [20.0, 56.0]
(Timedelta('0 days 10:18:00'), 'London', 'car') | ['10:20:00'] | [False] | [1.0]
(Timedelta('0 days 10:24:00'), 'Rome', 'car') | ['10:24:00'] | [False] | [32.0]
(Timedelta('0 days 21:06:00'), 'Paris', 'train') | ['21:10:00', '21:08:00'] | [False, False] | [44.0, 90.0]
# finally, use a bucket of double the window (6 min). NB this is not a true +/- 3 min match:
# rows share an ID only when they fall into the same fixed bucket
(df.assign(ID=lambda dfa: dfa
.groupby([pd.Grouper(key= 'td',freq ='6min'),'city','transport'])['ID']
.transform('first'))
# cleanup... NB needs changing if dummy row is not inserted
.query("not dummy")
.drop(columns=["td","dummy"])
.assign(ID=lambda dfa: dfa.ID.astype(int))
)
time | ID | city | transport
---|---|---|---
10:20:00 | 1 | London | car
08:50:20 | 20 | Berlin | air plane
21:10:00 | 44 | Paris | train
10:24:00 | 32 | Rome | car
08:53:10 | 20 | Berlin | air plane
21:08:00 | 44 | Paris | train
05:08:22 | 33 | Paris | train
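The caveat in the comment above (fixed buckets are not a true +/- 3 min match) can be seen on a few made-up times. This is only an illustration with hypothetical values, again using flooring as a stand-in for buckets counted from midnight.

```python
import pandas as pd

# Hypothetical times chosen to sit near a 6-minute bucket edge (10:24:00).
td = pd.to_timedelta(pd.Series(["10:23:30", "10:24:30", "10:18:00", "10:23:59"]))

# Bucket start for each time, counted in 6-minute steps from midnight.
buckets = td.dt.floor("6min")

# Rows 0 and 1 are only 1 minute apart but straddle the bucket edge.
close_but_split = (td[1] - td[0], buckets[0], buckets[1])

# Rows 2 and 3 are almost 6 minutes apart but share a bucket.
far_but_joined = (td[3] - td[2], buckets[2], buckets[3])

print(close_but_split)
print(far_but_joined)
```

So the bucket approach merges the sample rows as expected, but on larger data it may split some near neighbours and join some rows that are more than 3 minutes apart.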
import numpy as np
from datetime import datetime

def convert(seconds):
    """Split seconds (wrapped to one day) into hours, minutes, seconds."""
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    return hour, minutes, seconds

def get_sec(h, m, s):
    """Get Seconds from time."""
    # np.empty serves as a "no previous row yet" sentinel and counts as 0
    if h == np.empty:
        h = 0
    if m == np.empty:
        m = 0
    if s == np.empty:
        s = 0
    return int(h) * 3600 + int(m) * 60 + int(s)

# assumes df has a lowercase 'id' column and 'time' as 'HH:MM:SS' strings
df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%H:%M:%S') if isinstance(x, str) else x)
df = df.sort_values(by=["time"])
print(df)
prev_hour = np.empty
prev_minute = np.empty
prev_second = np.empty
prev_id = None  # no previous row before the first iteration
for key, item in df.iterrows():
    curr_hour = item.time.hour
    curr_minute = item.time.minute
    curr_second = item.time.second
    curr_id = item.id
    curr_seconds = get_sec(curr_hour, curr_minute, curr_second)
    prev_seconds = get_sec(prev_hour, prev_minute, prev_second)
    diff_seconds = curr_seconds - prev_seconds
    hour, minute, second = convert(diff_seconds)
    # compares with the previous row in time order only; city and transport are not checked
    if prev_id is not None and hour == 0 and minute <= 3:
        df.loc[key, 'id'] = prev_id
    prev_hour = item.time.hour
    prev_minute = item.time.minute
    prev_second = item.time.second
    prev_id = item.id
print(df)
output:
id time city transport
1 20 1900-01-01 08:50:20 Berlin air plane
4 20 1900-01-01 08:53:10 Berlin air plane
0 1 1900-01-01 10:20:00 London car
3 32 1900-01-01 10:24:00 Rome car
5 90 1900-01-01 21:08:00 Paris train
2 90 1900-01-01 21:10:00 Paris train
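For comparison, a vectorised sketch of the same "gap to the previous row" idea, applied per ('city', 'transport') group. This is not either answer's method but a swapped-in variant using `groupby(...).diff()` and `cumsum()`; the helper columns `td` and `block` are hypothetical names, and the 3-minute threshold comes from the question. Note that small gaps chain, so one block can span more than 3 minutes in total, and times that wrap around midnight are not handled.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 20, 44, 32, 56, 90, 446],
    "time": ["10:20:00", "08:50:20", "21:10:00", "10:24:00",
             "08:53:10", "21:08:00", "10:21:24"],
    "city": ["London", "Berlin", "Paris", "Rome", "Berlin", "Paris", "London"],
    "transport": ["car", "air plane", "train", "car", "air plane", "train", "car"],
})

keys = ["city", "transport"]
# Time of day as a Timedelta, so no date is attached.
df["td"] = pd.to_timedelta(df["time"])

ordered = df.sort_values(keys + ["td"])
# Gap to the previous row of the same city/transport (NaT for a group's first row).
gap = ordered.groupby(keys)["td"].diff()
# A new block starts at a group's first row or after a gap above 3 minutes.
new_block = gap.isna() | (gap > pd.Timedelta(minutes=3))
# cumsum numbers the blocks; the assignment aligns on the original index.
df["block"] = new_block.cumsum()
# Each row takes the ID of the first row of its block in the original order.
df["ID"] = df.groupby("block")["ID"].transform("first")

result = df.drop(columns=["td", "block"])
print(result)
```

On these sample rows the two Berlin rows share ID 20, the two Paris rows share ID 44 and the two London rows share ID 1, which matches the expected output in the question.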