How to group ID based on 3 min intervals in pandas?

I have a dataframe that looks like this:

   ID     time      city        transport
0  1      10:20:00  London      car
1  20     08:50:20  Berlin      air plane
2  44     21:10:00  Paris       train
3  32     10:24:00  Rome        car
4  56     08:53:10  Berlin      air plane
5  90     21:8:00   Paris       train
.
.
.
1009 446  10:21:24  London     car

I want to group these rows so that rows with the same values in 'city' and 'transport' and a time difference within +3 min or -3 min get the same 'ID'.

I already tried pd.Grouper() like this, but it didn't work:

df['time'] = pd.to_datetime(df['time'])
df['ID'] = df.groupby([pd.Grouper(key= 'time',freq ='3min'),'city','transport'])['ID'].transform('first')

The output is the same as the original dataframe, without any changes. One reason could be that pd.to_datetime also adds a date to 'time', and because my data is very large the dates will differ and the groupby doesn't work. I couldn't figure out how to apply a time interval (+3 min or -3 min) with groupby without adding a date to the 'time' column.

What I'm expecting is this:

   ID     time      city        transport
0  1      10:20:00  London      car
1  20     08:50:20  Berlin      air plane
2  44     21:10:00  Paris       train
3  32     10:24:00  Rome        car
4  20     08:53:10  Berlin      air plane
5  44     21:8:00   Paris       train
.
.
.
1009 1  10:21:24  London     car

I have been struggling with this question for a while and would really appreciate any help. Thanks in advance.

Exploring pd.Grouper()

  1. I found it useful to insert a start time so that it's more obvious how the buckets are generated.
  2. Your requirement of +/- 3 mins is most closely a 6min bucket. This mostly matches your requirement, but +/- 3 mins of what?
  3. I have done something that just shows what has been grouped and shows the time bucket.
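
On point 2: a fixed bucket can still separate two rows that are under 3 minutes apart when they fall on either side of a bucket edge, for example 08:53:30 and 08:54:30 with 6min buckets counted from midnight. A minimal sketch of a different technique (not the pd.Grouper approach used in this answer) is to sort each city/transport group by time and start a new run whenever the gap to the previous row exceeds 3 minutes. The sample rows and the helper columns td and run are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: the two Berlin rows straddle the 08:54 bucket edge.
df = pd.DataFrame({
    "ID": [20, 56, 44, 90, 1],
    "time": ["08:53:30", "08:54:30", "21:10:00", "21:08:00", "10:20:00"],
    "city": ["Berlin", "Berlin", "Paris", "Paris", "London"],
    "transport": ["air plane", "air plane", "train", "train", "car"],
})

# Timedeltas carry no date, so nothing is added to the 'time' values.
df["td"] = pd.to_timedelta(df["time"])

# Sort so that neighbours within each (city, transport) group are adjacent in time.
df = df.sort_values(["city", "transport", "td"])

# Gap to the previous row of the same group (NaT for the first row of a group).
gap = df.groupby(["city", "transport"])["td"].diff()

# A new run starts at the first row of a group or after a gap above 3 minutes.
new_run = gap.isna() | (gap > pd.Timedelta(minutes=3))
df["run"] = new_run.cumsum()

# Restore the original row order, then give every run the ID that appears first.
df = df.sort_index()
df["ID"] = df.groupby("run")["ID"].transform("first")
df = df.drop(columns=["td", "run"])
print(df)
```

Runs chain in this sketch: if each row is within 3 minutes of the one before it, they all share one ID even when the first and last rows are further apart. Whether that is what "+/- 3 mins" should mean is the open question from point 2.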

setup

import io

import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""   ID     time      city        transport
0  1      10:20:00  London      car
1  20     08:50:20  Berlin      air plane
2  44     21:10:00  Paris       train
3  32     10:24:00  Rome        car
4  56     08:53:10  Berlin      air plane
5  90     21:08:00   Paris       train
6  33  05:08:22  Paris  train"""), sep=r"\s\s+", engine="python")

# force in origin so grouper generates bucket every Xmins from midnight with no seconds...
df = pd.concat([pd.DataFrame({"time":[pd.Timedelta(0)],"dummy":[True]}), df]).assign(dummy=lambda dfa: dfa.dummy.fillna(False))
# timedelta version of "time" (no date attached), used as the grouping key
df = df.assign(td=pd.to_timedelta(df.time))

analysis

### DEBUGGER ### - see whats being grouped...
df.groupby([pd.Grouper(key="td", freq="6min"), "city","transport"]).agg(lambda x: list(x) if len(x)>0 else np.nan).dropna()
  • see that two time buckets will group >1 ID
                                                       time                      dummy           ID
(Timedelta('0 days 05:06:00'), 'Paris', 'train')       ['05:08:22']              [False]         [33.0]
(Timedelta('0 days 08:48:00'), 'Berlin', 'air plane')  ['08:50:20', '08:53:10']  [False, False]  [20.0, 56.0]
(Timedelta('0 days 10:18:00'), 'London', 'car')        ['10:20:00']              [False]         [1.0]
(Timedelta('0 days 10:24:00'), 'Rome', 'car')          ['10:24:00']              [False]         [32.0]
(Timedelta('0 days 21:06:00'), 'Paris', 'train')       ['21:10:00', '21:08:00']  [False, False]  [44.0, 90.0]
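
A possible reason the 3min attempt in the question changed nothing: with buckets counted from midnight, each of the sample pairs falls on either side of a 3min edge, while a 6min bucket keeps them together. The small sketch below uses Series.dt.floor as a stand-in for the grouper's bucketing; that equivalence is an assumption made for illustration, and the names bins_3 and bins_6 are hypothetical.

```python
import pandas as pd

# The two pairs from the question that should end up sharing an ID.
times = pd.to_timedelta(pd.Series(["08:50:20", "08:53:10", "21:10:00", "21:08:00"]))

# Round each time down to the start of its bucket, counted from 00:00:00.
bins_3 = times.dt.floor("3min")
bins_6 = times.dt.floor("6min")

print(pd.DataFrame({"time": times, "3min": bins_3, "6min": bins_6}))

# With 3min buckets every row lands in its own bucket, so transform('first')
# has nothing to merge; with 6min buckets each pair shares a bucket.
print(bins_3.nunique(), bins_6.nunique())
```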

solution

# finally +/- double the window.  NB this is not +/- but rows that group the same
(df.assign(ID=lambda dfa: dfa
           .groupby([pd.Grouper(key= 'td',freq ='6min'),'city','transport'])['ID']
           .transform('first'))
 # cleanup... NB needs changing if dummy row is not inserted
 .query("not dummy")
 .drop(columns=["td","dummy"])
 .assign(ID=lambda dfa: dfa.ID.astype(int))
)
time      ID  city    transport
10:20:00  1   London  car
08:50:20  20  Berlin  air plane
21:10:00  44  Paris   train
10:24:00  32  Rome    car
08:53:10  20  Berlin  air plane
21:08:00  44  Paris   train
05:08:22  33  Paris   train

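If the dummy origin row is not wanted, one possible variant is to compute the bucket directly with Series.dt.floor, which rounds each timedelta down to a multiple of 6 minutes from zero. This is a sketch under that assumption rather than part of the solution above; the variable name bucket is hypothetical.

```python
import pandas as pd

# Same sample rows as in the setup, built directly as a DataFrame.
df = pd.DataFrame({
    "ID": [1, 20, 44, 32, 56, 90, 33],
    "time": ["10:20:00", "08:50:20", "21:10:00", "10:24:00",
             "08:53:10", "21:08:00", "05:08:22"],
    "city": ["London", "Berlin", "Paris", "Rome", "Berlin", "Paris", "Paris"],
    "transport": ["car", "air plane", "train", "car", "air plane", "train", "train"],
})

# Bucket start for each row; no date is attached and no dummy row is needed.
bucket = pd.to_timedelta(df["time"]).dt.floor("6min")

# Rows sharing bucket, city and transport take the first ID of their group.
df["ID"] = df.groupby([bucket, "city", "transport"])["ID"].transform("first")
print(df)
```

The same caveat applies as for the grouper version: these are rows that fall in the same fixed bucket, not rows that are strictly within +/- 3 minutes of each other.
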
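from datetime import datetime

# assumes df has the columns id, time, city, transport
# (lowercase 'id', as in the output below)

def convert(seconds):
    """Split a number of seconds into (hours, minutes, seconds) within one day."""
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    return hour, minutes, seconds

def get_sec(h, m, s):
    """Get seconds from time; a missing (None) component counts as 0."""
    return int(h or 0) * 3600 + int(m or 0) * 60 + int(s or 0)

# parse 'HH:MM:SS' strings; strptime fills in the default date 1900-01-01
df['time'] = df['time'].apply(lambda x: datetime.strptime(x, '%H:%M:%S') if isinstance(x, str) else x)

df = df.sort_values(by=["time"])
print(df)

# nothing to compare against before the first row
prev_hour = prev_minute = prev_second = None
prev_id = None
for key, item in df.iterrows():
    curr_time = item['time']
    curr_seconds = get_sec(curr_time.hour, curr_time.minute, curr_time.second)
    prev_seconds = get_sec(prev_hour, prev_minute, prev_second)
    diff_seconds = curr_seconds - prev_seconds
    hour, minute, second = convert(diff_seconds)
    # NB: only the previous row in time order is compared; city and transport
    # are not checked, and minute <= 3 accepts gaps of up to 3 min 59 s
    if prev_id is not None and hour == 0 and minute <= 3:
        df.loc[key, 'id'] = prev_id
    prev_hour = curr_time.hour
    prev_minute = curr_time.minute
    prev_second = curr_time.second
    # read the id back from df so a reassigned id carries on to the next row
    prev_id = df.loc[key, 'id']

print(df)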


output:
   id                time    city  transport
1  20 1900-01-01 08:50:20  Berlin  air plane
4  20 1900-01-01 08:53:10  Berlin  air plane
0   1 1900-01-01 10:20:00  London        car
3  32 1900-01-01 10:24:00    Rome        car
5  90 1900-01-01 21:08:00   Paris      train
2  90 1900-01-01 21:10:00   Paris      train


 