
Given a time, get rows within a 5-minute range in Python pandas

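I have two DataFrames.

What I want to do is iterate over each row in df_1, take its created_time and user_id, find the rows in df_2 that have the same user_id and a time within ±5 minutes, and take the data from the first such row. If no row falls within 5 minutes, return NaN.

Note: both DataFrames can contain multiple user_ids.

df_1 looks like: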

user_id      created_time       
   1          2020-03-01 00:00:25
   2          2020-03-06 04:20:25
   3          2020-03-06 07:00:15

df_2:

user_id          updated_at           lat        lng
  1          2020-03-01 00:02:25     35.2323    123.23
  2          2020-03-06 04:27:22     45.2323    121.23
  3          2020-03-06 06:59:59     13.2323    127.23
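
For reproducibility, the two sample frames above could be built roughly as follows. This is a minimal sketch that assumes the time columns should be parsed as datetimes rather than kept as strings.

```python
import pandas as pd

# Sample frame with one creation timestamp per user
df_1 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "created_time": pd.to_datetime([
        "2020-03-01 00:00:25",
        "2020-03-06 04:20:25",
        "2020-03-06 07:00:15",
    ]),
})

# Sample frame with location updates per user
df_2 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "updated_at": pd.to_datetime([
        "2020-03-01 00:02:25",
        "2020-03-06 04:27:22",
        "2020-03-06 06:59:59",
    ]),
    "lat": [35.2323, 45.2323, 13.2323],
    "lng": [123.23, 121.23, 127.23],
})
```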

This is what I'm doing now, but it seems very inefficient and error-prone.

import numpy as np

lng_list = []
lat_list = []
for row in df_1.itertuples():
    created_time = row.created_time
    user_id = row.user_id

    # Rows for this user updated at or after created_time
    later = df_2.loc[(df_2["user_id"] == user_id) &
                     (df_2["updated_at"] >= created_time)]
    if len(later) != 0:
        match = later.iloc[0]
    else:
        # Otherwise fall back to the most recent earlier update
        earlier = df_2.loc[(df_2["user_id"] == user_id) &
                           (df_2["updated_at"] <= created_time)]
        if len(earlier) == 0:
            # No row at all for this user: record NaN and move on
            lng_list.append(np.nan)
            lat_list.append(np.nan)
            continue
        match = earlier.iloc[-1]

    lng_list.append(match["lng"])
    lat_list.append(match["lat"])

df_1["lng"] = lng_list
df_1["lat"] = lat_list

After building the lists I insert them into df_1, which doesn't seem like good practice and is error-prone...

So my desired output is:

user_id      created_time          lat         lng
  1          2020-03-01 00:00:25   35.2323   123.23  <- within 5min range
  2          2020-03-06 04:20:25   NaN        NaN   
  3          2020-03-06 07:00:15   13.2323    127.23

Since you have multiple user_ids in both DataFrames, merge is probably your best option:

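import numpy as np
import pandas as pd

# Pair every df_1 row with all df_2 rows of the same user, then flag
# the pairs whose timestamps are less than 5 minutes apart.
# how='left' keeps every row of df_1, even when df_2 has no row for that user.
new_df = (df_1.merge(df_2, on='user_id', how='left')
              .assign(time_diff=lambda x: x.created_time.sub(x.updated_at)
                                           .abs().lt(pd.to_timedelta(5, unit='min')))
         )
# Blank out the coordinates of pairs outside the 5-minute window
new_df.loc[~new_df['time_diff'], ['lat','lng']] = np.nan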

Output:

   user_id        created_time          updated_at      lat     lng  time_diff
0        1 2020-03-01 00:00:25 2020-03-01 00:02:25  35.2323  123.23       True
1        2 2020-03-06 04:20:25 2020-03-06 04:27:22      NaN     NaN      False
2        3 2020-03-06 07:00:15 2020-03-06 06:59:59  13.2323  127.23       True

Note that this may not fully solve your problem, because each created_time can be paired with multiple updated_at values.
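
One way to handle several updated_at values per created_time is to keep only the earliest in-window update for each df_1 row. The sketch below is an illustration, not part of the original answer. It assumes "first row" means the earliest updated_at, treats the 5-minute bound as inclusive, and uses made-up sample data in which user 1 has two updates inside the window.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: user 1 has two in-window updates, user 2 has none.
df_1 = pd.DataFrame({
    "user_id": [1, 2],
    "created_time": pd.to_datetime(["2020-03-01 00:00:25",
                                    "2020-03-06 04:20:25"]),
})
df_2 = pd.DataFrame({
    "user_id": [1, 1, 2],
    "updated_at": pd.to_datetime(["2020-03-01 00:02:25",
                                  "2020-03-01 00:03:25",
                                  "2020-03-06 04:27:22"]),
    "lat": [35.2323, 36.0, 45.2323],
    "lng": [123.23, 124.0, 121.23],
})

window = pd.Timedelta(minutes=5)

# reset_index() keeps the df_1 row label in a column named "index",
# so each candidate pair can be mapped back to its df_1 row.
pairs = df_1.reset_index().merge(df_2, on="user_id", how="inner")
in_window = (pairs["created_time"] - pairs["updated_at"]).abs() <= window

# Earliest in-window update per df_1 row
first_match = (pairs[in_window]
               .sort_values("updated_at")
               .drop_duplicates("index")
               .set_index("index")[["lat", "lng"]])

# Index-based left join: df_1 rows without a match get NaN
result = df_1.join(first_match)
```

Because the join is on the df_1 index, the result has exactly one row per df_1 row, whatever the number of matching updates.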

Please check the following solution.

import pandas as pd

# Convert the date columns into datetime objects
df_1['created_time'] = pd.to_datetime(df_1['created_time'])
df_2['updated_at'] = pd.to_datetime(df_2['updated_at'])

# Create filters based on the conditions.
# Note: these comparisons are elementwise, so they assume df_1 and df_2
# share the same index and are aligned row by row (as in the sample data).
user_id_condition = df_1['user_id'] == df_2['user_id']
n_min_before = df_1['created_time'] - pd.to_timedelta(5, unit='min')
n_min_after = df_1['created_time'] + pd.to_timedelta(5, unit='min')
time_condition = (df_2['updated_at'] <= n_min_after) & (n_min_before <= df_2['updated_at'])

# Apply the filters and keep the matching rows of df_2
intersect_df2 = df_2[user_id_condition & time_condition][['lat', 'lng', 'user_id']]

# Merge df_1 with intersect_df2 (a left merge keeps every row of df_1)
output_df = pd.merge(df_1, intersect_df2, on='user_id', how='left')
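
When the two frames are not aligned row by row, pandas.merge_asof may be another option. The sketch below is not from either answer. It matches each df_1 row to the nearest df_2 update of the same user, which is an assumption about what is wanted (nearest rather than first), and it leaves NaN when that update is more than 5 minutes away. The frames are rebuilt from the question's sample data so the block runs on its own.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "created_time": pd.to_datetime(["2020-03-01 00:00:25",
                                    "2020-03-06 04:20:25",
                                    "2020-03-06 07:00:15"]),
})
df_2 = pd.DataFrame({
    "user_id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2020-03-01 00:02:25",
                                  "2020-03-06 04:27:22",
                                  "2020-03-06 06:59:59"]),
    "lat": [35.2323, 45.2323, 13.2323],
    "lng": [123.23, 121.23, 127.23],
})

# merge_asof requires both frames to be sorted by their time keys.
result = pd.merge_asof(
    df_1.sort_values("created_time"),
    df_2.sort_values("updated_at"),
    left_on="created_time",
    right_on="updated_at",
    by="user_id",                       # only match rows of the same user
    tolerance=pd.Timedelta(minutes=5),  # ignore updates further away than this
    direction="nearest",                # look both before and after created_time
)
```

On this sample, users 1 and 3 get their coordinates, while user 2 (whose only update is almost 7 minutes later) is left with missing values. The result is ordered by created_time, so the original df_1 order is preserved only if df_1 was already sorted by time.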

