Modify a row's id value based on a time-gap threshold in a datetime column

Question

I am working on the geolife dataset, which contains timestamped GPS tracks of users in text files (.plt). Each text file contains one user's GPS points for one trip. I therefore imported the dataset into postgres using a Python script.

Because the files are named with a string of digits derived from the trip's start time (for example, the file containing the trip in the table below is 20070920074804.plt), I use the file name (without the extension) as the trip id (session_id). This is the raw GPS data in the trajectories table:

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |    20070920074804 | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |    20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |    20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |    20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |    20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |    20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

For analysis purposes, I created another table, trips_metrics, into which I insert trip metrics computed from the trajectories table. Among the values I compute are trip distance (haversine) and duration (end time - start time).

Then I noticed something strange: one user's trip lasted 8 hours but covered a distance of only 321 m. Going through the trip file carefully, I noticed jumps in the trip's timestamps, suggesting a break in the trip (possibly the user stayed somewhere for hours and then continued). An example is the jump between row 3 and row 4 in the table above.

To get accurate trip times, I need to split trips in such cases: if the time interval between consecutive rows exceeds 30 minutes, the later rows should be considered a new trip (and thus get a new ID).

I intend to append the digits ..02, ..03, .. to the trip's current session_id in my trajectories table before actually computing trip metrics (i.e. by modifying the trajectories table). So for the example in the table above, I want to split it this way:

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |  20070920074804   | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |  20070920074804   | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |  20070920074804   | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |  2007092007480402 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |  2007092007480402 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |  2007092007480402 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

Notice how I assign the session_id for the new trip (since the gap before it is more than 30 minutes).

How can I make this modification to my raw GPS table (trajectories) in postgres?

EDIT

A: The first query in the answer from @GMB works; however, it gives each row a new session_id in the new_session_id column.

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480402 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480403 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480404 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480405 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

Expected Result:

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480401 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

The idea is to give the "emerging" trip a new id of old_session_id + 01. If another emerging trip is encountered, it should be assigned old_session_id + 02, and so on.

B: The second query, with the update option, contains a syntax error:

update trajectories t
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp))::int is_gap
        from trajectories t
    ) t
) t1
set session_id = t1.new_session_id
where t1.session_id = t.session_id and t1.timestamp = t.timestamp

ERROR:  syntax error at or near "from"
LINE 2: from (

Answer 1

You can use lag() and a cumulative count of the gaps to identify the segments, and then some way of munging the session_id:

select (case when grp >= 1 then session_id * 100 + grp
             else session_id
        end) as new_session_id,
       t.*
from (select t.*,
             count(*) filter (where prev_ts < timestamp - interval '30 minute') over (partition by session_id order by timestamp) as grp
      from (select t.*, 
                   lag(timestamp) over (partition by session_id order by timestamp) as prev_ts
            from trajectories t
           ) t
     ) t;

Here is a db<>fiddle.
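To sanity-check the logic outside the database, the same idea (previous timestamp per session, running count of gaps, then a suffixed id) can be sketched in plain Python. This is only an illustration under assumptions: the function name split_sessions and the in-memory row format are hypothetical, session_id is assumed to be an integer, and the time zone offsets of the sample rows are dropped.

```python
from datetime import datetime, timedelta

def split_sessions(rows, threshold=timedelta(minutes=30)):
    """Assign new session ids to (session_id, timestamp) rows.

    Mirrors the query: lag() per session, a running count of gaps,
    then session_id * 100 + count for rows after the first gap.
    """
    prev_ts = {}    # session_id -> previous timestamp, like lag(timestamp)
    gap_count = {}  # session_id -> gaps seen so far, like the running count(*)
    out = []
    for session_id, ts in sorted(rows):  # order by session_id, then timestamp
        last = prev_ts.get(session_id)
        if last is not None and ts - last > threshold:
            gap_count[session_id] = gap_count.get(session_id, 0) + 1
        prev_ts[session_id] = ts
        grp = gap_count.get(session_id, 0)
        new_id = session_id * 100 + grp if grp >= 1 else session_id
        out.append((new_id, ts))
    return out

# The six sample rows from the question (time zone offsets dropped).
sid = 20070920074804
sample = [
    (sid, datetime(2007, 9, 20, 7, 48, 4)),
    (sid, datetime(2007, 9, 20, 8, 7, 9)),
    (sid, datetime(2007, 9, 20, 8, 7, 10)),
    (sid, datetime(2007, 9, 20, 14, 3, 50)),
    (sid, datetime(2007, 9, 20, 14, 4, 59)),
    (sid, datetime(2007, 9, 20, 14, 5, 1)),
]

for new_id, ts in split_sessions(sample):
    print(new_id, ts)
```

On the sample rows the first three keep 20070920074804 and the last three get 2007092007480401. The question asks for the first split-off trip to end in 02; with this scheme that would mean adding grp + 1 instead of grp.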

Answer 2

This is a gaps-and-islands problem. You want to detect consecutive rows whose timestamps differ by more than 30 minutes, and then change the session_id accordingly.

One option is to use lag() and then a cumulative count of the gaps; you can then use that information to compute the new session_id:

select 
    t.*,
    case when sum(is_gap) over(partition by session_id order by timestamp) > 0
        then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
        else session_id
    end new_session_id
from (
    select
        t.*,
        (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minutes')::int is_gap
    from trajectories t
) t
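The id arithmetic in the case expression can be checked by hand. The sketch below is only an illustration: the helper name suffixed_id is hypothetical, and it assumes session_id is stored as a bigint (the question does not state the column type). It also makes explicit that multiplying by 100 leaves room for at most 99 gaps per original session.

```python
BIGINT_MAX = 2**63 - 1  # upper bound of PostgreSQL's bigint type

def suffixed_id(session_id, gap_count):
    """Return session_id * 100 + gap_count, or session_id when there is no gap."""
    if not 0 <= gap_count <= 99:
        # A count of 100 or more would spill into the original id's digits.
        raise ValueError("gap_count must be between 0 and 99")
    return session_id * 100 + gap_count if gap_count > 0 else session_id

new_id = suffixed_id(20070920074804, 1)
print(new_id)                # 2007092007480401
print(new_id <= BIGINT_MAX)  # True: the suffixed id still fits in a bigint
```

If the column were a 4-byte integer or a text type instead, the multiplication would need a cast or string concatenation; that depends on the actual schema.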

You can turn this into an update statement if needed:

update trajectories t
set session_id = t1.new_session_id
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minutes')::int is_gap
        from trajectories t
    ) t
) t1
where t1.session_id = t.session_id and t1.timestamp = t.timestamp

Answer 1: 1, accepted, 2020-06-18 00:09:37
Answer 2: 1, 2020-06-18 00:16:30
