
Modifying the id value of a row based on an interval time threshold in a date-time column

I am working on the GeoLife dataset, which contains timestamped GPS tracks of users in text files (.plt). Each text file contains the user's GPS points for one trip. I therefore imported the dataset into postgres using a Python script.

Because the files are named with a string of digits derived from the trip's start time (for example, the file containing the trip in the table below is 20070920074804.plt), I use the file name (without the extension) as the trip id (session_id). This is the raw GPS data in the trajectories table:

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |    20070920074804 | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |    20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |    20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |    20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |    20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |    20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

For analysis purposes, I created another table, trips_metrics, where I store trip metrics computed from the trajectories table. Among the values I compute are trip distance (haversine) and duration (end time - start time).

Then I noticed something strange: one user's trip lasted 8 hours but covered a distance of only 321 m. Going through the trip file thoroughly, I noticed jumps in the trip's timestamps, suggesting a break in the trip (possibly the user stayed somewhere for hours and then continued). An example is between row 3 and row 4 in the table above.

To get an accurate trip time, I need to split trips in these cases: if the time interval between consecutive rows exceeds 30 minutes, the later row should start a new trip (and thus get a new ID).

I intend to append digits (..02, ..03, ...) to the trip's current session_id in my trajectories table before actually computing the trip metrics (i.e. modifying the trajectories table). So for the example in the table above, I want to split it this way:

 user_id |    session_id     |       timestamp        |    lat    |    lon     | alt 
---------+-------------------+------------------------+-----------+------------+-----
      11 |  20070920074804   | 2007-09-20 07:48:04+01 |  28.19737 | 113.006795 |  71
      11 |  20070920074804   | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87
      11 |  20070920074804   | 2007-09-20 08:07:10+01 | 28.197685 |  113.00679 |  87
      11 |  2007092007480402 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62
      11 |  2007092007480402 | 2007-09-20 14:04:59+01 | 28.197108 |  113.00734 |  62
      11 |  2007092007480402 | 2007-09-20 14:05:01+01 | 28.197088 |  113.00727 |  62

Notice how I assign the session_id for the new trip (since the time in between is more than 30 minutes).

How can I make this modification to my raw GPS table (trajectories) in postgres?

EDIT

A: The first query in the answer from @GMB works; however, it gives each row a new session_id in the new_session_id column.

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480402 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480403 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480404 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480405 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

Expected Result:

+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id |   session_id   |       timestamp        |    lat    |    lon     | alt | is_gap |  new_session_id  |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
|      11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737  | 113.006795 |  71 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 |  87 |        |   20070920074804 |
|      11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679  |  87 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734  |  62 |      1 | 2007092007480401 |
|      11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727  |  62 |      1 | 2007092007480401 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+

The idea is to give the "emerging" trip a new id of the form old_session_id + 01. If another emerging trip is encountered, it should be assigned old_session_id + 02, and so on.

B: The second query, with the update option, contains a syntax error:

update trajectories t
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp))::int is_gap
        from trajectories t
    ) t
) t1
set session_id = t1.new_session_id
where t1.session_id = t.session_id and t1.timestamp = t.timestamp

ERROR:  syntax error at or near "from"
LINE 2: from (

You can use lag(), a cumulative sum to identify the segments, and then some way of munging the session_id:

select (case when grp >= 1 then session_id * 100 + grp
             else session_id
        end) as new_session_id,
       t.*
from (select t.*,
             count(*) filter (where prev_ts < timestamp - interval '30 minute') over (partition by session_id order by timestamp) as grp
      from (select t.*, 
                   lag(timestamp) over (partition by session_id order by timestamp) as prev_ts
            from trajectories t
           ) t
     ) t;

Here is a db<>fiddle.

This is a gaps-and-islands problem. You want to detect consecutive rows whose timestamp difference is greater than 30 minutes, and then change the session_id accordingly.

An option is to use lag(), and then a cumulative count of the gaps; you can then use that information to compute the new session_id:

select 
    t.*,
    case when sum(is_gap) over(partition by session_id order by timestamp) > 0
        then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
        else session_id
    end new_session_id
from (
    select
        t.*,
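        -- 1 when this row is more than 30 minutes after the previous row of the session
        (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap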
    from trajectories t
) t

You can turn this into an update statement if needed:

update trajectories t
set session_id = t1.new_session_id
from (
    select 
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
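            -- 1 when this row is more than 30 minutes after the previous row of the session
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap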
        from trajectories t
    ) t
) t1
where t1.session_id = t.session_id and t1.timestamp = t.timestamp
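
Before running the update, it may help to check the intended logic outside the database. The sketch below is only an illustration in plain Python, not part of the SQL above: it flags a gap when a row is more than 30 minutes after the previous row of the same session, keeps a running gap count, and builds the new id as session_id * 100 + count. The names split_sessions and threshold are hypothetical, and the sample timestamps drop the +01 offset for simplicity.

```python
from datetime import datetime, timedelta


def split_sessions(rows, threshold=timedelta(minutes=30)):
    """Return (new_session_id, timestamp) pairs for (session_id, timestamp) rows.

    Hypothetical helper: rows are ordered by timestamp within each session,
    gaps larger than `threshold` are counted, and the running gap count is
    appended to the original id as two digits (session_id * 100 + count).
    """
    result = []
    prev_ts = {}  # last timestamp seen per original session_id
    gaps = {}     # running gap count per original session_id
    for session_id, ts in sorted(rows):
        last = prev_ts.get(session_id)
        if last is not None and ts - last > threshold:
            gaps[session_id] = gaps.get(session_id, 0) + 1
        prev_ts[session_id] = ts
        n = gaps.get(session_id, 0)
        # Rows before the first gap keep the original id.
        result.append((session_id * 100 + n if n else session_id, ts))
    return result


# Sample rows from the question (time zone offset omitted).
sample = [
    (20070920074804, datetime(2007, 9, 20, 7, 48, 4)),
    (20070920074804, datetime(2007, 9, 20, 8, 7, 9)),
    (20070920074804, datetime(2007, 9, 20, 8, 7, 10)),
    (20070920074804, datetime(2007, 9, 20, 14, 3, 50)),
    (20070920074804, datetime(2007, 9, 20, 14, 4, 59)),
    (20070920074804, datetime(2007, 9, 20, 14, 5, 1)),
]

new_ids = [sid for sid, _ in split_sessions(sample)]
print(new_ids)
```

On the sample rows, the first three keep 20070920074804 and the last three become 2007092007480401, since the only gap longer than 30 minutes is between 08:07:10 and 14:03:50. Note that the two-digit suffix assumes fewer than 100 gaps per original session.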
