Modifying the id value of a row based on an interval time threshold in a datetime column
I am working on the geolife dataset, which contains timestamped GPS tracks of users in text files (.plt). Each text file contains the user's GPS points for one trip. I therefore imported the dataset into postgres using a Python script.
Because the files are named with a string of digits according to the trip's start time (for example, the file containing the trip in the table below is 20070920074804.plt), I use the file name (without the extension) as the trip id (session_id). This is the raw GPS data in the trajectories table:
user_id | session_id | timestamp | lat | lon | alt
---------+-------------------+------------------------+-----------+------------+-----
11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71
11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87
11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87
11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62
11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62
11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62
For analysis purposes, I created another table, trips_metrics, where I compute trip metrics from the trajectories table and insert the results. Among the values I compute are trip distance (haversine) and duration (end time - start time).
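For readers unfamiliar with these two metrics, here is a minimal Python sketch of how they might be computed for one trip. It is an illustration only, not the script used in the question: the function names haversine_m and trip_metrics, the Earth radius constant, and the in-memory list of points are all assumptions, and the timezone offset is dropped because every sample row shares it.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed value)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def trip_metrics(points):
    """points: list of (timestamp, lat, lon) tuples sorted by timestamp."""
    # Sum the distances between each pair of consecutive points.
    distance = sum(
        haversine_m(p[1], p[2], q[1], q[2]) for p, q in zip(points, points[1:])
    )
    # Duration is the last timestamp minus the first one.
    duration = points[-1][0] - points[0][0]
    return distance, duration


# The six sample rows from the trajectories table above.
points = [
    (datetime(2007, 9, 20, 7, 48, 4), 28.19737, 113.006795),
    (datetime(2007, 9, 20, 8, 7, 9), 28.197685, 113.006792),
    (datetime(2007, 9, 20, 8, 7, 10), 28.197685, 113.00679),
    (datetime(2007, 9, 20, 14, 3, 50), 28.197342, 113.007422),
    (datetime(2007, 9, 20, 14, 4, 59), 28.197108, 113.00734),
    (datetime(2007, 9, 20, 14, 5, 1), 28.197088, 113.00727),
]
distance_m, duration = trip_metrics(points)
print(round(distance_m), duration)
```

Because the duration is simply the last timestamp minus the first, any long pause inside a file is counted as travel time, which is the effect described next.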
Then I noticed something strange: a user took 8hrs for a trip but covered a distance of only 321m. Going through the trip file thoroughly, I noticed a jump in the trip's timestamps, suggesting a break in the trip (possibly the user stayed somewhere for hours and then continued). An example is between row 3 and row 4 in the table above.
To get an accurate trip time, I need to split trips in these cases, so that if the time interval between consecutive rows exceeds 30 minutes, the later rows are considered a new trip (and thus get a new ID).
I intend to append digits (..02, ..03, ...) to the trip's current session_id in my trajectories table before actually computing the trip metrics (i.e. modifying the trajectories table). So for the example in the table above, I want to split it this way:
user_id | session_id | timestamp | lat | lon | alt
---------+-------------------+------------------------+-----------+------------+-----
11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71
11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87
11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87
11 | 2007092007480402 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62
11 | 2007092007480402 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62
11 | 2007092007480402 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62
Notice how I assign the session_id for the new trip (since the time in between is more than 30 minutes).
How can I make this modification to my raw GPS table (trajectories) in postgres?
EDIT
A: The first query in the answer from @GMB works; however, it gives each row a new session_id in the new_session_id column.
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id | session_id | timestamp | lat | lon | alt | is_gap | new_session_id |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| 11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71 | | 20070920074804 |
| 11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87 | 1 | 2007092007480402 |
| 11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62 | 1 | 2007092007480403 |
| 11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62 | 1 | 2007092007480404 |
| 11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62 | 1 | 2007092007480405 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
Expected Result:
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| user_id | session_id | timestamp | lat | lon | alt | is_gap | new_session_id |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
| 11 | 20070920074804 | 2007-09-20 07:48:04+01 | 28.19737 | 113.006795 | 71 | | 20070920074804 |
| 11 | 20070920074804 | 2007-09-20 08:07:09+01 | 28.197685 | 113.006792 | 87 | | 20070920074804 |
| 11 | 20070920074804 | 2007-09-20 08:07:10+01 | 28.197685 | 113.00679 | 87 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 14:03:50+01 | 28.197342 | 113.007422 | 62 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 14:04:59+01 | 28.197108 | 113.00734 | 62 | 1 | 2007092007480401 |
| 11 | 20070920074804 | 2007-09-20 14:05:01+01 | 28.197088 | 113.00727 | 62 | 1 | 2007092007480401 |
+---------+----------------+------------------------+-----------+------------+-----+--------+------------------+
The idea is to give the "emerging" trip a new id of old_session_id + 01. If another emerging trip is encountered, it should be assigned old_session_id + 02, and so on.
B: The second query, with the update option, contains a syntax error:
update trajectories t
from (
select
t.*,
case when sum(is_gap) over(partition by session_id order by timestamp) > 0
then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
else session_id
end new_session_id
from (
select
t.*,
(timestamp > lag(timestamp) over(partition by session_id order by timestamp))::int is_gap
from trajectories t
) t
) t1
set session_id = t1.new_session_id
where t1.session_id = t.session_id and t1.timestamp = t.timestamp
ERROR: syntax error at or near "from"
LINE 2: from (
You can use lag(), a cumulative sum to identify the segments, and then some way of munging the session_id:
select (case when grp >= 1 then session_id * 100 + grp
             else session_id
        end) as new_session_id,
       t.*
from (select t.*,
             -- running count of gaps longer than 30 minutes within each session
             count(*) filter (where prev_ts < timestamp - interval '30 minute') over (partition by session_id order by timestamp) as grp
      from (select t.*,
                   -- timestamp of the previous row in the same session
                   lag(timestamp) over (partition by session_id order by timestamp) as prev_ts
            from trajectories t
           ) t
     ) t;
This is a gaps-and-islands problem. You want to detect consecutive rows whose timestamps differ by more than 30 minutes, and then change the session_id accordingly.
An option is to use lag(), and then a cumulative count of the gaps; you can then use that information to compute the new session_id:
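select
    t.*,
    case when sum(is_gap) over(partition by session_id order by timestamp) > 0
        then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
        else session_id
    end new_session_id
from (
    select
        t.*,
        -- 1 when the row is more than 30 minutes after the previous row of the same session;
        -- without the interval, every row after the first would be flagged as a gap
        (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap
    from trajectories t
) t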
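To make the lag-plus-running-count idea easier to follow, here is a small Python sketch that traces the same logic on the six sample rows from the question. It is an illustration under assumptions, not part of the answer's SQL: the function name split_sessions and the in-memory row layout are hypothetical, and the rows are assumed to be already sorted by session_id and then by timestamp.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # threshold stated in the question


def split_sessions(rows, gap=GAP):
    """rows: list of (session_id, timestamp), sorted by session_id then timestamp.

    Returns one new session id per input row.
    """
    new_ids = []
    prev_session, prev_ts, gaps = None, None, 0
    for session_id, ts in rows:
        if session_id != prev_session:
            gaps = 0  # new partition: restart the running count of gaps
        elif ts > prev_ts + gap:
            gaps += 1  # more than 30 minutes since the previous row
        # Same arithmetic as the query: append a two-digit gap counter.
        new_ids.append(session_id * 100 + gaps if gaps else session_id)
        prev_session, prev_ts = session_id, ts
    return new_ids


rows = [
    (20070920074804, datetime(2007, 9, 20, 7, 48, 4)),
    (20070920074804, datetime(2007, 9, 20, 8, 7, 9)),
    (20070920074804, datetime(2007, 9, 20, 8, 7, 10)),
    (20070920074804, datetime(2007, 9, 20, 14, 3, 50)),
    (20070920074804, datetime(2007, 9, 20, 14, 4, 59)),
    (20070920074804, datetime(2007, 9, 20, 14, 5, 1)),
]
new_ids = split_sessions(rows)
for (session_id, ts), new_id in zip(rows, new_ids):
    print(session_id, ts, new_id)
```

With this scheme the first three rows keep 20070920074804 and the last three get 2007092007480401, so the first split-off trip is numbered 01. Note that session_id * 100 leaves room for at most 99 gaps per original trip.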
You can turn this into an update statement if needed:
update trajectories t
set session_id = t1.new_session_id
from (
    select
        t.*,
        case when sum(is_gap) over(partition by session_id order by timestamp) > 0
            then session_id * 100 + sum(is_gap) over(partition by session_id order by timestamp)
            else session_id
        end new_session_id
    from (
        select
            t.*,
            -- 1 when the row is more than 30 minutes after the previous row of the same session
            (timestamp > lag(timestamp) over(partition by session_id order by timestamp) + interval '30 minute')::int is_gap
        from trajectories t
    ) t
) t1
-- match each computed row back to its source row
where t1.session_id = t.session_id and t1.timestamp = t.timestamp
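As a quick sanity check on the effect of such a split, the following Python sketch recomputes the duration per session id for the sample rows, using the ids the question expects after splitting. The function name durations_by_session and the hard-coded rows are assumptions made for illustration; in practice the rows would come from the updated trajectories table.

```python
from datetime import datetime


def durations_by_session(rows):
    """rows: iterable of (session_id, timestamp).

    Returns {session_id: last timestamp - first timestamp}.
    """
    spans = {}
    for session_id, ts in rows:
        first, last = spans.get(session_id, (ts, ts))
        spans[session_id] = (min(first, ts), max(last, ts))
    return {sid: last - first for sid, (first, last) in spans.items()}


# Sample rows with the session ids as they would look after the split.
rows = [
    (20070920074804, datetime(2007, 9, 20, 7, 48, 4)),
    (20070920074804, datetime(2007, 9, 20, 8, 7, 9)),
    (20070920074804, datetime(2007, 9, 20, 8, 7, 10)),
    (2007092007480401, datetime(2007, 9, 20, 14, 3, 50)),
    (2007092007480401, datetime(2007, 9, 20, 14, 4, 59)),
    (2007092007480401, datetime(2007, 9, 20, 14, 5, 1)),
]
durations = durations_by_session(rows)
for session_id, duration in durations.items():
    print(session_id, duration)
```

The long pause between the third and fourth rows no longer counts towards either trip's duration.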