Assign unique values to a new column based on conditions

Question

I have a dataset that summarises car trips, but it does not identify how many unique cars there are. I would like to create a loop/if statement that assigns a unique number based on the location and time a trip starts, to figure out an approximate number of unique cars.

So for example, if the dropoff location of the first car matches the pickup location of the second car, and the second pickup is within 2 minutes of the first dropoff, assign the same car number as the first car. If completely different, assign a new number.

I tried different options but can't make it work (I'm a beginner). Any help is greatly appreciated. (R or Python)

This is roughly what I have:

Pickup time	Dropoff time	Pickup location	Dropoff location
2016-06-09 21:06:36	2016-06-09 21:13:08	A	B
2016-06-09 21:13:31	2016-06-09 21:23:59	A	C
2016-06-09 21:13:45	2016-06-09 21:26:29	B	C
2016-06-09 21:15:33	2016-06-09 21:44:31	A	B
2016-06-09 21:24:49	2016-06-09 21:39:29	C	D

This is what I would like to achieve:

Pickup time	Dropoff time	Pickup location	Dropoff location	Car #
2016-06-09 21:06:36	2016-06-09 21:13:08	A	B	1
2016-06-09 21:13:31	2016-06-09 21:23:59	A	C	2
2016-06-09 21:13:45	2016-06-09 21:26:29	B	C	1
2016-06-09 21:15:33	2016-06-09 21:44:31	A	B	3
2016-06-09 21:24:49	2016-06-09 21:39:29	C	D	2

Answer 1

Here is a data.table approach, using a threshold of 120 seconds:

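library(data.table)
library(magrittr)  # provides the %>% pipe used below

# Set threshold (in seconds)
threshold <- 120

# Get the car identifier: self-join trips on dropoff location = pickup location,
# keep pairs where the next pickup is 0 to `threshold` seconds after the dropoff,
# and give each matched pair one car number.
# difftime(units = "secs") pins the unit; plain subtraction picks it automatically.
result <- melt(
  setDT(df)[, trip := .I][df, on = .(`Dropoff location` = `Pickup location`), nomatch = 0] %>%
    .[between(as.numeric(difftime(`i.Pickup time`, `Dropoff time`, units = "secs")), 0, threshold), .(trip, i.trip)] %>%
    .[, car := .I], id.vars = "car", value.name = "trip"
)[, variable := NULL][df, on = "trip"]

# Add any other single-instance cars
result[is.na(car), car := seq(max(result$car, na.rm = TRUE) + 1, length.out = result[is.na(car), .N])]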

Output:

     car  trip         Pickup time        Dropoff time Pickup location Dropoff location
   <int> <int>              <POSc>              <POSc>          <char>           <char>
1:     1     1 2016-06-09 21:06:36 2016-06-09 21:13:08               A                B
2:     2     2 2016-06-09 21:13:31 2016-06-09 21:23:59               A                C
3:     1     3 2016-06-09 21:13:45 2016-06-09 21:26:29               B                C
4:     3     4 2016-06-09 21:15:33 2016-06-09 21:44:31               A                B
5:     2     5 2016-06-09 21:24:49 2016-06-09 21:39:29               C                D

Input:

structure(list(`Pickup time` = structure(c(1465506396, 1465506811, 
1465506825, 1465506933, 1465507489), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `Dropoff time` = structure(c(1465506788, 1465507439, 
1465507589, 1465508671, 1465508369), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `Pickup location` = c("A", "A", "B", "A", 
"C"), `Dropoff location` = c("B", "C", "C", "B", "D")), row.names = c(NA, 
-5L), class = "data.frame")
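
Since the question also allows Python, here is a minimal sketch of the same matching rule using only the standard library. It is not a line-by-line translation of the data.table code: it walks through the trips in pickup order and reuses the first car whose last dropoff was at the same location no more than 120 seconds earlier, so one car can chain more than two trips. The function name `assign_cars` and the tuple layout are assumptions made for illustration, and the greedy first-match choice is only one reasonable reading of the rule.

```python
from datetime import datetime, timedelta


def assign_cars(trips, threshold=timedelta(seconds=120)):
    """Return a car number for each trip.

    trips: list of (pickup_time, dropoff_time, pickup_loc, dropoff_loc)
           tuples, assumed to be sorted by pickup time.
    """
    cars = []     # per car: (time of its last dropoff, location of that dropoff)
    numbers = []  # car number assigned to each trip, in input order

    for pickup_time, dropoff_time, pickup_loc, dropoff_loc in trips:
        match = None
        for idx, (free_at, free_loc) in enumerate(cars):
            gap = pickup_time - free_at
            # Same location, and the pickup is 0..threshold after the dropoff
            if free_loc == pickup_loc and timedelta(0) <= gap <= threshold:
                match = idx
                break

        if match is None:
            # No car fits: start a new one
            cars.append((dropoff_time, dropoff_loc))
            match = len(cars) - 1
        else:
            # Reuse the car and record where and when it becomes free again
            cars[match] = (dropoff_time, dropoff_loc)

        numbers.append(match + 1)  # car numbers start at 1

    return numbers


fmt = "%Y-%m-%d %H:%M:%S"
rows = [
    ("2016-06-09 21:06:36", "2016-06-09 21:13:08", "A", "B"),
    ("2016-06-09 21:13:31", "2016-06-09 21:23:59", "A", "C"),
    ("2016-06-09 21:13:45", "2016-06-09 21:26:29", "B", "C"),
    ("2016-06-09 21:15:33", "2016-06-09 21:44:31", "A", "B"),
    ("2016-06-09 21:24:49", "2016-06-09 21:39:29", "C", "D"),
]
trips = [
    (datetime.strptime(p, fmt), datetime.strptime(d, fmt), a, b)
    for p, d, a, b in rows
]
print(assign_cars(trips))  # prints [1, 2, 1, 3, 2]
```

On the sample data this reproduces the Car # column from the question. The trips need to be sorted by pickup time first, otherwise the greedy pass can miss a car that was already free.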
