
Assign a unique value to a new column based on conditions

I have a dataset that summarises car trips, but it does not identify how many unique cars there are. I would like to write a loop or if statement that assigns a car number to each trip based on where and when the trip starts, so that I can estimate the approximate number of unique cars.

For example, if the drop-off location of one trip matches the pickup location of a later trip, and that pickup happens within 2 minutes of the drop-off, the later trip should get the same car number as the earlier one. If there is no such match, assign a new number.

I have tried different options but can't make it work (I'm a beginner). Any help is greatly appreciated. (R or Python)

This is roughly what I have:

Pickup time          Dropoff time         Pickup location  Dropoff location
2016-06-09 21:06:36  2016-06-09 21:13:08  A                B
2016-06-09 21:13:31  2016-06-09 21:23:59  A                C
2016-06-09 21:13:45  2016-06-09 21:26:29  B                C
2016-06-09 21:15:33  2016-06-09 21:44:31  A                B
2016-06-09 21:24:49  2016-06-09 21:39:29  C                D

This is what I would like to achieve:

Pickup time          Dropoff time         Pickup location  Dropoff location  Car #
2016-06-09 21:06:36  2016-06-09 21:13:08  A                B                 1
2016-06-09 21:13:31  2016-06-09 21:23:59  A                C                 2
2016-06-09 21:13:45  2016-06-09 21:26:29  B                C                 1
2016-06-09 21:15:33  2016-06-09 21:44:31  A                B                 3
2016-06-09 21:24:49  2016-06-09 21:39:29  C                D                 2

Here is a data.table approach, using a threshold of 120 seconds:

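library(data.table)
library(magrittr)  # provides the %>% pipe used below

# Set threshold (in seconds)
threshold = 120

# Number the trips, then self-join so that each drop-off is paired with
# every pick-up at the same location
setDT(df)[, trip := .I]
matched = df[df, on = .(`Dropoff location` = `Pickup location`), nomatch = 0] %>%
  # keep pairs where the pick-up follows the drop-off by at most `threshold`
  # seconds; units = "secs" stops difftime from switching to minutes
  .[between(as.numeric(difftime(`i.Pickup time`, `Dropoff time`, units = "secs")), 0, threshold), .(trip, i.trip)] %>%
  # each matched pair of trips gets one car identifier
  .[, car := .I]

# Reshape to one row per trip and attach the car identifier to the original data
result = melt(matched, id.vars = "car", value.name = "trip")[, variable := NULL][df, on = "trip"]

# add any other single-instance cars
result[is.na(car), car := seq(max(result$car, na.rm = TRUE) + 1, length.out = result[is.na(car), .N])]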

Output:

     car  trip         Pickup time        Dropoff time Pickup location Dropoff location
   <int> <int>              <POSc>              <POSc>          <char>           <char>
1:     1     1 2016-06-09 21:06:36 2016-06-09 21:13:08               A                B
2:     2     2 2016-06-09 21:13:31 2016-06-09 21:23:59               A                C
3:     1     3 2016-06-09 21:13:45 2016-06-09 21:26:29               B                C
4:     3     4 2016-06-09 21:15:33 2016-06-09 21:44:31               A                B
5:     2     5 2016-06-09 21:24:49 2016-06-09 21:39:29               C                D

Input:

structure(list(`Pickup time` = structure(c(1465506396, 1465506811, 
1465506825, 1465506933, 1465507489), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `Dropoff time` = structure(c(1465506788, 1465507439, 
1465507589, 1465508671, 1465508369), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `Pickup location` = c("A", "A", "B", "A", 
"C"), `Dropoff location` = c("B", "C", "C", "B", "D")), row.names = c(NA, 
-5L), class = "data.frame")

