Assign a unique value to a new column based on conditions
I have a dataset that summarises car trips, but it does not identify how many unique cars there are. I would like to create a loop/if statement that assigns a unique number to each trip, based on the location and time at which the trip starts, so that I can estimate the approximate number of unique cars.
For example, if the dropoff location of the first car matches the pickup location of the second car, and the pickup happens within 2 minutes of the dropoff, assign the same car number as the first car. If they are completely different, assign a new number.
I tried different options but can't make it work (I'm a beginner). Any help is greatly appreciated. (R or Python)
This is roughly what I have:
Pickup time | Dropoff time | Pickup location | Dropoff location |
---|---|---|---|
2016-06-09 21:06:36 | 2016-06-09 21:13:08 | A | B |
2016-06-09 21:13:31 | 2016-06-09 21:23:59 | A | C |
2016-06-09 21:13:45 | 2016-06-09 21:26:29 | B | C |
2016-06-09 21:15:33 | 2016-06-09 21:44:31 | A | B |
2016-06-09 21:24:49 | 2016-06-09 21:39:29 | C | D |
This is what I would like to achieve:
Pickup time | Dropoff time | Pickup location | Dropoff location | Car # |
---|---|---|---|---|
2016-06-09 21:06:36 | 2016-06-09 21:13:08 | A | B | 1 |
2016-06-09 21:13:31 | 2016-06-09 21:23:59 | A | C | 2 |
2016-06-09 21:13:45 | 2016-06-09 21:26:29 | B | C | 1 |
2016-06-09 21:15:33 | 2016-06-09 21:44:31 | A | B | 3 |
2016-06-09 21:24:49 | 2016-06-09 21:39:29 | C | D | 2 |
Here is a data.table approach, using a threshold of 120 seconds:
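library(data.table)
library(magrittr)  # provides the %>% pipe used below

# Set threshold (in seconds)
threshold <- 120

# Number the trips, then self-join so that each trip is paired with every
# trip that picks up where it dropped off
pairs <- setDT(df)[, trip := .I][df, on = .(`Dropoff location` = `Pickup location`), nomatch = 0] %>%
  # keep pairs whose pickup follows the dropoff by 0 to `threshold` seconds;
  # units = "secs" stops difftime from silently switching to minutes
  .[between(as.numeric(difftime(`i.Pickup time`, `Dropoff time`, units = "secs")), 0, threshold), .(trip, i.trip)] %>%
  # each matched pair of trips is treated as one car
  .[, car := .I]

# Get the car identifier: one row per (car, trip), joined back onto the trips
result <- melt(pairs, id.vars = "car", value.name = "trip")[, variable := NULL][df, on = "trip"]

# Add any other single-instance cars
result[is.na(car), car := seq(max(result$car, na.rm = TRUE) + 1, length.out = result[is.na(car), .N])]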
Output:
     car  trip         Pickup time        Dropoff time Pickup location Dropoff location
   <int> <int>              <POSc>              <POSc>          <char>           <char>
1:     1     1 2016-06-09 21:06:36 2016-06-09 21:13:08               A                B
2:     2     2 2016-06-09 21:13:31 2016-06-09 21:23:59               A                C
3:     1     3 2016-06-09 21:13:45 2016-06-09 21:26:29               B                C
4:     3     4 2016-06-09 21:15:33 2016-06-09 21:44:31               A                B
5:     2     5 2016-06-09 21:24:49 2016-06-09 21:39:29               C                D
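To see why trips 1 and 3 share a car, and likewise trips 2 and 5, it can help to compute the dropoff-to-pickup gaps by hand. The following is only an illustrative check in plain Python (the question allows R or Python). The helper name `gap_seconds` is made up for this example and is not part of the answer's code.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def gap_seconds(dropoff, pickup):
    """Seconds from one trip's dropoff to another trip's pickup.

    The result is negative when the pickup happens before the dropoff.
    """
    delta = datetime.strptime(pickup, FMT) - datetime.strptime(dropoff, FMT)
    return delta.total_seconds()

# Trip 1 drops off at B, trip 3 picks up at B.
gap_1_3 = gap_seconds("2016-06-09 21:13:08", "2016-06-09 21:13:45")
# Trip 2 drops off at C, trip 5 picks up at C.
gap_2_5 = gap_seconds("2016-06-09 21:23:59", "2016-06-09 21:24:49")
# Trip 3 also drops off at C, but only after trip 5 has already started.
gap_3_5 = gap_seconds("2016-06-09 21:26:29", "2016-06-09 21:24:49")

print(gap_1_3, gap_2_5, gap_3_5)  # 37.0 50.0 -100.0
```

The first two gaps fall between 0 and the 120-second threshold, so those pairs are matched. The third is negative, so trip 3's car cannot be the one that serves trip 5.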
Input:
df <- structure(list(`Pickup time` = structure(c(1465506396, 1465506811,
1465506825, 1465506933, 1465507489), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `Dropoff time` = structure(c(1465506788, 1465507439,
1465507589, 1465508671, 1465508369), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), `Pickup location` = c("A", "A", "B", "A",
"C"), `Dropoff location` = c("B", "C", "C", "B", "D")), row.names = c(NA,
-5L), class = "data.frame")
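Since the question also accepts Python, here is a minimal sketch of the rule as stated in the question, using only the standard library. It is a swapped-in greedy approach, not a translation of the data.table code above: trips are processed in pickup order, and each trip reuses the car that most recently dropped off at its pickup location within the threshold, otherwise it gets a new number. The function name `assign_cars` and the tie-breaking choice (most recently freed car) are assumptions made for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (pickup time, dropoff time, pickup location, dropoff location), as in the question
trips = [
    ("2016-06-09 21:06:36", "2016-06-09 21:13:08", "A", "B"),
    ("2016-06-09 21:13:31", "2016-06-09 21:23:59", "A", "C"),
    ("2016-06-09 21:13:45", "2016-06-09 21:26:29", "B", "C"),
    ("2016-06-09 21:15:33", "2016-06-09 21:44:31", "A", "B"),
    ("2016-06-09 21:24:49", "2016-06-09 21:39:29", "C", "D"),
]

def assign_cars(trips, threshold_seconds=120):
    """Return one car number per trip, in the order the trips were given."""
    threshold = timedelta(seconds=threshold_seconds)
    parsed = [
        (datetime.strptime(p, FMT), datetime.strptime(d, FMT), p_loc, d_loc)
        for p, d, p_loc, d_loc in trips
    ]
    # Process trips chronologically by pickup time.
    order = sorted(range(len(parsed)), key=lambda i: parsed[i][0])

    last = {}  # car number -> (dropoff time, dropoff location) of its latest trip
    cars = [None] * len(parsed)
    next_car = 1
    for i in order:
        pickup, dropoff, pickup_loc, dropoff_loc = parsed[i]
        # Cars that ended at this pickup location 0..threshold seconds earlier.
        candidates = [
            car for car, (t, loc) in last.items()
            if loc == pickup_loc and timedelta(0) <= pickup - t <= threshold
        ]
        if candidates:
            # Assumption: prefer the car that became free most recently.
            car = max(candidates, key=lambda c: last[c][0])
        else:
            car = next_car
            next_car += 1
        cars[i] = car
        last[car] = (dropoff, dropoff_loc)
    return cars

print(assign_cars(trips))  # [1, 2, 1, 3, 2]
```

Because each car's latest dropoff is tracked, a car that makes three or more consecutive trips keeps the same number. As in the question, the result is only an approximation of the true number of cars.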