Extracting values from rows which meet a condition in R
I have a big data frame with millions of rows and more than 20 columns. Let me first describe the data to make the question clearer. The original data frame consists of locations, velocities and accelerations of 2169 vehicles during a 15-minute period. Each vehicle has a unique Vehicle.ID, an ID of the time frame in which it was observed (Frame.ID), the velocity of the vehicle in that frame (svel), the acceleration of the vehicle in that frame (sacc), and the class of the vehicle (vehicle.class), i.e. 1 = motorcycle, 2 = car, 3 = truck. These variables were recorded every 0.1 seconds, i.e. each frame is 0.1 seconds. Here are the first 6 rows:
> dput(head(df))
structure(list(Vehicle.ID = c(2L, 2L, 2L, 2L, 2L, 2L), Frame.ID = 133:138,
Vehicle.class = c(2L, 2L, 2L, 2L, 2L, 2L), Lane = c(2L, 2L,
2L, 2L, 2L, 2L), svel = c(37.29, 37.11, 36.96, 36.83, 36.73,
36.64), sacc = c(0.07, 0.11, 0.15, 0.19, 0.22, 0.25)), .Names = c("Vehicle.ID",
"Frame.ID", "Vehicle.class", "Lane", "svel", "sacc"), row.names = 7750:7755, class = "data.frame")
At some points during the 15-minute recording period, vehicles come to a complete stop, i.e. svel==0. This continues for some frames and then the vehicles gain speed again. For the purpose of reproducibility I am creating an example data set as follows:
x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
svel = rep(c(1,0,0,0,3),5),
sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5))
As described above, some vehicles stop and have zero velocity for some time but later accelerate to get up to speed. I want to find the acceleration, sacc, they apply after having zero velocity for some time (moving from a standstill). This means that I should be able to look at the FIRST row AFTER the last frame in which svel==0. In the example data, this means that the car (vehicle.class==2) with Vehicle.ID==10 had a velocity, svel, equal to 1, as seen in the first row. Later, it stopped for 3 frames (3 consecutive rows) and then accelerated to a velocity, svel, of 3. I want the acceleration, sacc, it applied in those 2 frames (rows 4 and 5 for vehicle 10, which come out to be 0.002 and 0.500). This means that, for the example data, the following should be the output by vehicle.class:
output <- data.frame(Vehicle.ID = c(10,10,20,20,30,30,40,40,50, 50),
vehicle.class = c(2,2,2,2,3,3,3,3,1,1),
xf = rep(c('l','f'),5),
sacc = rep(c(0.002,0.500),5))
xf identifies the last row, l, in which svel==0, and f is the first row after that. I have tried using plyr and a for loop to split by vehicle.class, but I am not sure how to extract the sacc.
xf should be a part of the output. It is not in the given data. The original data frame df has 2169 vehicles; some stopped and some did not, so not all vehicles had svel==0. Moreover, the number of rows with svel==0 differs from vehicle to vehicle.

There may be a more elegant way to do this, but this works:
require(data.table)
x <- data.table(x) ## much easier as data.table
x[, xf:='n'] ## create vector with 'n', neither first nor last
# create diff(svel) shifted upwards,
# padding last observation with 0 to avoid cycling
x[, dsvel:=c(diff(svel, lag=1), 0), by=Vehicle.ID]
# svel is zero and dsvel positive at the last 0 value
x[svel==0 & dsvel > 0, xf:='l']
# there may be a better way to do this part
# get index of observation next to 'l'
# there is no risk of spilling to next Vehicle.ID,
# because 'l' can only be second to last
i <- which(x$xf=='l') + 1
x[i, xf:='f']
That should give you the xf vector you want.

Edit from Arun: +1 @ilir, a very nice answer. Here's another way you could do it with the use of data.table's inbuilt variables .I and .N:
idx = x[, {
ix = tail(.I[svel==0L], 1);
iy = (ix+1L)*((ix+1L) <= .I[.N] | NA)
list(idx = c(ix, iy))
}, by = list(Vehicle.ID, vehicle.class)]$idx
You can now subset with idx and add l and f with := as follows:
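# recent data.table versions do not recycle a length-2 RHS in :=, so repeat it explicitly
ans <- x[idx][, xf := rep(c("l", "f"), length.out = .N)]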
    Vehicle.ID vehicle.class svel  sacc xf
 1:         10             2    0 0.002  l
 2:         10             2    3 0.500  f
 3:         20             2    0 0.002  l
 4:         20             2    3 0.500  f
 5:         30             3    0 0.002  l
 6:         30             3    3 0.500  f
 7:         40             3    0 0.002  l
 8:         40             3    3 0.500  f
 9:         50             1    0 0.002  l
10:         50             1    3 0.500  f
.I contains the row numbers of x for each group. .N contains the number of observations for each group. Please read ?data.table for more.
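As a small illustration of those two symbols, here is a minimal sketch on a made-up table (the names dt, g and v are hypothetical and not part of the question's data). It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical toy table: two groups of sizes 2 and 3
dt <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# Within each group, .I holds that group's row numbers in dt
# and .N holds the number of rows in the group
res <- dt[, .(first_row = .I[1], last_row = .I[.N], n = .N), by = g]
print(res)
```

Here .I[.N] is the row number of the last row of each group, which is the quantity the answer compares ix + 1L against.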
ix gets the last occurrence of the 0. We subset the row number corresponding to the last 0, for each group, using tail.

iy normally should be the next entry, ix+1L. But since the 0 may be the last entry for some group, we check whether that is so by comparing (ix+1L) <= .I[.N]. If it's FALSE, that means ix is the last entry and so we have to output NA; otherwise we have to output (ix+1L).
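The NA handling in iy relies on how R evaluates logical OR with NA: TRUE | NA is TRUE, while FALSE | NA is NA. A minimal base R sketch of that step, using a hypothetical helper next_or_na() that is not part of the original answer:

```r
# Hypothetical helper mirroring the iy expression:
# returns ix + 1 if that row still lies within the group, otherwise NA
next_or_na <- function(ix, last_row) {
  (ix + 1L) * ((ix + 1L) <= last_row | NA)
}

inside <- next_or_na(4L, 5L)  # TRUE | NA is TRUE, so the result is ix + 1
at_end <- next_or_na(5L, 5L)  # FALSE | NA is NA, so the result is NA

print(inside)
print(at_end)
```

Multiplying by TRUE leaves the index unchanged, and multiplying by NA turns it into NA, which is what lets the expression work without an explicit if/else.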
HTH.
I think I've come up with a reasonably elegant way of expressing the problem with dplyr. For each car, we're interested in the rows where it is not stopped but was stopped in the previous row:
library(dplyr)
# tbl_df() and %.% are no longer available in current dplyr; tibble() and %>% replace them
df <- tibble(
id = c(rep(10, 5), rep(20, 5), rep(30, 5), rep(40, 5), rep(50, 5)),
class = c(rep(2, 10), rep(3, 10), rep(1, 5)),
svel = rep(c(1, 0, 0, 0, 3), 5),
sacc = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
)
df %>% group_by(id) %>%
mutate(stopped = svel == 0) %>%
filter(lag(stopped) == TRUE, stopped == FALSE)
#> Source: local data frame [5 x 5]
#> Groups: id
#>
#> id class svel sacc stopped
#> 1 10 2 3 0.5 FALSE
#> 2 20 2 3 0.5 FALSE
#> 3 30 3 3 0.5 FALSE
#> 4 40 3 3 0.5 FALSE
#> 5 50 1 3 0.5 FALSE
You could write this a little more compactly as
df %>% group_by(id) %>%
mutate(stopped = svel == 0) %>%
filter(lag(stopped), !stopped)
#> Source: local data frame [5 x 5]
#> Groups: id
#>
#> id class svel sacc stopped
#> 1 10 2 3 0.5 FALSE
#> 2 20 2 3 0.5 FALSE
#> 3 30 3 3 0.5 FALSE
#> 4 40 3 3 0.5 FALSE
#> 5 50 1 3 0.5 FALSE
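The filter above keeps only the first moving row after a stop. If the last stopped row is wanted as well, together with an xf label as in the question, one possible extension is sketched below. It assumes a current dplyr with the %>% pipe, and it labels the end of every stop episode, which is an interpretation of the question rather than something the answer states.

```r
library(dplyr)

df <- tibble(
  id    = rep(c(10, 20, 30, 40, 50), each = 5),
  class = c(rep(2, 10), rep(3, 10), rep(1, 5)),
  svel  = rep(c(1, 0, 0, 0, 3), 5),
  sacc  = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    stopped = svel == 0,
    xf = case_when(
      # stopped now, moving in the next row of the same vehicle
      stopped & !lead(stopped, default = TRUE) ~ "l",
      # moving now, stopped in the previous row of the same vehicle
      !stopped & lag(stopped, default = FALSE) ~ "f"
    )
  ) %>%
  ungroup() %>%
  filter(!is.na(xf)) %>%
  select(id, class, xf, sacc)

print(res)
```

The default arguments keep a stop at the very end of a vehicle's record from being labelled l, since no f row would follow it.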
Not sure I totally understand the question, but I think this is what you are after:
x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
svel = rep(c(1,0,0,0,3),5),
sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5)
)
# find "l" rows, the last row for a given Vehicle.ID where svel==0
l <- rep(FALSE, nrow(x))
l[x$svel==0] <- !duplicated(x$Vehicle.ID[x$svel==0], fromLast=TRUE)
# extract all rows following an l row.
x[which(l) + 1, c(1, 2, 4)]
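The which(l) + 1 step above can pick up a row from the next vehicle if a vehicle's record ends while it is still stopped. A base R sketch that guards against this and also attaches the xf labels is given below. It makes two assumptions: rows are ordered by vehicle and then by frame, and the end of every stop episode is flagged, not only the last one per vehicle. The helper names same_next, next_stopped and l_idx are illustrative only.

```r
x <- data.frame(
  Vehicle.ID    = rep(c(10, 20, 30, 40, 50), each = 5),
  vehicle.class = c(rep(2, 10), rep(3, 10), rep(1, 5)),
  svel          = rep(c(1, 0, 0, 0, 3), 5),
  sacc          = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
)

n <- nrow(x)
stopped <- x$svel == 0

# TRUE where the following row belongs to the same vehicle
same_next <- c(x$Vehicle.ID[-n] == x$Vehicle.ID[-1], FALSE)
# whether the following row is also stopped (FALSE past the end)
next_stopped <- c(stopped[-1], FALSE)

# "l" rows: stopped now, moving in the next row of the same vehicle
l_idx <- which(stopped & same_next & !next_stopped)

# interleave each "l" row with the "f" row that follows it
idx <- as.vector(rbind(l_idx, l_idx + 1))
out <- x[idx, c("Vehicle.ID", "vehicle.class", "sacc")]
out$xf <- rep(c("l", "f"), length(l_idx))
print(out)
```

On the example data this yields the 10 rows asked for in the question, with sacc alternating between 0.002 and 0.500.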
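library(data.table)
x = data.table(x)
# assumes x already has an xf column marking the 'f' rows
output = x[xf == "f",sacc.after.zero := sacc, by = vehicle.class]
# keep only the rows where the new column was filled in
output[!is.na(sacc.after.zero),]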