
Extracting values from rows which meet a condition in R

The data set

I have a big data frame with millions of rows and more than 20 columns. Let me first describe the data to make the question clearer. The original data frame consists of locations, velocities and accelerations of 2169 vehicles during a 15-minute period. Each vehicle has a unique Vehicle.ID, an ID of the time frame in which it was observed, i.e. Frame.ID, the velocity of the vehicle in that frame, i.e. svel, the acceleration of the vehicle in that frame, i.e. sacc, and the class of that vehicle, vehicle.class, i.e. 1 = motorcycle, 2 = car, 3 = truck. These variables were recorded every 0.1 seconds, i.e. each frame is 0.1 seconds. Here are the first 6 rows:

> dput(head(df))
structure(list(Vehicle.ID = c(2L, 2L, 2L, 2L, 2L, 2L), Frame.ID = 133:138, 
    Vehicle.class = c(2L, 2L, 2L, 2L, 2L, 2L), Lane = c(2L, 2L, 
    2L, 2L, 2L, 2L), svel = c(37.29, 37.11, 36.96, 36.83, 36.73, 
    36.64), sacc = c(0.07, 0.11, 0.15, 0.19, 0.22, 0.25)), .Names = c("Vehicle.ID", 
"Frame.ID", "Vehicle.class", "Lane", "svel", "sacc"), row.names = 7750:7755, class = "data.frame")

There are some instances in the vehicles' journeys during the 15-minute recording period where they completely stop, i.e. svel==0. This continues for some frames and then the vehicles gain speed again. For the purpose of reproducibility I am creating an example data set as follows:

x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
                    vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
                    svel = rep(c(1,0,0,0,3),5),
                    sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5))

What do I want to find?

As described above, some vehicles stop and have zero velocity for some time but later accelerate to get up to speed. I want to find the acceleration, sacc, they apply after having zero velocity for some time (moving from a standstill position). This means that I should be able to look at the FIRST row AFTER the last frame in which svel==0. In the example data this means that the car (vehicle.class==2) having Vehicle.ID==10 had a velocity, svel, equal to 1, as seen in the first row. Later, it stopped for 3 frames (3 consecutive rows) and then accelerated to a velocity, svel, equal to 3. I want the acceleration sacc it applied in those 2 frames (rows 4 and 5 for vehicle 10, which come out to be 0.002 and 0.500). This means that, for the example data, the following should be the output by vehicle.class:

output <- data.frame(Vehicle.ID = c(10,10,20,20,30,30,40,40,50, 50),
                     vehicle.class = c(2,2,2,2,3,3,3,3,1,1),
                     xf = rep(c('l','f'),5),
                     sacc = rep(c(0.002,0.500),5))

xf identifies the last row, l, in which svel==0, and f is the first one after that. I have tried using plyr and a for loop to split by vehicle.class but am not sure how to extract the sacc.

Note

  1. xf should be a part of the output. It is not in the given data.
  2. The original data frame df has 2169 vehicles; some stopped and some did not, so not all vehicles had svel==0.
  3. The vehicles which did stop didn't do so at the same time. Also, the number of rows in which svel==0 differs from vehicle to vehicle.

There may be a more elegant way to do this, but this works:

require(data.table)
x <- data.table(x)  ## much easier as data.table
x[, xf:='n']        ## create vector with 'n', neither first nor last

# create diff(svel) shifted upwards, 
# padding last observation with 0 to avoid cycling
x[, dsvel:=c(diff(svel, lag=1), 0), by=Vehicle.ID]

# svel is zero and dsvel positive at the last 0 value
x[svel==0 & dsvel > 0, xf:='l']

# there may be a better way to do this part
# get index of observation next to 'l'
# there is no risk of spilling to next Vehicle.ID,  
# because 'l' can only be second to last
i <- which(x$xf=='l') + 1
x[i, xf:='f']

That should give you the xf vector you want.
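Once xf is filled in, the rows the question asks for can be pulled out by filtering on it. The following is only a minimal sketch, assuming the example data x from the question (rebuilt here so the snippet runs on its own); res is a hypothetical name. Note that this approach flags every stop-to-go transition, so a vehicle that stops several times would contribute several l/f pairs rather than only the last one.

```r
library(data.table)

# Example data from the question, rebuilt as a data.table
x <- data.table(Vehicle.ID    = rep(c(10, 20, 30, 40, 50), each = 5),
                vehicle.class = rep(c(2, 2, 3, 3, 1), each = 5),
                svel          = rep(c(1, 0, 0, 0, 3), 5),
                sacc          = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5))

# Same steps as in the answer above
x[, xf := 'n']
x[, dsvel := c(diff(svel, lag = 1), 0), by = Vehicle.ID]
x[svel == 0 & dsvel > 0, xf := 'l']
x[which(x$xf == 'l') + 1, xf := 'f']

# Keep only the flagged rows and the columns of interest
res <- x[xf != 'n', .(Vehicle.ID, vehicle.class, xf, sacc)]
res
```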


Edit from Arun: +1 @ilir, a very nice answer. Here's another way you could do it using data.table's inbuilt variables .I and .N:

idx = x[, {
            ix = tail(.I[svel==0L], 1);
            iy = (ix+1L)*((ix+1L) <= .I[.N] | NA) 
            list(idx = c(ix, iy))
          }, by = list(Vehicle.ID, vehicle.class)]$idx

You can now subset with idx and add l and f with := as follows:

ans <- x[idx][, xf := c("l", "f")]
    Vehicle.ID vehicle.class svel  sacc xf
 1:         10             2    0 0.002  l
 2:         10             2    3 0.500  f
 3:         20             2    0 0.002  l
 4:         20             2    3 0.500  f
 5:         30             3    0 0.002  l
 6:         30             3    3 0.500  f
 7:         40             3    0 0.002  l
 8:         40             3    3 0.500  f
 9:         50             1    0 0.002  l
10:         50             1    3 0.500  f

.I contains the row numbers of x for each group. .N contains the number of observations for each group. Please read ?data.table for more.

ix gets the last occurrence of 0. For each group, we subset the row number corresponding to the last 0 using tail.

iy should normally be the next entry, ix+1L. But since the 0 may be the last entry for some group, we check for that by comparing (ix+1L) <= .I[.N]. If it's FALSE, ix is the last entry and so we have to output NA; otherwise we output (ix+1L).

HTH.
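The NA trick used for iy can be checked in isolation. It relies on R's three-valued logic: TRUE | NA is TRUE, while FALSE | NA is NA, and multiplying an integer by a logical NA gives NA. A small sketch, where next_row_or_na is a hypothetical helper written only for illustration and last stands in for .I[.N]:

```r
# Hypothetical helper mirroring the iy expression; `last` plays the role of .I[.N]
next_row_or_na <- function(ix, last) {
  (ix + 1L) * ((ix + 1L) <= last | NA)
}

inside   <- next_row_or_na(4L, 5L)  # next row (5) still belongs to the group
past_end <- next_row_or_na(5L, 5L)  # next row (6) would be past the group's end

inside    # 5
past_end  # NA
```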

I think I've come up with a reasonably elegant way of representing the problem with dplyr. For each car, we're interested in the rows where it's not stopped in this row, but was stopped in the previous row:

library(dplyr)
# as_tibble() replaces the deprecated tbl_df()
df <- as_tibble(data.frame(
  id = c(rep(10, 5), rep(20, 5), rep(30, 5), rep(40, 5), rep(50, 5)), 
  class = c(rep(2, 10), rep(3, 10), rep(1, 5)), 
  svel = rep(c(1, 0, 0, 0, 3), 5), 
  sacc = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
))

# %>% replaces the old %.% operator, which current dplyr no longer provides
df %>% group_by(id) %>% 
  mutate(stopped = svel == 0) %>%
  filter(lag(stopped) == TRUE, stopped == FALSE)

#> Source: local data frame [5 x 5]
#> Groups: id
#> 
#>   id class svel sacc stopped
#> 1 10     2    3  0.5   FALSE
#> 2 20     2    3  0.5   FALSE
#> 3 30     3    3  0.5   FALSE
#> 4 40     3    3  0.5   FALSE
#> 5 50     1    3  0.5   FALSE

You could write this a little more compactly as

df %>% group_by(id) %>% 
  mutate(stopped = svel == 0) %>%
  filter(lag(stopped), !stopped)

#> Source: local data frame [5 x 5]
#> Groups: id
#> 
#>   id class svel sacc stopped
#> 1 10     2    3  0.5   FALSE
#> 2 20     2    3  0.5   FALSE
#> 3 30     3    3  0.5   FALSE
#> 4 40     3    3  0.5   FALSE
#> 5 50     1    3  0.5   FALSE
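The filter above returns only the first moving row after a stop (the f rows). If the last stopped row (l) is wanted as well, as in the question's expected output, one possible extension is to label both with lead() and lag(). This is a sketch under the assumption of a current dplyr version; the xf labels follow the question, and res is a hypothetical name.

```r
library(dplyr)

df <- tibble(
  id    = rep(c(10, 20, 30, 40, 50), each = 5),
  class = rep(c(2, 2, 3, 3, 1), each = 5),
  svel  = rep(c(1, 0, 0, 0, 3), 5),
  sacc  = rep(c(0.3, 0.001, 0.001, 0.002, 0.5), 5)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    stopped = svel == 0,
    xf = case_when(
      # stopped now, moving in the next row: last row of a stop
      stopped & !lead(stopped, default = TRUE) ~ "l",
      # moving now, stopped in the previous row: first row after a stop
      !stopped & lag(stopped, default = FALSE) ~ "f"
    )
  ) %>%
  filter(!is.na(xf)) %>%
  ungroup()

res
```

The default arguments keep the first and last row of each vehicle from being compared against a missing neighbour, so a vehicle that is still stopped in its final frame gets no l row.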

Not sure I totally understand the question, but I think this is what you are after:

x <- data.frame(Vehicle.ID = c(rep(10,5), rep(20,5), rep(30,5), rep(40,5), rep(50,5)),
                vehicle.class = c(rep(2,10), rep(3,10),rep(1,5)),
                svel = rep(c(1,0,0,0,3),5),
                sacc = rep(c(0.3,0.001,0.001,0.002,0.5),5)
)

# find "l" rows, the last row for a given Vehicle.ID where svel==0
l <- FALSE
l[x$svel==0] <- !duplicated(x$Vehicle.ID[x$svel==0], fromLast=TRUE)
# extract all rows following an l row.
x[which(l) + 1, c(1, 2, 4)]
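The key step above is duplicated(..., fromLast = TRUE), which scans from the end so that only the last occurrence of each ID counts as not duplicated. A tiny sketch with made-up IDs (ids and is_last are hypothetical names) shows the effect:

```r
# Made-up vehicle IDs, one entry per stopped row
ids <- c(10, 10, 10, 20, 20)

# TRUE marks the last stopped row of each vehicle
is_last <- !duplicated(ids, fromLast = TRUE)
is_last
```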
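
# Assumes x already carries the xf column built in one of the answers above
library(data.table)
x = data.table(x)
# copy sacc into a new column on the rows flagged as first-after-stop
output = x[xf == "f",sacc.after.zero := sacc, by = vehicle.class]
output[!is.na(sacc.after.zero),]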

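Disclaimer: technical posts on this site follow the CC BY-SA 4.0 license. If you need to republish, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.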

 