Replacing row values with the closest conditional values in R

In R, I have a problem similar to the one presented in "Replacing NAs in R with nearest value". The differences are that the values I want to change are not NAs but any value less than or equal to 0, and that changing these values depends on the values in another column (so a conditional statement needs to be added). I'm having trouble understanding how to adapt the solutions presented in that question to my problem. Speed also matters, as I have a lot of data.

Sample data:

pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
df <- data.frame(pred_trip, locNumb)

So essentially, if a value in the locNumb column is <= 0 and the value in the pred_trip column of the same row is 0, then the locNumb value gets reassigned to the closest value that is greater than 0.

Desired output:

   pred_trip locNumb
1          0       2
2          0       2
3          0       2
4          0       2
5          0       2
6          1       2
7          1       2
8          1       2
9          1       2
10         0       3
11         0       3
12         0       3
13         1       0
14         1       0
15         1       4
16         0       4
17         0       4
18         0       4
19         0       4
20         0       4
21         0       4
22         0       4
23         0       4
24         0       4
25         0       4
26         0       4
27         1       0
28         1       0
29         1       0
30         1       5
31         1       5
32         1       5
33         1       5

I'm having trouble adapting the code from the similar question, as it relies heavily on is.na and doesn't include any of the other conditions that I need. In pseudocode it would be something like the following (I'm not sure where to add my other condition, pred_trip == 0):

f1 <- function(df) {
  x <- df$locNumb
  N <- length(x)
  na.pos <- which(x < 0)             # positions to replace
  if (length(na.pos) %in% c(0, N)) {
    return(df)
  }
  non.na.pos <- which(!(x < 0))      # positions that can supply a value
  # the pred_trip == 0 condition is still missing here
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  right.pos  <- non.na.pos[pmin(N, intervals + 1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  df$locNumb[na.pos] <- ifelse(left.dist <= right.dist,
                               x[left.pos], x[right.pos])
  return(df)
}
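
One way the function above might be completed in base R is sketched below. The helper name replace_with_nearest and its cond argument are hypothetical, not from the original post, and "nearest" is assumed to mean nearest by row position, with ties going to the earlier row.

```r
pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)

# Hypothetical helper: overwrite x[i] where x[i] <= 0 and cond[i] is TRUE
# with the value at the nearest position (by index) whose value is > 0.
replace_with_nearest <- function(x, cond) {
  target <- which(x <= 0 & cond)   # positions to overwrite
  donors <- which(x > 0)           # positions that may supply a value
  if (length(target) == 0 || length(donors) == 0) return(x)

  # number of donors at or before each target (0 means none to the left)
  left <- findInterval(target, donors)
  left_pos  <- donors[pmax(left, 1)]
  right_pos <- donors[pmin(left + 1, length(donors))]

  # a missing neighbour on one side gets an infinite distance
  left_dist  <- ifelse(left == 0, Inf, target - left_pos)
  right_dist <- ifelse(left == length(donors), Inf, right_pos - target)

  x[target] <- ifelse(left_dist <= right_dist, x[left_pos], x[right_pos])
  x
}

result <- replace_with_nearest(locNumb, pred_trip == 0)
```

Note that with a strict nearest-by-position rule, rows 25 and 26 become 5 (row 30 is closer than row 18), whereas the desired output in the question lists 4 for those rows. The desired output therefore looks closer to a forward fill than to a true nearest-neighbour rule.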

Here's one way to do it.

rle gives you a run-length encoding, in which you can replace the negative values with NA and then use the na.locf function from the zoo package to carry forward (and carry backward) the nearest non-negative values. Finally, inverse.rle rebuilds the full-length vector, which we can add to the original data.frame df as newlocNumb.

Any additional condition can then be used to copy some of the original values from the locNumb column back into the newlocNumb column.

require(zoo)
pred_trip <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
locNumb <- c(-1, -1, -1, -1, -1, 2, 2, 2, 2, 3, 3, 0, 0, 0, 4, 4, 4, 4, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5)
df <- data.frame(pred_trip, locNumb)

RLE <- rle(df$locNumb)

RLE
## Run Length Encoding
##   lengths: int [1:8] 5 4 2 3 4 5 6 4
##   values : num [1:8] -1 2 3 0 4 -1 0 5


RLE$values[RLE$values < 0] <- NA

while (any(is.na(RLE$values))) {
    RLE$values <- na.locf(na.locf(RLE$values, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}

df$newlocNumb <- inverse.rle(RLE)

df
##    pred_trip locNumb newlocNumb
## 1          0      -1          2
## 2          0      -1          2
## 3          0      -1          2
## 4          0      -1          2
## 5          0      -1          2
## 6          1       2          2
## 7          1       2          2
## 8          1       2          2
## 9          1       2          2
## 10         0       3          3
## 11         0       3          3
## 12         0       0          0
## 13         1       0          0
## 14         1       0          0
## 15         1       4          4
## 16         0       4          4
## 17         0       4          4
## 18         0       4          4
## 19         0      -1          4
## 20         0      -1          4
## 21         0      -1          4
## 22         0      -1          4
## 23         0      -1          4
## 24         0       0          0
## 25         0       0          0
## 26         0       0          0
## 27         1       0          0
## 28         1       0          0
## 29         1       0          0
## 30         1       5          5
## 31         1       5          5
## 32         1       5          5
## 33         1       5          5
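
The output above still contains zeros in rows where pred_trip is 0, because only negative values were replaced. A minimal sketch of how the na.locf idea could be combined with both conditions from the question follows. It assumes that every value <= 0 counts as missing and that the filled value is kept only where pred_trip == 0; the names filled and replace_row are illustrative.

```r
library(zoo)

pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
df <- data.frame(pred_trip, locNumb)

# Treat every non-positive value as missing
filled <- df$locNumb
filled[filled <= 0] <- NA

# Carry the last positive value forward, then back-fill the leading gap
filled <- na.locf(filled, na.rm = FALSE)
filled <- na.locf(filled, fromLast = TRUE, na.rm = FALSE)

# Accept the filled value only where both conditions hold
replace_row <- df$locNumb <= 0 & df$pred_trip == 0
df$newlocNumb <- ifelse(replace_row, filled, df$locNumb)
```

For this sample, newlocNumb matches the desired output in the question. Because this is a forward fill with a backward fill only for the leading gap, it is not a strict nearest-value rule, so it may differ from one on other data.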

The data.table library, which is also very memory-efficient, can be used here:

library(data.table)

# converting data.frame to data.table
dt <- data.table(df)

#assigning unique id to each row
dt[,grpno := .I]

# getting all the unique values from the data.table where locNumb > 0
positivelocNumb <- unique(dt[locNumb > 0])

# indexing by grpno, this will be used to help define nearest positive locnumb
setkeyv(positivelocNumb,c('grpno'))
setkeyv(dt,c('grpno'))

# nearest positive value join
dt2 <- positivelocNumb[dt, roll = "nearest"]

Output, where pred_trip.1 and locNumb.1 are the original values, and pred_trip and locNumb come from the row with the closest positive locNumb. You can exclude the pred_trip column from the merge by creating positivelocNumb as unique(dt[locNumb > 0,list(locNumb,grpno)]):

> dt2
    grpno pred_trip locNumb pred_trip.1 locNumb.1
 1:     1         1       2           0        -1
 2:     2         1       2           0        -1
 3:     3         1       2           0        -1
 4:     4         1       2           0        -1
 5:     5         1       2           0        -1
 6:     6         1       2           1         2
 7:     7         1       2           1         2
 8:     8         1       2           1         2
 9:     9         1       2           1         2
10:    10         0       3           0         3
11:    11         0       3           0         3
12:    12         0       3           0         0
13:    13         0       3           1         0
14:    14         1       4           1         0
15:    15         1       4           1         4
16:    16         0       4           0         4
17:    17         0       4           0         4
18:    18         0       4           0         4
19:    19         0       4           0        -1
20:    20         0       4           0        -1
21:    21         0       4           0        -1
22:    22         0       4           0        -1
23:    23         0       4           0        -1
24:    24         0       4           0         0
25:    25         1       5           0         0
26:    26         1       5           0         0
27:    27         1       5           1         0
28:    28         1       5           1         0
29:    29         1       5           1         0
30:    30         1       5           1         5
31:    31         1       5           1         5
32:    32         1       5           1         5
33:    33         1       5           1         5
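
The join above returns the nearest positive value for every row, so the question's conditions still have to be applied afterwards. A sketch of one way to do that follows. The column names nearest and newlocNumb and the table name positive are illustrative choices, not part of the answer above.

```r
library(data.table)

pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
dt <- data.table(pred_trip, locNumb)
dt[, grpno := .I]                       # row id used as the join key

# Lookup table of positive values, renamed to avoid a column name clash
positive <- dt[locNumb > 0, .(grpno, nearest = locNumb)]
setkey(positive, grpno)
setkey(dt, grpno)

# Rolling join: each row of dt picks up the nearest positive value by grpno
joined <- positive[dt, roll = "nearest"]

# Keep the original value unless locNumb <= 0 and pred_trip == 0
joined[, newlocNumb := ifelse(locNumb <= 0 & pred_trip == 0, nearest, locNumb)]
```

Since this is a true nearest-row match, rows 25 and 26 receive 5 rather than the 4 shown in the question's desired output, so it is worth checking which rule is actually intended.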
