Replacing row values with the closest conditional values in R
In R, I have a problem similar to the one presented here: Replacing NAs in R with nearest value. The differences are that the values I want to change are not NAs but any value less than 0, and that changing these values depends on values in another column (so a conditional statement needs to be added). I'm having trouble understanding how to adapt the solutions presented in that question to my problem. It's also important that this be fast, as I have a lot of data.
Sample data:
pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
df <- data.frame(pred_trip, locNumb)
So essentially, if a value in the locNumb column is <= 0 and the corresponding value in the pred_trip column is 0, then the locNumb value gets reassigned to the closest value that is greater than 0.
Desired output:
pred_trip locNumb
1 0 2
2 0 2
3 0 2
4 0 2
5 0 2
6 1 2
7 1 2
8 1 2
9 1 2
10 0 3
11 0 3
12 0 3
13 1 0
14 1 0
15 1 4
16 0 4
17 0 4
18 0 4
19 0 4
20 0 4
21 0 4
22 0 4
23 0 4
24 0 4
25 0 4
26 0 4
27 1 0
28 1 0
29 1 0
30 1 5
31 1 5
32 1 5
33 1 5
I'm having trouble adapting the code from the similar question, as it relies heavily on is.na and doesn't include any of the other conditions that I need. In pseudocode it would be something like the following (I'm not sure where to add my other condition, pred_trip == 0):
f1 <- function(df) {
  N <- nrow(df)
  # Rows to fill: locNumb <= 0 (TODO: also require pred_trip == 0)
  na.pos <- which(df$locNumb <= 0)
  if (length(na.pos) %in% c(0, N)) {
    return(df)
  }
  # Donor rows: locNumb > 0
  non.na.pos <- which(df$locNumb > 0)
  intervals <- findInterval(na.pos, non.na.pos,
                            all.inside = TRUE)
  left.pos <- non.na.pos[pmax(1, intervals)]
  right.pos <- non.na.pos[pmin(N, intervals + 1)]
  left.dist <- na.pos - left.pos
  right.dist <- right.pos - na.pos
  # Take the value from whichever donor row is closer
  df$locNumb[na.pos] <- ifelse(left.dist <= right.dist,
                               df$locNumb[left.pos], df$locNumb[right.pos])
  return(df)
}
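One way the pseudocode above might be completed is sketched below. It is only a sketch under stated assumptions: the function name fill_nearest is made up for illustration, "closest" is taken to mean closest by row position, and ties go to the earlier row. Note that under that reading rows 25 and 26 of the sample receive 5 (row 30 is closer than row 18), whereas the desired output above shows 4, so the desired output may really be describing a carry-forward rule.

```r
# Hypothetical helper: fill locNumb <= 0 with the nearest positive locNumb,
# but only on rows where pred_trip == 0.
fill_nearest <- function(df) {
  target <- which(df$locNumb <= 0 & df$pred_trip == 0)  # rows to fill
  donor  <- which(df$locNumb > 0)                       # rows to copy from
  if (length(target) == 0 || length(donor) == 0) {
    return(df)
  }
  # Index of the last donor at or before each target row (0 if none)
  idx <- findInterval(target, donor)
  left.pos  <- donor[pmax(1, idx)]
  right.pos <- donor[pmin(length(donor), idx + 1)]
  left.dist  <- abs(target - left.pos)
  right.dist <- abs(right.pos - target)
  # Ties go to the earlier (left) donor
  df$locNumb[target] <- ifelse(left.dist <= right.dist,
                               df$locNumb[left.pos], df$locNumb[right.pos])
  df
}

pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
res <- fill_nearest(data.frame(pred_trip, locNumb))
```

Everything here is vectorised (which, findInterval, ifelse), so it should stay reasonably fast on large inputs.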
Here's one way to do it. rle will give you run-length encodings, in which you can replace the negative values with NA and then use the na.locf function from the zoo package to carry forward (and carry backward) the nearest non-negative values. Finally, the inverse.rle function rebuilds the full vector, which we can add to the original data.frame df as newlocNumb.
As for the additional condition, it can be used to copy some of the original values from the locNumb column back into the newlocNumb column.
require(zoo)
pred_trip <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
locNumb <- c(-1, -1, -1, -1, -1, 2, 2, 2, 2, 3, 3, 0, 0, 0, 4, 4, 4, 4, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 5, 5, 5, 5)
df <- data.frame(pred_trip, locNumb)
RLE <- rle(df$locNumb)
RLE
## Run Length Encoding
## lengths: int [1:8] 5 4 2 3 4 5 6 4
## values : num [1:8] -1 2 3 0 4 -1 0 5
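# Mark negative run values as missing
RLE$values[RLE$values < 0] <- NA
# Carry the last non-missing value forward, then fill any leading NAs backward
while (any(is.na(RLE$values))) {
  RLE$values <- na.locf(na.locf(RLE$values, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}
# Expand the runs back into a full-length vector
df$newlocNumb <- inverse.rle(RLE)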
df
## pred_trip locNumb newlocNumb
## 1 0 -1 2
## 2 0 -1 2
## 3 0 -1 2
## 4 0 -1 2
## 5 0 -1 2
## 6 1 2 2
## 7 1 2 2
## 8 1 2 2
## 9 1 2 2
## 10 0 3 3
## 11 0 3 3
## 12 0 0 0
## 13 1 0 0
## 14 1 0 0
## 15 1 4 4
## 16 0 4 4
## 17 0 4 4
## 18 0 4 4
## 19 0 -1 4
## 20 0 -1 4
## 21 0 -1 4
## 22 0 -1 4
## 23 0 -1 4
## 24 0 0 0
## 25 0 0 0
## 26 0 0 0
## 27 1 0 0
## 28 1 0 0
## 29 1 0 0
## 30 1 5 5
## 31 1 5 5
## 32 1 5 5
## 33 1 5 5
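The output above still contains zeros on rows where pred_trip is 0, because only negative values were marked as NA and the pred_trip condition was never applied. A minimal sketch of how the additional condition might be folded in follows. It assumes that zeros should be treated as missing too and that original values are kept wherever the condition does not hold; the helper name filled is made up for illustration. The rle step is skipped here because na.locf can be applied to the full vector directly.

```r
library(zoo)

pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
df <- data.frame(pred_trip, locNumb)

# Treat every non-positive value as missing (assumption: zeros count too)
filled <- df$locNumb
filled[filled <= 0] <- NA

# Carry forward, then fill any leading NAs backward
filled <- na.locf(na.locf(filled, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)

# Use the filled value only where the condition holds; keep the original otherwise
df$newlocNumb <- ifelse(df$pred_trip == 0 & df$locNumb <= 0, filled, df$locNumb)
```

On the sample data this carry-forward variant appears to match the desired output listed in the question, including rows 24 to 26.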
The data.table library, which is also very memory-efficient, can be used here:
library(data.table)
# converting data.frame to data.table
dt <- data.table(df)
#assigning unique id to each row
dt[,grpno := .I]
# getting all the unique values from the data.table where locNumb > 0
positivelocNumb <- unique(dt[locNumb > 0])
# indexing by grpno, this will be used to help define nearest positive locnumb
setkeyv(positivelocNumb,c('grpno'))
setkeyv(dt,c('grpno'))
# nearest positive value join
dt2 <- positivelocNumb[dt, roll = "nearest"]
In the output, pred_trip.1 and locNumb.1 are the original values, and pred_trip and locNumb are the closest positive values. You can exclude the pred_trip column from the merge by creating positivelocNumb as unique(dt[locNumb > 0, list(locNumb, grpno)]).
> dt2
grpno pred_trip locNumb pred_trip.1 locNumb.1
1: 1 1 2 0 -1
2: 2 1 2 0 -1
3: 3 1 2 0 -1
4: 4 1 2 0 -1
5: 5 1 2 0 -1
6: 6 1 2 1 2
7: 7 1 2 1 2
8: 8 1 2 1 2
9: 9 1 2 1 2
10: 10 0 3 0 3
11: 11 0 3 0 3
12: 12 0 3 0 0
13: 13 0 3 1 0
14: 14 1 4 1 0
15: 15 1 4 1 4
16: 16 0 4 0 4
17: 17 0 4 0 4
18: 18 0 4 0 4
19: 19 0 4 0 -1
20: 20 0 4 0 -1
21: 21 0 4 0 -1
22: 22 0 4 0 -1
23: 23 0 4 0 -1
24: 24 0 4 0 0
25: 25 1 5 0 0
26: 26 1 5 0 0
27: 27 1 5 1 0
28: 28 1 5 1 0
29: 29 1 5 1 0
30: 30 1 5 1 5
31: 31 1 5 1 5
32: 32 1 5 1 5
33: 33 1 5 1 5
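The join above replaces every row with its nearest positive neighbour, regardless of pred_trip. A possible way to restrict the replacement to rows that meet the question's condition is sketched here; the column names nearest and newlocNumb are assumptions chosen for illustration, and the lookup table keeps only grpno and the donor value so that the join result has no clashing column names.

```r
library(data.table)

pred_trip <- c(0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1)
locNumb <- c(-1,-1,-1,-1,-1,2,2,2,2,3,3,0,0,0,4,4,4,4,-1,-1,-1,-1,-1,0,0,0,0,0,0,5,5,5,5)
dt <- data.table(pred_trip, locNumb)
dt[, grpno := .I]

# Lookup table of donor rows (locNumb > 0); `nearest` is a hypothetical name
pos <- dt[locNumb > 0, .(grpno, nearest = locNumb)]

# Rolling join: for each row of dt, pick the donor with the closest grpno
nearest <- pos[dt, on = "grpno", roll = "nearest"]$nearest

# Replace only where pred_trip == 0 and locNumb <= 0; keep other rows unchanged
dt[, newlocNumb := ifelse(pred_trip == 0 & locNumb <= 0, nearest, locNumb)]
```

Because the result is added by reference with :=, no copy of dt is made, which may matter on large data.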