I have a largish data frame in R [300000, 45]. I want to add a column (or create a vector) of TRUE/FALSE where a TRUE is assigned if the value of another column is different than the value above (i-1) and FALSE if they are the same. The basic R code would be:
etS$ar1TF <- NA
mode(etS$ar1TF) <- 'logical'
etS$ar1TF[1] <- TRUE
for(i in 2:length(etS$ar1TF)) {
  if(etS$siteYear[i] == etS$siteYear[i-1]) {
    etS$ar1TF[i] <- FALSE
  } else {
    etS$ar1TF[i] <- TRUE
  }
}
However, this will be incredibly slow and inefficient. Are there better ways to use existing functions or vectorization to do this quickly and efficiently? I'm not sure if a while() statement would be any more efficient. I suppose I could start by assigning everything as TRUE, then use the if statement within a for loop and remove the else statement, but this really isn't much better. I'm not sure if the apply function would be faster or more efficient in this case because the size and type are already assigned.
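Make use of vectorization. Something like the following will do the trick:
siteYear <- etS$siteYear
ar1TF <- logical(length(siteYear))
ar1TF[-1] <- siteYear[-1] != siteYear[-length(siteYear)]  # compare each value with the one above it
ar1TF[1] <- NA  # the first row has no predecessor; assign TRUE here instead to match the loop in the question
etS$ar1TF <- ar1TF  # add the column to the data.frame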
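To make the shifted comparison concrete, here is a minimal sketch on a toy vector. The values are hypothetical stand-ins for etS$siteYear, and the first element is set to TRUE as in the loop from the question rather than NA:

```r
# Toy stand-in for etS$siteYear (hypothetical values)
siteYear <- c("A_2010", "A_2010", "B_2010", "B_2010", "B_2011")

# Elements 2..n are compared against elements 1..(n-1);
# the first element has no predecessor, so it is set to TRUE by hand.
changed <- c(TRUE, siteYear[-1] != siteYear[-length(siteYear)])

print(changed)
# [1]  TRUE FALSE  TRUE FALSE  TRUE
```

Dropping the first element gives "this row", dropping the last gives "the row above", and a single != compares all pairs at once.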
EDIT: It seems that the diff solution may be a bit faster:
x <- sample(1:3, 100000, replace=TRUE)
library('microbenchmark')
microbenchmark({
y1 <- logical(length(x))
y1[-1] <- (x[-1] != x[-length(x)])
y1[1] <- NA
},{
y2 <- diff(x)
y2 <- c(NA, y2 != 0)
})
## Unit: microseconds
## expr min lq median uq max neval
## [!=] 1062.651 1070.690 1088.1935 1169.500 2367.582 100
## [diff] 811.121 821.443 844.3575 892.967 2244.022 100
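Before comparing speed it is worth confirming that the two expressions produce the same result. The following is a small sketch of such a check; the seed and the vector length are arbitrary choices, not part of the benchmark above:

```r
set.seed(1)  # arbitrary seed, only so the check is reproducible
x <- sample(1:3, 1000, replace = TRUE)

# Shifted-comparison version
y1 <- logical(length(x))
y1[-1] <- x[-1] != x[-length(x)]
y1[1] <- NA

# diff-based version
y2 <- c(NA, diff(x) != 0)

# Both are logical vectors with NA in the first position
same <- identical(y1, y2)
print(same)
# [1] TRUE
```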
You could use diff to perform the differencing:
vec = sample(1:10, 100, replace = TRUE)
diff(vec) == 0
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[73] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE
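The default setting of diff uses a lag of 1, which is what you need. Note that diff(vec) == 0 is TRUE where a value equals the one above it; for the flag described in the question (TRUE where the value changes), compare with != 0 instead. To add it to your data.frame, you need to prepend an NA, because the first row has no predecessor:
df$new_col = c(NA, diff(vec) != 0)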
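One caveat: diff only works on numeric input. The question does not say what type siteYear is, so the sketch below assumes it might be a character or factor column and uses made-up values. The shifted comparison works on both types directly, while diff needs the values mapped to integer codes first:

```r
site   <- c("a", "a", "b", "b", "c")  # hypothetical character column
site_f <- factor(site)                # the same data stored as a factor

# The shifted comparison flags a change for character and factor input alike
chr_flag <- c(NA, site[-1] != site[-length(site)])
fct_flag <- c(NA, site_f[-1] != site_f[-length(site_f)])

# diff() needs numbers, so convert the factor to its integer codes first
code_flag <- c(NA, diff(as.integer(site_f)) != 0)

print(code_flag)
# [1]    NA FALSE  TRUE FALSE  TRUE
```

All three vectors are identical here; which form is preferable depends on the column type and on whether the conversion cost matters.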
Some basic timings show that this is quite fast, also for larger vectors:
> system.time(dum <- diff(sample(1:10, 10e3, replace = TRUE)) == 0)
user system elapsed
0.001 0.000 0.001
> system.time(dum <- diff(sample(1:10, 10e5, replace = TRUE)) == 0)
user system elapsed
0.189 0.012 0.202
> system.time(dum <- diff(sample(1:10, 10e7, replace = TRUE)) == 0)
user system elapsed
6.810 1.908 10.376
So, with your data size the processing time should be less than a second. Note that these times include creating the test dataset, so the actual differencing is almost twice as fast.
Performing a direct comparison with a for-loop-based solution shows the difference in speed:
diff_for_loop = function(vec) {
  # logical result; the first element has no predecessor, so it stays NA
  result_vec = rep(NA, length(vec))
  for(i in seq_along(vec)[-1]) {
    if(vec[i] == vec[i-1]) {
      result_vec[i] <- FALSE
    } else {
      result_vec[i] <- TRUE
    }
  }
  return(result_vec)
}
vec = sample(1:10, 10e5, replace = TRUE)
system.time(dum_for_loop <- diff_for_loop(vec))
# user system elapsed
# 1.220 0.008 1.232
system.time(dum_diff <- diff(vec) == 0)
# user system elapsed
# 0.051 0.005 0.056
This makes the diff-based solution about 22 times faster.