
Moving values between rows without a for loop in R

I have written some code to organize data sampled at different frequencies, but I made extensive use of for-loops, which slow the code down significantly when the data set is large. I've been going through my code, finding ways to remove for-loops to speed it up, but one of the loops has me stumped.

As an example, let's say the data was sampled at 3 Hz, so I get three rows for every second of data. However, the variables A, B, and C are each sampled at 1 Hz, so I get one value every three rows for each of them. The variables are sampled consecutively within each one-second period, giving the data a diagonal structure.

To further complicate things, sometimes a row is lost in the original data set.

My goal is this: having identified the rows that I wish to keep, I want to move the non-NA values from the subsequent rows up into the keeper rows. If it weren't for the lost-data issue, I would always keep the row containing a value for the first variable; when one of those rows is lost, I keep the next row instead.

In the example below, the sixth sample and the tenth sample are lost.

A <- c(1, NA, NA, 4, NA, 7, NA, NA, NA, NA)
B <- c(NA, 2, NA, NA, 5, NA, 8, NA, 11, NA)
C <- c(NA, NA, 3, NA, NA, NA, NA, 9, NA, 12)

test_df <- data.frame(A = A, B = B, C = C)

test_df
     A  B  C
 1   1 NA NA
 2  NA  2 NA
 3  NA NA  3
 4   4 NA NA
 5  NA  5 NA
 6   7 NA NA
 7  NA  8 NA
 8  NA NA  9
 9  NA 11 NA
10  NA NA 12

keep_rows <- c(1, 4, 6, 9)
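
For reference, here is one way keep_rows could be derived programmatically rather than by hand. This is just a sketch, assuming each row holds exactly one non-NA value and that a new one-second group starts whenever the variable index fails to advance:

var_idx <- max.col(1 * !is.na(test_df))          # which variable each row holds (1 = A, 2 = B, 3 = C)
keep_rows <- which(c(TRUE, diff(var_idx) <= 0))  # a new second starts when the index does not increase
keep_rows
# [1] 1 4 6 9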

After I move the values up into the keeper rows, I will remove the interim rows, resulting in the following:

test_df <- test_df[keep_rows, ]
test_df
   A  B  C
1  1  2  3
4  4  5 NA
6  7  8  9
9 NA 11 12

In the end, I only want one row for each second of data, and NA values should only remain where a row of the original data was lost.

Does anyone have any ideas on how to move the data up without using a for-loop? I'd appreciate any help! Sorry if this question is too wordy; I wanted to err on the side of too much information rather than not enough.

This should do it:

# Shift B up one row and C up two rows so each second's samples line up,
# then drop the rows that are entirely NA and restore the column names
test_df = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
test_df = data.frame(test_df[!apply(test_df, 1, function(x) all(is.na(x))), ])
colnames(test_df) = c('A', 'B', 'C')
> test_df
   A  B  C
1  1  2  3
2  4  5 NA
3  7  8  9
4 NA 11 12

And if you want something even faster:

# Replaces the apply() filter above: keep the rows that are not entirely NA
test_df = data.frame(test_df[rowSums(is.na(test_df)) != ncol(test_df), ])

Building on the great answer by @John Colby, we can get rid of the apply step and speed it up quite a bit (about 20x):

# Create a bigger test set 
A <- c(1, NA, NA, 4, NA, 7, NA, NA, NA, NA)
B <- c(NA, 2, NA, NA, 5, NA, 8, NA, 11, NA)
C <- c(NA, NA, 3, NA, NA, NA, NA, 9, NA, 12)
n=1e6; test_df = data.frame(A=rep(A, len=n), B=rep(B, len=n), C=rep(C, len=n))

# John Colby's method, 9.66 secs
system.time({
  df1 = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  df1 = data.frame(df1[!apply(df1, 1, function(x) all(is.na(x))), ])
  colnames(df1) = c('A', 'B', 'C')
})

# My method, 0.48 secs
system.time({
  df2 = with(test_df, data.frame(A=A[1:(length(A)-2)], B=B[2:(length(B)-1)], C=C[3:length(C)]))
  df2 = df2[is.finite(with(df2, A|B|C)),]
  row.names(df2) <- NULL
})

identical(df1, df2) # TRUE

The trick here is that A|B|C is NA only when all three values are NA, at least for data like this where every real value is non-zero (FALSE|NA is also NA, so a row of zeros and NAs would be dropped as well). This turns out to be much faster than calling all(is.na(x)) on each row of a matrix with apply.
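
A quick illustration of the three-valued logic this relies on (plain R, just showing the NA-propagation rules):

NA | NA          # NA    -- no non-NA value to decide the result
NA | TRUE        # TRUE  -- a single TRUE settles it, so any non-zero value keeps the row
NA | FALSE       # NA    -- the caveat above: a zero behaves like an NA here
is.finite(NA)    # FALSE -- so all-NA rows are filtered out
is.finite(TRUE)  # TRUE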

EDIT: @John has a different approach that also speeds things up. I added some code to turn the result into a data.frame with the correct names and timed it; it seems to be pretty much the same speed as my solution.

# John's method, 0.50 secs
system.time({
  test_m = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  test_m[is.na(test_m)] <- -1
  test_m <- test_m[rowSums(test_m) > -3,]
  test_m[test_m == -1] <- NA
  df3 <- data.frame(test_m)
  colnames(df3) = c('A', 'B', 'C')
})

identical(df1, df3) # TRUE

EDIT AGAIN: @John Colby's updated answer is even faster!

# John Colby's method, 0.39 secs
system.time({
  df4 = with(test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]))
  df4 = data.frame(df4[rowSums(is.na(df4)) != ncol(df4), ])
  colnames(df4) = c('A', 'B', 'C')
})

identical(df1, df4) # TRUE

So your question is really just about moving the values up without a loop; apparently you've already solved the first step of identifying which rows to keep.

> test_m <- with( test_df, cbind(A[1:(length(A)-2)], B[2:(length(B)-1)], C[3:length(C)]) )
> test_m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   NA   NA   NA
[3,]   NA   NA   NA
[4,]    4    5   NA
[5,]   NA   NA   NA
[6,]    7    8    9
[7,]   NA   NA   NA
[8,]   NA   11   12

This is now a matrix, so you can easily eliminate the rows that contain no data points without a loop. If you want the result back as a data.frame you could use a different method, but this one will run the fastest for a large amount of data. I like to just make the NAs an impossible value... perhaps -1, but you'll know best for your data... perhaps -pi.

test_m[is.na(test_m)] <- -1

Now just select the rows based on a property of those impossible numbers:

test_m <- test_m[rowSums(test_m) > -3,]  # an all-NA row became (-1, -1, -1), which sums to exactly -3

And, if you want, you can put the NAs back:

test_m[test_m == -1] <- NA
test_m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5   NA
[3,]    7    8    9
[4,]   NA   11   12

There's no loop (neither for nor apply), and the one function applied across the rows of the matrix, rowSums, is specially optimized and runs very fast.
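
For reference, the whole approach can be wrapped up in a small helper. This is a minimal sketch, not code from the answers above: collapse_rows is a made-up name, the shifts assume exactly three consecutively sampled variables named A, B, and C as in the question, and the rowSums filter assumes the real values are positive, as in the example data.

collapse_rows <- function(df) {
  n <- nrow(df)
  # shift B up one row and C up two rows so each second's samples align
  m <- with(df, cbind(A = A[1:(n - 2)], B = B[2:(n - 1)], C = C[3:n]))
  m[is.na(m)] <- -1                        # mark NAs with an impossible value
  m <- m[rowSums(m) > -3, , drop = FALSE]  # all-NA rows sum to exactly -3
  m[m == -1] <- NA                         # put the NAs back
  data.frame(m)
}

collapse_rows(test_df)
#    A  B  C
# 1  1  2  3
# 2  4  5 NA
# 3  7  8  9
# 4 NA 11 12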
