
Sieve out non-NA entries from a data frame while retaining rows that contain only NA

I am looking for a more concise way (fewer lines of code) to convert a data.frame from:

#   V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1  1  2  3 NA NA NA NA NA NA
# 2 NA NA NA  3  2  1 NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA  1  2  3

to

#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    3    2    1
#[3,]   NA   NA   NA
#[4,]   NA   NA   NA
#[5,]    1    2    3

That is, I want to remove excess NAs but correctly represent rows with only NAs.

I wrote the following function, which does the job, but I am sure there is a shorter way of achieving the same result.

#Dummy data.frame
data <- matrix(c(1:3, rep(NA, 6), 
          rep(NA, 3), 3:1, rep(NA, 3), 
          rep(NA, 9),
          rep(NA, 9),
          rep(NA, 6), 1:3),
          byrow=TRUE, ncol=9)
data <- as.data.frame(data)

sieve <- function(data) {

        #get a list of all entries that are not NA
        cond <- apply(data, 1, function(x) x[!is.na(x)])
        #set integer(0) equal to NA
        cond[sapply(cond, function(x) length(x)==0)] <- NA

        #check how many items there are in non-empty rows
        #(rows are either empty or contain the same number of items)
        n <- max(sapply(cond, length))

        #replace single NA with n NAs, where n = number of items
        #first get an index of entries with single NAs
        index <- (1:length(cond)) [sapply(cond, function(x) length(x)==1)]
        #then replace each entry with n NAs
        for (i in index) cond[[i]]  <- rep(NA, n)

        #turn list into a matrix (one row per original row)
        cond <- matrix(unlist(cond), nrow=length(cond), byrow=TRUE)
        cond
}

sieve(data)

My question resembles this question about extracting conditions to which participants are assigned (for which I received great answers). I tried expanding these answers to the current dummy data, but without success so far. Hence my rather lengthy custom function.


Edit: Additional info on why I am asking this question: The first data frame represents the raw output from an experiment in which I assigned participants to one of three conditions (using 3 here for simplicity). In each condition, participants read a different scenario, but then answered the same set of questions about the scenario they had read. Qualtrics recorded answers from participants in the first condition in columns V1 through V3, answers from participants in the second condition in columns V4 through V6, and answers from participants in the third condition in columns V7 through V9. (If this block of questions had contained 4 questions, it would have been columns V1 through V4 for answers from participants in the first condition, V5 through V8 for answers from participants in the second condition ...).
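
To make the column layout described above concrete, here is a minimal sketch. The helper name condition_cols and its arguments are hypothetical (they are not part of Qualtrics or of the code above), and it assumes each condition occupies one consecutive block of columns, with one column per question.

```r
# Hypothetical helper: column indices holding the answers for one condition,
# assuming every condition takes up n_questions consecutive columns.
condition_cols <- function(condition, n_questions) {
  start <- (condition - 1) * n_questions + 1
  seq(start, length.out = n_questions)
}

condition_cols(2, 3)  # 4 5 6    (V4 through V6)
condition_cols(2, 4)  # 5 6 7 8  (V5 through V8)
```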


You can try this if every row that isn't entirely NA contains the same number of non-NA values (here d is the data frame from the question, which is called data there):

First, create a data frame with the appropriate (transposed) dimensions, and fill it with NAs.

d2 <- data.frame(
        matrix(nrow = max(apply(d, 1, function(ii) sum(!is.na(ii)))),
               ncol=nrow(d)))

Then use apply to fill that data frame, and transpose it to get your desired outcome:

d2[] <- apply(d, 1, function(ii) ii[!is.na(ii)])
t(d2)
#   [,1] [,2] [,3]
#X1    1    2    3
#X2    3    2    1
#X3   NA   NA   NA
#X4   NA   NA   NA
#X5    1    2    3
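
As an alternative, here is a sketch that pre-allocates an all-NA matrix and fills only the rows that contain values. This is not the approach shown above but a different technique built on the same assumption (each row has either no non-NA entries or exactly n of them). The name sieve_short is made up for illustration.

```r
# Dummy data from the question
data <- as.data.frame(matrix(c(1:3, rep(NA, 6),
                               rep(NA, 3), 3:1, rep(NA, 3),
                               rep(NA, 9),
                               rep(NA, 9),
                               rep(NA, 6), 1:3),
                             byrow = TRUE, ncol = 9))

# Hypothetical helper; assumes every row holds either 0 or n non-NA values
sieve_short <- function(df) {
  m <- as.matrix(df)
  counts <- rowSums(!is.na(m))      # non-NA entries per row
  n <- max(counts)                  # width of the result
  out <- matrix(NA, nrow = nrow(m), ncol = n)
  keep <- counts == n               # rows that actually contain values
  tm <- t(m)                        # transpose so values are read row by row
  out[keep, ] <- matrix(tm[!is.na(tm)], ncol = n, byrow = TRUE)
  out
}

sieve_short(data)
```

With the dummy data this should give the 5 x 3 matrix asked for in the question, with rows 3 and 4 left as NA. If some row has a different number of non-NA values, the assumption is violated and the result will not be meaningful.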
