Combining more than 2 columns by removing NA's in R

At first sight this looks like a duplicate of "Combine/merge columns while avoiding NA?", but in fact it isn't: I sometimes have to deal with more than two columns instead of just two.

My dataframe looks like this:

     col1 col2 col3 col4 col5
[1,]    1   NA   NA   13   NA
[2,]   NA   NA   10   NA   18
[3,]   NA    7   NA   15   NA
[4,]    4   NA   NA   16   NA

Now I want to "collapse" this dataframe into one with fewer columns and with the NAs removed. In effect I am looking for the "Excel way" of doing it: delete a cell, and the rest of the row shifts one cell to the left.

The result in this example case would be:

     col1 col2 
[1,]    1   13   
[2,]   10   18   
[3,]    7   15   
[4,]    4   16   

Does anyone have an idea how to do this in R? Many thanks in advance!

You can use apply for this. If df is your dataframe:

df2 <- apply(df,1,function(x) x[!is.na(x)])
df3 <- data.frame(t(df2))
colnames(df3) <- colnames(df)[1:ncol(df3)]

Output:

#      col1 col2
#         1   13
#        10   18
#         7   15
#         4   16
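For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the example data from the question and runs the same apply call. The names vals and res are only illustrative. One assumption worth stating: apply simplifies to a matrix only when every row has the same number of non-NA values, so the sketch checks for that and otherwise keeps the list that apply returns.

```r
# Rebuild the example data from the question
df <- data.frame(
  col1 = c(1, NA, NA, 4),
  col2 = c(NA, NA, 7, NA),
  col3 = c(NA, 10, NA, NA),
  col4 = c(13, NA, 15, 16),
  col5 = c(NA, 18, NA, NA)
)

# Keep the non-NA values of each row, in their original left-to-right order
vals <- apply(df, 1, function(x) x[!is.na(x)])

# apply() gives a matrix only if every row yields the same number of values
if (is.matrix(vals)) {
  res <- data.frame(t(vals))
  colnames(res) <- colnames(df)[seq_len(ncol(res))]
} else {
  res <- vals  # ragged rows: apply() returned a list, keep it as is
}
res
```

With ragged rows you would have to pad each list element to a common length before binding them into a data frame.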

You can use apply and na.exclude:

DF
##   V1 V2 V3 V4 V5
## 1  1 NA NA 13 NA
## 2 NA NA 10 NA 18
## 3 NA  7 NA 15 NA
## 4  4 NA NA 16 NA

t(apply(DF, 1, na.exclude))
##      [,1] [,2]
## [1,]    1   13
## [2,]   10   18
## [3,]    7   15
## [4,]    4   16

If you want to keep the dimensions of the data.frame the same, you can use sort with na.last = TRUE instead. This also handles cases where different rows have unequal numbers of non-NA values. Note that sort also puts the non-NA values in increasing order, so it matches a plain left shift only when each row's values are already increasing, as they are here.

t(apply(DF, 1, sort, na.last = TRUE))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   13   NA   NA   NA
## [2,]   10   18   NA   NA   NA
## [3,]    7   15   NA   NA   NA
## [4,]    4   16   NA   NA   NA
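If the left-to-right order of the values matters, one possible variant (not part of the answer above) is to index each row with order(is.na(x)), which moves the NAs to the end and leaves the remaining values in their original order. The sketch below uses a hypothetical variation of the data, with 20 in place of 10 in the second row, so that the difference from sort becomes visible.

```r
# Hypothetical variation of the example data: row 2 now holds 20 then 18,
# so its values are not in increasing order
DF <- data.frame(
  V1 = c(1, NA, NA, 4),
  V2 = c(NA, NA, 7, NA),
  V3 = c(NA, 20, NA, NA),
  V4 = c(13, NA, 15, 16),
  V5 = c(NA, 18, NA, NA)
)

# order(is.na(x)) puts FALSE (non-NA) before TRUE (NA);
# ties keep their original order, so the values are only shifted left
shifted <- t(apply(DF, 1, function(x) x[order(is.na(x))]))

# sort() additionally reorders the non-NA values
sorted <- t(apply(DF, 1, sort, na.last = TRUE))

as.numeric(shifted[2, 1:2])  # 20 18
as.numeric(sorted[2, 1:2])   # 18 20
```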

This function is a bit long-winded, but (1) it will be faster in the long run and (2) it offers a good amount of flexibility:

myFun <- function(inmat, outList = TRUE, fill = NA, origDim = FALSE) {
  ## Split up the data by row and isolate the non-NA values
  myList <- lapply(sequence(nrow(inmat)), function(x) {
    y <- inmat[x, ]
    y[!is.na(y)]
  })
  ## If a `list` is all that you want, the function stops here
  if (isTRUE(outList)) {
    myList
  } else {
    ## If you want a matrix instead, it goes on like this
    Len <- vapply(myList, length, 1L)
    ## The new matrix can be either just the number of columns required
    ##   or it can have the same number of columns as the input matrix
    if (isTRUE(origDim)) Ncol <- ncol(inmat) else Ncol <- max(Len)
    Nrow <- nrow(inmat)
    M <- matrix(fill, ncol = Ncol, nrow = Nrow)
    M[cbind(rep(sequence(Nrow), Len), sequence(Len))] <- 
      unlist(myList, use.names=FALSE)
    M
  }
}
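The least obvious step in the function is the assignment through cbind(rep(...), sequence(...)). As a small isolated sketch of that step, with made-up lengths and values that are only for illustration: a two-column matrix used as an index addresses one (row, column) cell per line, and rep plus sequence generate exactly those coordinates from the per-row counts.

```r
# Hypothetical per-row counts: row 1 has no values, row 2 has three, row 3 has two
Len <- c(0L, 3L, 2L)
vals <- c(10, 24, 18, 17, 25)     # the non-NA values, listed row by row

rows <- rep(seq_along(Len), Len)  # 2 2 2 3 3: row index of each value
cols <- sequence(Len)             # 1 2 3 1 2: position of each value within its row

# Start from a matrix filled with NA, then fill one cell per (row, col) pair
M <- matrix(NA_real_, nrow = length(Len), ncol = max(Len))
M[cbind(rows, cols)] <- vals
M
```

Because the whole assignment happens in one vectorized step, no per-row padding is needed, which is presumably why the function takes this route instead of binding rows together one at a time.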

To test it out, let's create a function to make up some dummy data:

makeData <- function(nrow = 10, ncol = 5, pctNA = .8, maxval = 25) {
  a <- nrow * ncol
  m <- matrix(sample(maxval, a, TRUE), ncol = ncol)
  m[sample(a, a * pctNA)] <- NA
  m
}

set.seed(1)
m <- makeData(nrow = 5, ncol = 4, pctNA=.6)
m
#      [,1] [,2] [,3] [,4]
# [1,]   NA   NA   NA   NA
# [2,]   10   24   NA   18
# [3,]   NA   17   NA   25
# [4,]   NA   16   10   NA
# [5,]   NA    2   NA   NA

... and apply it:

myFun(m)
# [[1]]
# integer(0)
# 
# [[2]]
# [1] 10 24 18
# 
# [[3]]
# [1] 17 25
# 
# [[4]]
# [1] 16 10
# 
# [[5]]
# [1] 2

myFun(m, outList = FALSE)
#      [,1] [,2] [,3]
# [1,]   NA   NA   NA
# [2,]   10   24   18
# [3,]   17   25   NA
# [4,]   16   10   NA
# [5,]    2   NA   NA

## Try also
## myFun(m, outList = FALSE, origDim = TRUE)

And let's run some timings on bigger data, in comparison with the other answers so far:

set.seed(1)
m <- makeData(nrow = 1e5, ncol = 5, pctNA = .75)

## Will return a matrix
funCP <- function(inmat) t(apply(inmat, 1, sort, na.last = TRUE))
system.time(funCP(m))
#    user  system elapsed 
#   9.776   0.000   9.757 

## Will return a list in this case
funJT <- function(inmat) apply(inmat, 1, function(x) x[!is.na(x)])
system.time(JT <- funJT(m))
#    user  system elapsed 
#   0.577   0.000   0.575 

## Output a list
system.time(AM <- myFun(m))
#    user  system elapsed 
#   0.469   0.000   0.466 

identical(JT, AM)
# [1] TRUE

## Output a matrix
system.time(myFun(m, outList=FALSE, origDim=TRUE))
#    user  system elapsed 
#   0.610   0.000   0.612 

So the list output appears slightly faster than @JT85's solution, and the matrix output slightly slower. But compared to using sort row by row, this is a definite improvement.
