
Data.frame filtering

I have the following data.frame df:

df = data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                col2    = c('a','a','a','b','b','b','b','a','a'),
                height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                col3    = 1:9)

#  col1 col2 height1 height2 col3
#1    a    a      NA    31.0    1
#2    a    a      32    31.5    2
#3    a    a      NA      NA    3
#4    a    b      NA      NA    4
#5    a    b      NA    11.0    5
#6    b    b      NA    12.0    6
#7    b    b      NA    13.0    7
#8    c    a      25      NA    8
#9    d    a      NA      NA    9

For each pair of values (col1, col2), I want to build a column height as follows:

  • If there are only NA values in height1 and height2, return NA.
  • If there is a value in height1, take this value (for a given pair (col1, col2), there is at most one non-NA value in height1).
  • If there are only NA values in height1 and some non-NA values in height2, take the first non-NA value in height2.

I also need to keep the corresponding value of column col3.
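As a sketch of these rules for a single (col1, col2) group: the helper below returns the position of the row to keep, so the matching col3 value comes along with it. The name pick_row is hypothetical, and keeping the first row when everything is NA is an assumption taken from the expected output further down.

```r
# Hypothetical helper: position of the row to keep within one group
pick_row <- function(height1, height2) {
  # a value in height1 wins (at most one per group, by assumption)
  if (any(!is.na(height1))) return(which(!is.na(height1))[1])
  # otherwise the first non-NA value in height2
  if (any(!is.na(height2))) return(which(!is.na(height2))[1])
  # only NA in both columns: keep the first row (assumption)
  1L
}

pick_row(c(NA, 32, NA), c(31, 31.5, NA))  # group (a, a)
pick_row(c(NA, NA), c(NA, 11))            # group (a, b)
pick_row(NA, NA)                          # group (d, a)
```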

The new data.frame new.df should look like:

#  col1 col2 height col3
#1    a    a     32    2
#2    a    b     11    5
#3    b    b     12    6
#4    c    a     25    8
#5    d    a     NA    9

I would prefer a concise data.frame approach, but I have not been able to find one.

Maybe not the elegant solution you are looking for, but here is a base R option:

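# Split by (col1, col2) group, pick one row per group, then bind the rows back together
do.call("rbind",
        lapply(split(df, paste0(df$col1, df$col2)),
               function(tab) {
                 # Give both height columns the same name so either can be returned as "height"
                 colnames(tab)[3:4] <- "height"
                 out <- if (any(!is.na(tab[, 3]))) {
                           # height1 has a value: keep that row, drop height2
                           tab[which(!is.na(tab[, 3])), -4]
                        } else {
                           if (any(!is.na(tab[, 4]))) {
                              # only height2 has values: keep the first one, drop height1
                              tab[which(!is.na(tab[, 4]))[1], c(1:2, 4:5)]
                           } else {
                              # everything is NA: keep the first row
                              tab[1, -4]
                           }
                        }
                 return(out)
               }
        )
      )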

#       col1 col2 height col3
#    aa    a    a     32    2
#    ab    a    b     11    5
#    bb    b    b     12    6
#    ca    c    a     25    8
#    da    d    a     NA    9
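A more compact base R variant, offered only as a sketch and not part of the answer above: rank each row by which height column is filled, sort on that rank within each (col1, col2) pair, and keep the first row per pair with duplicated(). The names prio and sorted are made up for illustration, and keeping the first row of an all-NA group is an assumption.

```r
df <- data.frame(col1    = c('a','a','a','a','a','b','b','c','d'),
                 col2    = c('a','a','a','b','b','b','b','a','a'),
                 height1 = c(NA,32,NA,NA,NA,NA,NA,25,NA),
                 height2 = c(31,31.5,NA,NA,11,12,13,NA,NA),
                 col3    = 1:9)

# 1 = height1 present, 2 = only height2 present, 3 = both NA
prio <- ifelse(!is.na(df$height1), 1L, ifelse(!is.na(df$height2), 2L, 3L))

# height1 if available, otherwise height2 (which may itself be NA)
df$height <- ifelse(is.na(df$height1), df$height2, df$height1)

# Sort by group, then priority; the row index breaks ties in original order
sorted <- df[order(df$col1, df$col2, prio, seq_len(nrow(df))), ]

# Keep the first row of each (col1, col2) pair
new.df <- sorted[!duplicated(sorted[c("col1", "col2")]),
                 c("col1", "col2", "height", "col3")]
rownames(new.df) <- NULL
new.df
```

Using the row index as the last sort key is what makes "first value in height2" mean first in the original row order rather than smallest.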

With dplyr:

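library(dplyr)

df %>%
  mutate(
    # 1 = height1 present, 2 = only height2 present, 3 = both NA
    order  = ifelse(!is.na(height1), 1, ifelse(!is.na(height2), 2, 3)),
    height = ifelse(!is.na(height1), height1, ifelse(!is.na(height2), height2, NA)),
    row    = row_number()   # original row position, so ties keep the first height2
    ) %>%
  arrange(col1, col2, order, row) %>%
  distinct(col1, col2, .keep_all = TRUE) %>%   # first row per pair, all columns kept
  select(col1, col2, height, col3)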

I currently use data.table (although, exceptionally, I would like a data.frame solution here), and I find my solution inelegant:

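library(data.table)

# Return the single row to keep for one (col1, col2) group
func = function(df)
{
    # only NA in both height columns: keep the first row
    if(all(is.na(subset(df, select=c(height1,height2)))))
        return(df[1,])

    # a value in height1: keep that row
    if(any(!is.na(df$height1)))
        return(df[!is.na(df$height1),])

    # otherwise keep the first row with a value in height2
    df[!is.na(df$height2),][1,]
}

setDT(df)   # converts df to a data.table by reference
new.df=df[,func(.SD),by=list(col1,col2)]
new.df = data.frame(new.df)

new.df$height = ifelse(is.na(new.df$height1), new.df$height2, new.df$height1)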

#> new.df
#  col1 col2 height1 height2 col3 height
#1    a    a      32    31.5    2     32
#2    a    b      NA    11.0    5     11
#3    b    b      NA    12.0    6     12
#4    c    a      25      NA    8     25
#5    d    a      NA      NA    9     NA
