
How to remove duplicated rows and columns from a data frame disregarding NAs?

I would like to remove duplicated rows and columns from a data frame, disregarding NAs. All columns of the data frame are numeric vectors of equal length. Here is an example:

> df <- data.frame(a = c(1,2,NA,4,4), b= c(5,6,7,8,8), c= c(5,6,7,8,8), d = c(9,8,7,6,NA), e = c(NA,8,7,6,6))
> df
   a b c  d  e
1  1 5 5  9 NA
2  2 6 6  8  8
3 NA 7 7  7  7
4  4 8 8  6  6
5  4 8 8 NA  6

I would like to get this data frame as a result:

> df_clear
   a b d
1  1 5 9
2  2 6 8
3 NA 7 7
4  4 8 6

I have tried "unique", but without success: only duplicates that contain no NAs were removed.

> df_clear <- 
+   df %>%
+     unique %>%
+     t %>%
+     as.matrix %>%
+     unique %>%
+     t %>%
+     as.data.frame
> df_clear
   a b  d  e
1  1 5  9 NA
2  2 6  8  8
3 NA 7  7  7
4  4 8  6  6
5  4 8 NA  6

"distinct" from dplyr didn't help either. With this approach I even lost the column names, which is a problem.

> df_clear <- 
+   df %>%
+     distinct %>%
+     t %>%
+     as.data.frame %>%
+     distinct %>%
+     t %>%
+     as.data.frame
> df_clear
   V1 V2 V3 V4
V1  1  5  9 NA
V2  2  6  8  8
V3 NA  7  7  7
V4  4  8  6  6
V5  4  8 NA  6
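
The lost names in the output above can be avoided by indexing the original data frame rather than round-tripping through t(). The following is a minimal base-R sketch of that idea only: it removes exact duplicates, so it does not treat NA as a match and does not solve the NA part of the question. The names dup_rows, dup_cols and df_exact are made up for illustration.

```r
# Example data from the question
df <- data.frame(a = c(1, 2, NA, 4, 4), b = c(5, 6, 7, 8, 8),
                 c = c(5, 6, 7, 8, 8), d = c(9, 8, 7, 6, NA),
                 e = c(NA, 8, 7, 6, 6))

# duplicated() on a data frame flags repeated rows;
# duplicated() on the list of columns flags repeated columns.
dup_rows <- duplicated(df)
dup_cols <- duplicated(as.list(df))

# Indexing the original object keeps its row and column names.
df_exact <- df[!dup_rows, !dup_cols, drop = FALSE]
names(df_exact)
```

Column c is dropped as an exact copy of b, while row 5 and column e stay because they differ from their counterparts only in an NA position.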

I wonder whether there is a function that does the job or whether I should write one myself. The real data frame has over 1000 rows and columns.

Thanks a lot for your help!

EDIT

After reading the comments I realized that I had under-specified the original question. Here are some clarifications. For the sake of simplicity I focus on rows only:
  • In case of duplicates, the remaining row should contain as few NAs as possible. E.g. df1 should become df1_clear:

> df1
   a b  d e
1  1 4  7 1
2  3 6 NA 3
3  2 5  8 2
4 NA 6  9 3
> df1_clear
  a b d e
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3
  • The duplicates are not necessarily consecutive.
  • There can be more than one NA in a row.
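
The three requirements above can be sketched in base R as follows. This is one possible reading, not a definitive implementation: it assumes that two rows count as duplicates when they agree in every position where both are non-NA, and it merges such rows so that the survivor has as few NAs as possible. The helper names compatible() and merge_na_duplicates() are hypothetical. The merge is greedy (each row is folded into the first compatible row already kept), so the result can depend on row order when a row is compatible with several others.

```r
# Two rows are "compatible" if they agree wherever both are non-NA.
# Rows with no shared non-NA position also count as compatible here.
compatible <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  all(x[both] == y[both])
}

# Greedy merge: fold each row into the first compatible kept row,
# filling that row's NAs; otherwise keep the row as a new entry.
merge_na_duplicates <- function(DF) {
  m <- as.matrix(DF)
  if (nrow(m) == 0) return(DF)
  kept <- list()
  for (i in seq_len(nrow(m))) {
    row <- m[i, ]
    hit <- 0L
    for (j in seq_along(kept)) {
      if (compatible(kept[[j]], row)) { hit <- j; break }
    }
    if (hit == 0L) {
      kept[[length(kept) + 1L]] <- row
    } else {
      fill <- is.na(kept[[hit]])
      kept[[hit]][fill] <- row[fill]
    }
  }
  out <- as.data.frame(do.call(rbind, kept))
  names(out) <- names(DF)
  out
}

# df1 from the example above
df1 <- data.frame(a = c(1, 3, 2, NA), b = c(4, 6, 5, 6),
                  d = c(7, NA, 8, 9), e = c(1, 3, 2, 3))
res <- merge_na_duplicates(df1)
res
```

Rows come back in order of first appearance rather than sorted, so sorting by a column is a separate step if the layout of df1_clear is wanted. The nested loop is quadratic in the number of rows, which should still be workable for roughly 1000 rows.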

The following is a bit complicated, but it does the job.
The helper f() defined inside fun() is called twice: first to remove the duplicated rows of the original data frame, then those of its transpose.

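fun <- function(DF){
  # f() drops rows that duplicate an earlier row once NAs are filled in
  f <- function(DF1){
    df1 <- DF1
    df1[] <- lapply(df1, function(x){
      # fill each NA with the last non-NA value above it
      y <- zoo::na.locf(x)
      # leading NAs are dropped by default, so fill from below instead
      if(length(y) < length(x)) y <- zoo::na.locf(x, fromLast = TRUE)
      y
    })
    # compare on the filled copy, but return the original rows
    DF1[!duplicated(df1), ]
  }
  df2 <- f(DF)                    # remove duplicated rows
  df2 <- as.data.frame(t(df2))    # columns become rows
  df2 <- t(f(df2))                # remove duplicated columns, transpose back
  as.data.frame(df2)
}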

fun(df)
#   a b d
#1  1 5 9
#2  2 6 8
#3 NA 7 7
#4  4 8 6
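
To make the filling step easier to follow, here is a base-R sketch of "last observation carried forward" applied to the example data. It is an illustration, not the code used above: locf() is a hypothetical stand-in for zoo::na.locf() and, unlike the zoo default, it leaves leading NAs in place instead of dropping them.

```r
# Carry the last non-NA value forward; leading NAs stay NA.
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of latest non-NA so far
  idx[idx == 0] <- NA                      # nothing seen yet
  x[idx]
}

df <- data.frame(a = c(1, 2, NA, 4, 4), b = c(5, 6, 7, 8, 8),
                 c = c(5, 6, 7, 8, 8), d = c(9, 8, 7, 6, NA),
                 e = c(NA, 8, 7, 6, 6))

# After filling, row 5 becomes identical to row 4 and is flagged.
filled <- as.data.frame(lapply(df, locf))
dup <- duplicated(filled)
dup

# The same fill on df1 from the question's EDIT flags nothing.
df1 <- data.frame(a = c(1, 3, 2, NA), b = c(4, 6, 5, 6),
                  d = c(7, NA, 8, 9), e = c(1, 3, 2, 3))
dup1 <- duplicated(as.data.frame(lapply(df1, locf)))
dup1
```

Because each NA is filled from the neighbouring row, this kind of fill appears to work when the matching rows are adjacent, as in df. For df1, rows 2 and 4 are filled from rows 1 and 3 instead of from each other, so they are not recognised as duplicates.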

Based on the above, the same can be done with the function f() from fun() and dplyr pipes. The function f() below is just a copy and paste of the one above.

library(dplyr)


f <- function(DF1){
  df1 <- DF1
  df1[] <- lapply(df1, function(x){
    y <- zoo::na.locf(x)
    if(length(y) < length(x)) y <- zoo::na.locf(x, fromLast = TRUE)
    y
  })
  DF1[!duplicated(df1), ]
}


df %>%
  f() %>% t() %>% as.data.frame() %>%
  f() %>% t() %>% as.data.frame()

#   a b d
#1  1 5 9
#2  2 6 8
#3 NA 7 7
#4  4 8 6
