I would like to remove duplicated columns from a data frame, disregarding NAs. All columns of the data frame are numeric vectors with equal length. Here is an example:
> df <- data.frame(a = c(1,2,NA,4,4), b= c(5,6,7,8,8), c= c(5,6,7,8,8), d = c(9,8,7,6,NA), e = c(NA,8,7,6,6))
> df
   a b c  d  e
1  1 5 5  9 NA
2  2 6 6  8  8
3 NA 7 7  7  7
4  4 8 8  6  6
5  4 8 8 NA  6
I would like to get this data frame as a result:
> df_clear
   a b d
1  1 5 9
2  2 6 8
3 NA 7 7
4  4 8 6
I have tried "unique", but without success: only duplicates without NAs were removed.
> df_clear <-
+ df %>%
+ unique %>%
+ t %>%
+ as.matrix %>%
+ unique %>%
+ t %>%
+ as.data.frame
> df_clear
   a b  d  e
1  1 5  9 NA
2  2 6  8  8
3 NA 7  7  7
4  4 8  6  6
5  4 8 NA  6
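A minimal illustration of why this happens, as far as I understand it: unique() and duplicated() compare rows for exact equality and treat NA as a value of its own. Two rows that differ only because one of them has an NA are therefore not duplicates, while two rows with an NA in the same position are. The data frames `rows` and `both_na` below are made up for this illustration.

```r
# Rows 4 and 5 of df, reduced to three columns: they differ only by the NA in d.
rows <- data.frame(a = c(4, 4), d = c(6, NA), e = c(6, 6))
duplicated(rows)     # FALSE FALSE: the NA makes the rows unequal

# When both rows have the NA in the same place, they do count as duplicates.
both_na <- data.frame(a = c(4, 4), d = c(NA, NA))
duplicated(both_na)  # FALSE TRUE
```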
"distinct" from dplyr didn't help either. I even lost the column names with this approach, which is a problem.
> df_clear <-
+ df %>%
+ distinct %>%
+ t %>%
+ as.data.frame %>%
+ distinct %>%
+ t %>%
+ as.data.frame
> df_clear
   V1 V2 V3 V4
V1  1  5  9 NA
V2  2  6  8  8
V3 NA  7  7  7
V4  4  8  6  6
V5  4  8 NA  6
I wonder whether there is a function that does the job, or whether I should write one myself. The real data frame has over 1000 rows and columns.
Thanks a lot for your help!
EDIT
After reading the comments, I realized that I under-defined the original question. Here is some clarification. For the sake of simplicity, I focus on rows only:
- In case of duplicates, the remaining row should contain as few NAs as possible. E.g., df1 should appear as df1_clear:
> df1
   a b  d e
1  1 4  7 1
2  3 6 NA 3
3  2 5  8 2
4 NA 6  9 3
> df1_clear
  a b d e
1 1 4 7 1
2 2 5 8 2
3 3 6 9 3
The following is a bit complicated, but it does the job. It calls an inner function f() inside fun() twice: first to remove the duplicated rows of the original data frame, then those of its transpose.
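fun <- function(DF){
  # f() drops rows that are duplicates once the NAs have been filled in
  f <- function(DF1){
    df1 <- DF1
    df1[] <- lapply(df1, function(x){
      # fill NAs forward; na.locf() drops leading NAs, shortening the vector
      y <- zoo::na.locf(x)
      # in that case, fill backward instead
      if(length(y) < length(x)) y <- zoo::na.locf(x, fromLast = TRUE)
      y
    })
    # compare the filled copy, but return the original rows
    DF1[!duplicated(df1), ]
  }
  df2 <- f(DF)                  # remove duplicated rows
  df2 <- as.data.frame(t(df2))  # transpose so that columns become rows
  df2 <- t(f(df2))              # remove duplicated columns, transpose back
  as.data.frame(df2)
}

fun(df)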
#   a b d
#1  1 5 9
#2  2 6 8
#3 NA 7 7
#4  4 8 6
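To see what duplicated() actually compares, it may help to look at the filled copy on its own. The sketch below is a variant rather than the exact code above: the helper name `fill_both` is hypothetical, and it fills forward and then backward with na.rm = FALSE so that the vector length never changes. I believe this also covers a column with both leading and trailing NAs, which the length check above does not appear to handle.

```r
df <- data.frame(a = c(1, 2, NA, 4, 4), b = c(5, 6, 7, 8, 8), c = c(5, 6, 7, 8, 8),
                 d = c(9, 8, 7, 6, NA), e = c(NA, 8, 7, 6, 6))

# Hypothetical helper: carry the last observation forward, then the next one backward.
# na.rm = FALSE keeps leading/trailing NAs in place instead of dropping them.
fill_both <- function(x) {
  y <- zoo::na.locf(x, na.rm = FALSE)
  zoo::na.locf(y, fromLast = TRUE, na.rm = FALSE)
}

filled <- df
filled[] <- lapply(df, fill_both)

filled              # row 5 now equals row 4
duplicated(filled)  # FALSE FALSE FALSE FALSE  TRUE
```

Note that filling is a heuristic: whether two rows end up equal depends on the neighbouring values that get carried into the gaps.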
Based on the above, the same can be done with the function f() from fun() and dplyr pipes. The function f() below is just a copy and paste of the one above.
library(dplyr)

f <- function(DF1){
  df1 <- DF1
  df1[] <- lapply(df1, function(x){
    y <- zoo::na.locf(x)
    if(length(y) < length(x)) y <- zoo::na.locf(x, fromLast = TRUE)
    y
  })
  DF1[!duplicated(df1), ]
}

df %>%
  f() %>% t() %>% as.data.frame() %>%
  f() %>% t() %>% as.data.frame()
#   a b d
#1  1 5 9
#2  2 6 8
#3 NA 7 7
#4  4 8 6
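The fill-based approach keeps the first row of each duplicate group as it is, so it does not merge values the way the EDIT in the question asks. As far as I can tell, it also would not flag rows 2 and 4 of df1, because the filled-in values differ there. A different technique, sketched here in base R under stated assumptions, is to treat NA as a wildcard and merge matching rows greedily. The names `na_match` and `merge_na_duplicates` are hypothetical. The result depends on row order, since "equal up to NAs" is not transitive, and an all-NA row matches anything.

```r
# Hypothetical helper: TRUE when x and y agree wherever both are non-NA.
na_match <- function(x, y) all(x == y, na.rm = TRUE)

# Hypothetical helper: greedily merge rows that are equal up to NAs.
merge_na_duplicates <- function(DF) {
  kept <- list()
  for (i in seq_len(nrow(DF))) {
    row <- unlist(DF[i, , drop = FALSE])
    hit <- NA_character_
    # look for an already kept row that matches the current one
    for (nm in names(kept)) {
      if (na_match(kept[[nm]], row)) { hit <- nm; break }
    }
    if (is.na(hit)) {
      # no match: keep the row under its original row name
      kept[[rownames(DF)[i]]] <- row
    } else {
      # match: fill the NAs of the kept row from the current row
      gaps <- is.na(kept[[hit]])
      kept[[hit]][gaps] <- row[gaps]
    }
  }
  as.data.frame(do.call(rbind, kept))
}

# Rows only, as in the EDIT: rows 2 and 4 of df1 are merged into one.
df1 <- data.frame(a = c(1, 3, 2, NA), b = c(4, 6, 5, 6),
                  d = c(7, NA, 8, 9), e = c(1, 3, 2, 3))
df1_merged <- merge_na_duplicates(df1)

# Rows first, then columns via the transpose, as in fun() above.
df <- data.frame(a = c(1, 2, NA, 4, 4), b = c(5, 6, 7, 8, 8), c = c(5, 6, 7, 8, 8),
                 d = c(9, 8, 7, 6, NA), e = c(NA, 8, 7, 6, 6))
rows_done <- merge_na_duplicates(df)
df_clear <- as.data.frame(t(merge_na_duplicates(as.data.frame(t(rows_done)))))
```

Unlike df1_clear in the question, the merged rows keep the order in which they first appear rather than being sorted. The double loop is quadratic in the number of rows, so on a 1000 x 1000 data frame it may be slow.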