
Remove reoccurring data from data.table

I have a data.table like this:

data.table(a=rep(c("xx", "yy"), each=4), b=rep(c("zz", "nn"), each=2), vals=10:17)

    a  b vals
1: xx zz   10
2: xx zz   11
3: xx nn   12
4: xx nn   13
5: yy zz   14
6: yy zz   15
7: yy nn   16
8: yy nn   17

What I want is this, since it looks better in a table when exported to Excel and then to Word (I know, never use Excel...):

    a  b vals
1: xx zz   10
2: NA NA   11
3: NA nn   12
4: NA NA   13
5: yy zz   14
6: NA NA   15
7: NA nn   16
8: NA NA   17

EDIT: I forgot to say that a recurring numeric value should not be changed to NA; only character columns should be affected.

Using rleid from data.table, we can create a function:

library(data.table)

replace_duplicated <- function(x) {
  # keep the first value of each run of consecutive equal values, set the rest to NA
  replace(x, duplicated(rleid(x)), NA)
}
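To see why the function combines rleid with duplicated, here is a minimal sketch on a small vector. The vector `x` and the helper names `run_id`, `by_run` and `by_value` are made up for illustration and are not part of the original answer.

```r
library(data.table)

# illustrative vector: "zz" appears in two separate runs
x <- c("zz", "zz", "nn", "nn", "zz", "zz")

# rleid() gives each run of consecutive equal values its own id
run_id <- rleid(x)
run_id
# 1 1 2 2 3 3

# duplicated() on the run ids flags every element after the first in its run
by_run <- duplicated(run_id)
by_run
# FALSE TRUE FALSE TRUE FALSE TRUE

# duplicated() on the raw values would also flag the start of the second "zz" run
by_value <- duplicated(x)
by_value
# FALSE TRUE FALSE TRUE TRUE TRUE
```

The difference shows up in position 5: the run-based version keeps the first "zz" of the later run, which is what the desired output in the question requires for column b.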

and now apply it to the selected columns (thanks to @markus):

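# df is the data.table from the question; select its character columns
cols = names(df)[sapply(df, is.character)]
# update those columns by reference, leaving numeric columns untouched
df[,(cols) := lapply(.SD, replace_duplicated ), .SDcols = cols]
df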

#      a    b vals
#1:   xx   zz   10
#2: <NA> <NA>   11
#3: <NA>   nn   12
#4: <NA> <NA>   13
#5:   yy   zz   14
#6: <NA> <NA>   15
#7: <NA>   nn   16
#8: <NA> <NA>   17

In dplyr we can use mutate_if:

library(dplyr)
df %>% mutate_if(is.character, replace_duplicated)

or mutate_at:

df %>% mutate_at(cols, replace_duplicated)
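In dplyr 1.0.0 and later, the scoped verbs mutate_if and mutate_at are superseded by across(). A minimal sketch of the same idea with across(), assuming dplyr >= 1.0.0 is installed; the data frame is rebuilt here so the snippet runs on its own, and `out` is just an illustrative name:

```r
library(data.table)  # only needed for rleid()
library(dplyr)

replace_duplicated <- function(x) {
  replace(x, duplicated(rleid(x)), NA)
}

# same data as in the question, with column b spelled out to full length
df <- data.frame(
  a = rep(c("xx", "yy"), each = 4),
  b = rep(rep(c("zz", "nn"), each = 2), times = 2),
  vals = 10:17,
  stringsAsFactors = FALSE
)

# apply the function to every character column, leave the others alone
out <- df %>% mutate(across(where(is.character), replace_duplicated))
out
```

Unlike the data.table `:=` version, this returns a modified copy rather than updating `df` in place.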

We can use set from data.table to update by reference:

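# dt is the data.table from the question; the first two columns are the character ones
nm1 <- names(dt)[1:2]
# for each column, set every row after the first in a run to NA, in place
for(j in nm1) set(dt, i = which(duplicated(rleid(dt[[j]]))), j = j, value = NA)
dt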
#      a    b vals
#1:   xx   zz   10
#2: <NA> <NA>   11
#3: <NA>   nn   12
#4: <NA> <NA>   13
#5:   yy   zz   14
#6: <NA> <NA>   15
#7: <NA>   nn   16
#8: <NA> <NA>   17

Adding another method using shift, plus some timings for reference:

set.seed(0L)
sz <- 1e7
DT <- data.table(a=sample(LETTERS, sz, TRUE), b=sample(LETTERS, sz, TRUE))
#DT <- data.table(a=rep(c("xx", "yy"), each=4), b=rep(c("zz", "nn"), each=2), vals=10:17)
DT1 <- copy(DT)
DT2 <- copy(DT)

cols <- c("a","b")


mtd0 <- function() {
    DT[,(cols) := lapply(.SD, function(x) 
        replace(x, duplicated(rleid(x)), NA_character_)) , .SDcols = cols]
}

mtd1 <- function() {
    for(j in cols) 
        set(DT1, i=DT1[, which(get(j)==shift(get(j), 1L))], j=j, value=NA_character_)
}

mtd2 <- function() {
    for(j in cols) 
        set(DT2, i=which(duplicated(rleid(DT2[[j]]))), j=j, value=NA_character_)
}

library(microbenchmark)
microbenchmark(mtd0(), mtd1(), mtd2(), times=3L)

identical(DT, DT1)
#[1] TRUE

identical(DT1, DT2)
#[1] TRUE

Timings:

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
 mtd0() 1372.4244 1405.1756 1448.8020 1437.9269 1486.9909 1536.0549     3   b
 mtd1()  280.7695  281.2639  305.5433  281.7583  317.9303  354.1022     3  a 
 mtd2() 1200.5236 1224.5174 1339.0146 1248.5112 1408.2601 1568.0090     3   b
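The shift-based selection used in mtd1 can be traced on a small vector. This is only an illustrative sketch; the vector `x` and the names `prev`, `same_as_prev` and `idx` are not from the original answer.

```r
library(data.table)

x <- c("zz", "zz", "nn", "nn", "zz")

# shift() lags the vector by one position, padding the front with NA
prev <- shift(x, 1L)
prev
# NA "zz" "zz" "nn" "nn"

# TRUE where an element equals the one directly before it; the first comparison is NA
same_as_prev <- x == prev
same_as_prev
# NA TRUE FALSE TRUE FALSE

# which() drops the NA and returns the row indices to overwrite
idx <- which(same_as_prev)
idx
# 2 4
```

One difference worth keeping in mind: a comparison involving NA yields NA, so this approach never flags values that are already NA, whereas rleid treats consecutive NAs as a single run. On data without missing values the two approaches select the same rows.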

You can do this with a quick loop:

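df <- data.frame(a=rep(c("xx", "yy"), each=4), b=rep(c("zz", "nn"), each=2), vals=10:17,
                 stringsAsFactors=FALSE)

for(i in 1:2){
  x <- df[[i]]
  # blank a value only when it repeats the row directly above it;
  # duplicated(x) alone would also blank the second "zz" and "nn" runs in b
  df[[i]][c(FALSE, x[-1] == x[-length(x)])] <- NA
}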
