How to remove specific duplicates in R

Question

I have the following data:

> head(bigdata)
      type                               text
1  neutral              The week in 32 photos
2  neutral Look at me! 22 selfies of the week
3  neutral       Inside rebel tunnels in Homs
4  neutral                Voices from Ukraine
5  neutral  Water dries up ahead of World Cup
6 positive     Who's your hero? Nominate them

My duplicates look like this ($type is empty):

7              Who's your hero? Nominate them
8           Water dries up ahead of World Cup

I remove the duplicates like this:

bigdata <- bigdata[!duplicated(bigdata$text),]

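The problem is that it removes the wrong duplicate. I want to remove the row where $type is empty, not the one where $type has a value.

How can I remove specific duplicates in R?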

Answer 1

So here is a solution that does not use duplicated(...).

# creates an example - you have this already...
set.seed(1)   # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
                      text=sample(letters[1:10],10),
                      stringsAsFactors=FALSE)
# add some duplicates with an empty type
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5]))

# you start here...
# sort descending so the empty type comes last within each text
newdf  <- with(bigdata,bigdata[order(text,type,decreasing=TRUE),])
# keep the first row per text; [2:3] drops the extra grouping column
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]

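This sorts bigdata by text and type in decreasing order, so that for a given text, the empty type appears after any non-empty type. We then extract only the first row for each text.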
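As a minimal sketch of the same sort-then-keep-first idea in base R, the example below uses a small made-up data frame (df, sorted and result are hypothetical names, not from the question). It sorts on whether the type is empty rather than on the type itself, and, unlike the answer above, it does use duplicated() for the final step.

```r
# Toy data (hypothetical): the empty-type copy of "b" comes first
df <- data.frame(type = c("", "neutral", "positive", ""),
                 text = c("b", "a", "b", "a"),
                 stringsAsFactors = FALSE)

# Within each text, put rows with a non-empty type ahead of empty ones
# (FALSE sorts before TRUE)
sorted <- df[order(df$text, df$type == ""), ]

# Keep only the first row of each text
result <- sorted[!duplicated(sorted$text), ]
result   # keeps (neutral, a) and (positive, b)
```

Because the sort key is the logical df$type == "", the outcome does not depend on how the empty string collates against other strings.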

If your data really is "big", a data.table solution will probably be faster.

library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]

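This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for each text. .N is a special variable that holds the number of elements in the group.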
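To make the role of .N concrete, here is a small sketch on made-up data (toy and res are hypothetical names). It assumes the data.table package is installed and reports both the group size and the last type in each group.

```r
library(data.table)

# Toy data (hypothetical): each text appears once with an empty type
toy <- data.table(type = c("", "neutral", "positive", ""),
                  text = c("b", "a", "b", "a"))

# Sort by text, then type, in increasing order; "" sorts first
setkey(toy, text, type)

# .N is the number of rows in the current group, so type[.N] is the last one
res <- toy[, list(n = .N, last_type = type[.N]), by = text]
res   # a: n = 2, last_type = "neutral"; b: n = 2, last_type = "positive"
```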

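Note that the current development version implements a setorder() function, which sorts a data.table by reference and can sort in both increasing and decreasing order. So, with the development version, it would be: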

require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]

Answer 2

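# TRUE where type is empty
foo = function(x){
    x == ""
}

# flag every row whose text also occurs in another row
dup <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast=TRUE)
# drop rows that are duplicated and have an empty type
bigdata <- bigdata[!(dup & foo(bigdata$type)),]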

Answer 3

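You should keep rows that are either not duplicated or not missing a type value. The duplicated function flags only the second and later occurrences of each value (try duplicated(c(1, 1, 2))), so we need to combine that result with the result of duplicated called with fromLast=TRUE.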

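# a row counts as duplicated if its text occurs in any other row
# type is treated as missing when it is NA or "" (the question's empty type)
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !(is.na(bigdata$type) | bigdata$type == ""),]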
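To see why both calls are needed, this short sketch runs duplicated on the vector c(1, 1, 2) in each direction and combines the results (first_pass, last_pass and any_dup are illustrative names):

```r
x <- c(1, 1, 2)

first_pass <- duplicated(x)                   # FALSE  TRUE FALSE
last_pass  <- duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE

# A value is duplicated anywhere if either pass flags it
any_dup <- first_pass | last_pass             #  TRUE  TRUE FALSE
```

Either call alone leaves one copy of each repeated value unflagged; only the combination marks every row that has a twin.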


3 answers

Answer 1
2 2014-06-13 18:35:18

Answer 2
1 2014-06-13 17:40:24

Answer 3
1 2014-06-13 17:43:30
