How to remove specific duplicates in R

I have the following data:

> head(bigdata)
      type                               text
1  neutral              The week in 32 photos
2  neutral Look at me! 22 selfies of the week
3  neutral       Inside rebel tunnels in Homs
4  neutral                Voices from Ukraine
5  neutral  Water dries up ahead of World Cup
6 positive     Who's your hero? Nominate them

My duplicates will look like this (with an empty $type):

7              Who's your hero? Nominate them
8           Water dries up ahead of World Cup

I remove duplicates like this:

bigdata <- bigdata[!duplicated(bigdata$text),]

The problem is that it removes the wrong duplicate. I want to remove the row where $type is empty, not the one that has a value for $type.
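To make the behaviour concrete, here is a minimal sketch on made-up data (the `toy` data frame is hypothetical, not the real `bigdata`). It assumes the empty type is stored as an empty string and that the unlabelled copy happens to come first:

```r
# Hypothetical toy data: the empty-type copy of "a" comes before the labelled one
toy <- data.frame(type = c("", "positive", "neutral"),
                  text = c("a", "a", "b"),
                  stringsAsFactors = FALSE)

# duplicated() flags only the second and later occurrences of a value
dup_flags <- duplicated(toy$text)
print(dup_flags)

# so the first row (empty type) survives and the labelled row is dropped
kept <- toy[!dup_flags, ]
print(kept)
```

Which copy is dropped depends only on row order, not on whether $type is filled in.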

How can I remove a specific duplicate in R?

So here's a solution that does not use duplicated(...).

# creates an example - you have this already...
set.seed(1)   # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
                      text=sample(letters[1:10],10),
                      stringsAsFactors=FALSE)
# add some duplicates with an empty type
bigdata <- rbind(bigdata,
                 data.frame(type="",text=bigdata$text[1:5],stringsAsFactors=FALSE))

# you start here...
# sort by text, then type, both decreasing: within a text, "" sorts last
newdf  <- with(bigdata,bigdata[order(text,type,decreasing=TRUE),])
# take the first row of each text group; [2:3] keeps the type and text columns
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]

This sorts bigdata by text and type in decreasing order, so that for a given text the empty type appears after any non-empty type. Then we extract only the first row for every text.
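The same sort-then-take-first idea can also be sketched with order() and duplicated() alone. This is a variant, not the code above: it sorts on the logical key `type == ""` rather than on type in decreasing order. The `toy` data is made up, and empty types are assumed to be empty strings:

```r
# Hypothetical toy data; "a" appears once without a type and once with one
toy <- data.frame(type = c("", "positive", "neutral", ""),
                  text = c("a", "a", "b", "c"),
                  stringsAsFactors = FALSE)

# sort by text, putting non-empty types (FALSE) before empty ones (TRUE)
ord <- order(toy$text, toy$type == "")
sorted <- toy[ord, ]

# after sorting, the first row of each text is the one worth keeping
result_toy <- sorted[!duplicated(sorted$text), ]
print(result_toy)
```

A text that only ever appears with an empty type ("c" here) is kept, since it has no labelled copy to prefer.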


If your data really is "big", then a data.table solution will probably be faster.

library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]

This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable that holds the number of rows in that group.
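A small sketch of what .N and type[.N] return per group, on made-up data. It assumes the data.table package is installed; the `toy` table and the extra `n` column are only for illustration:

```r
library(data.table)  # assumes the data.table package is installed

# Hypothetical toy data; empty types are empty strings
toy <- data.table(type = c("", "positive", "neutral", ""),
                  text = c("a", "a", "b", "c"))

# sort rows by text, then type, both increasing ("" sorts before any label)
setkey(toy, text, type)

# n is the group size; type[.N] is the last type within each text
last_type <- toy[, list(n = .N, type = type[.N]), by = text]
print(last_type)
```

Because the empty string sorts first within each text, the last element of the group is a labelled type whenever one exists.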


Note that the current development version implements a function setorder(), which orders a data.table by reference and can order in both increasing and decreasing order. So, using the devel version, it'd be:

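require(data.table) # 1.9.3
setorder(DT, text, -type)   # text increasing, type decreasing
DT[, list(type = type[1L]), by = text]

Alternatively, drop only the rows whose text is duplicated and whose type is empty:

# TRUE where the type is the empty string
foo <- function(x){
    x == ""
}

# flag every copy of a repeated text, not just the second and later ones
dup <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast=TRUE)
# drop rows that are both duplicated and missing a type
bigdata <- bigdata[!(dup & foo(bigdata$type)),]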

You should keep rows that are either not duplicated or not missing a type value. The duplicated function only flags the second and later occurrences of each value (check out duplicated(c(1, 1, 2))), so we need to combine that result with the result of duplicated called with fromLast=TRUE.

# keep a row if its text is unique, or if its type is neither NA nor ""
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !(is.na(bigdata$type) | bigdata$type == ""),]
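As a quick check of the keep rule, here is a sketch on made-up data. It assumes the empty type is stored as an empty string (if it were NA, the test would use is.na() instead); `toy`, `is_dup`, `has_type` and `cleaned` are illustrative names:

```r
# Hypothetical toy data: "a" has a labelled copy and an unlabelled copy
toy <- data.frame(type = c("positive", "neutral", "", ""),
                  text = c("a", "b", "a", "c"),
                  stringsAsFactors = FALSE)

# TRUE for every row whose text appears more than once (all copies)
is_dup <- duplicated(toy$text) | duplicated(toy$text, fromLast = TRUE)
# TRUE where a type is present
has_type <- toy$type != ""

# keep rows that are unique, or that carry a type
cleaned <- toy[!is_dup | has_type, ]
print(cleaned)
```

Unlike the sort-based approaches, this keeps the original row order and does not depend on where the unlabelled copy sits.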
