How to remove specific duplicates in R
I have the following data:
> head(bigdata)
type text
1 neutral The week in 32 photos
2 neutral Look at me! 22 selfies of the week
3 neutral Inside rebel tunnels in Homs
4 neutral Voices from Ukraine
5 neutral Water dries up ahead of World Cup
6 positive Who's your hero? Nominate them
My duplicates will look like this (with an empty $type):
7 Who's your hero? Nominate them
8 Water dries up ahead of World Cup
I remove duplicates like this:
bigdata <- bigdata[!duplicated(bigdata$text),]
The problem is that it removes the wrong duplicate. I want to remove the one where $type is empty, not the one that has a value for $type.
How can I remove a specific duplicate in R?
So here's a solution that does not use duplicated(...).
# creates an example - you have this already...
set.seed(1) # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
                      text=sample(letters[1:10],10),
                      stringsAsFactors=FALSE)
# add some duplicates
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5]))
# you start here...
newdf <- with(bigdata,bigdata[order(text,type,decreasing=TRUE),])
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]
This sorts bigdata by text and type, in decreasing order, so that for a given text, the empty type will appear after any non-empty type. Then we extract only the first row for every text, which is the row with a non-empty type whenever one exists.
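The same sort-then-keep-first idea can also be sketched with base R alone by combining order() with duplicated(). This is only an illustration under stated assumptions: the small data frame below is made up, and it assumes the empty type is the empty string "" rather than NA.

```r
# Assumption: a tiny made-up data frame standing in for bigdata
bigdata <- data.frame(type = c("neutral", "positive", "", ""),
                      text = c("a", "b", "b", "a"),
                      stringsAsFactors = FALSE)

# Sort in decreasing order so that, within each text, "" comes last
sorted <- bigdata[order(bigdata$text, bigdata$type, decreasing = TRUE), ]

# duplicated() flags the second and later rows of each text,
# which after sorting are the rows with an empty type
result <- sorted[!duplicated(sorted$text), ]
print(result)
```

Because the sort runs first, it no longer matters in which order the empty-type rows appeared in the original data.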
If your data really is "big", then a data.table solution will probably be faster.
library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]
This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable that holds the number of rows in that group.
Note that the current development version (1.9.3 at the time of writing) implements a function setorder(), which orders a data.table by reference and can sort in both increasing and decreasing order. So, using the devel version, it'd be:
require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]
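# TRUE where type is the empty string
foo = function(x){
  x == ""
}
# flag every row whose text occurs more than once, including the first occurrence
dup <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast=TRUE)
# drop only the duplicated rows whose type is empty
bigdata <- bigdata[!(dup & foo(bigdata$type)),]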
You should keep rows that are either not duplicated or not missing a type value. The duplicated function only flags the second and later duplicates of each value (check out duplicated(c(1, 1, 2))), so we need to use both that value and the value of duplicated called with fromLast=TRUE.
# a type counts as missing here if it is NA or the empty string
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !(bigdata$type %in% c(NA, "")),]
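To make the role of fromLast=TRUE concrete, here is a small sketch on the vector mentioned above. The variable names are only illustrative.

```r
x <- c(1, 1, 2)

later   <- duplicated(x)                    # flags only the second 1
earlier <- duplicated(x, fromLast = TRUE)   # flags only the first 1

# Combining both flags every member of a duplicated group, but not the 2
in_dup_group <- later | earlier
print(in_dup_group)
```

Only the combined vector marks both copies of the repeated value, which is what lets the filter above choose which copy to drop based on type.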