
How to remove specific duplicates in R

I have the following data:

> head(bigdata)
      type                               text
1  neutral              The week in 32 photos
2  neutral Look at me! 22 selfies of the week
3  neutral       Inside rebel tunnels in Homs
4  neutral                Voices from Ukraine
5  neutral  Water dries up ahead of World Cup
6 positive     Who's your hero? Nominate them

My duplicates will look like this (with an empty $type):

7              Who's your hero? Nominate them
8           Water dries up ahead of World Cup

I remove duplicates like this:

bigdata <- bigdata[!duplicated(bigdata$text),]

The problem is that this removes the wrong duplicate. I want to remove the row where $type is empty, not the one that has a value for $type.

How can I remove a specific duplicate in R?

So here's a solution that does not use duplicated(...).

# creates an example - you have this already...
set.seed(1)   # for reproducible example
bigdata <- data.frame(type=rep(c("positive","negative"),5),
                      text=sample(letters[1:10],10),
                      stringsAsFactors=FALSE)
# add some duplicates with an empty type
bigdata <- rbind(bigdata,data.frame(type="",text=bigdata$text[1:5],
                                    stringsAsFactors=FALSE))

# you start here...
# sort by text, then type, both decreasing, so "" comes last within each text
newdf  <- with(bigdata,bigdata[order(text,type,decreasing=TRUE),])
# keep the first row of each text group; [2:3] drops the extra grouping column
result <- aggregate(newdf,by=list(text=newdf$text),head,1)[2:3]

This sorts bigdata by text and type, in decreasing order, so that for a given text, the row with the empty type appears after any row with a non-empty type. Then we extract only the first row for every text, which is the one with a non-empty type whenever such a row exists.
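The sort-then-keep-first idea can also be sketched on a tiny data frame using only order() and duplicated() in place of aggregate(). This is a minimal illustration, not the answer's method: the toy data and the names toy, sorted and kept are assumptions made up for the example.

```r
# Hypothetical toy data mirroring the question: two rows have an empty type
toy <- data.frame(
  type = c("neutral", "positive", "", ""),
  text = c("Water dries up", "Who's your hero?",
           "Who's your hero?", "Water dries up"),
  stringsAsFactors = FALSE
)

# Sort by text, and within each text put non-empty types first
# (FALSE sorts before TRUE, so rows with type == "" go last)
sorted <- toy[order(toy$text, toy$type == ""), ]

# Keep only the first row of each text
kept <- sorted[!duplicated(sorted$text), ]
print(kept)
```

Sorting on the logical toy$type == "" rather than on type itself avoids depending on how the locale orders the empty string.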


If your data really is "big", then a data.table solution will probably be faster.

library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]

This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable that holds the number of rows in the current group.
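To see what .N evaluates to, here is a small sketch on a hypothetical three-row table. It assumes the data.table package is installed; the names toyDT, counts and last_type are illustrative only.

```r
library(data.table)

# Hypothetical toy table: "a" appears twice, once with an empty type
toyDT <- data.table(text = c("a", "a", "b"),
                    type = c("positive", "", "neutral"))

# Sort by text, then type, in increasing order ("" sorts first)
setkey(toyDT, text, type)

# .N on its own gives the number of rows in each group
counts <- toyDT[, .N, by = text]

# type[.N] picks the last row of each group, i.e. the non-empty type
last_type <- toyDT[, list(type = type[.N]), by = text]
print(counts)
print(last_type)
```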


Note that the current development version of data.table (1.9.3 at the time of writing) implements a function setorder(), which reorders a data.table by reference and can sort in both increasing and decreasing order. So, using the devel version, it'd be:

require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]
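
# TRUE for rows whose type is empty
foo <- function(x) {
    x == ""
}

# drop rows that repeat an earlier text and have an empty type
# (assumes the empty-type copy comes after the typed one, as in the question)
bigdata <- bigdata[!(duplicated(bigdata$text) & foo(bigdata$type)), ]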

You should keep rows that are either not duplicated or not missing a type value. The duplicated function flags only the second and later occurrences of each value (check out duplicated(c(1, 1, 2))), so we need to combine that result with the result of duplicated called with fromLast=TRUE.

# the type test covers both NA and "", since the question's data uses empty strings
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast=TRUE)) |
                   !(is.na(bigdata$type) | bigdata$type == ""),]
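The behaviour of duplicated() described above can be checked on the small vector mentioned in the text. The variable names fwd, bwd and any_dup are only for illustration.

```r
x <- c(1, 1, 2)

# Flags the second and later occurrences only
fwd <- duplicated(x)                   # FALSE  TRUE FALSE

# Scanning from the end flags the earlier occurrences instead
bwd <- duplicated(x, fromLast = TRUE)  #  TRUE FALSE FALSE

# Combining both marks every element that has a duplicate anywhere
any_dup <- fwd | bwd                   #  TRUE  TRUE FALSE
print(any_dup)
```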

