
R - Remove combinations of variables that occur more than once in a data.frame

Say I have a data frame, df, with three columns:

  colours   individual value
1   white individual 1   0.4
2   white individual 1   0.7
3   black individual 2   1.1
4   black individual 3   0.5

Sometimes the same person shows up multiple times for the same colour but with different values. I would like to write some code that deletes every row where this happens.

***EDIT: There are many more rows than 4 - millions - I don't think the current solutions work.

I would like to count how many times the string I am currently on in my for loop comes up, and then delete those rows from the data.frame. So in the example above, I would like to get rid of individual 1, leaving the other two rows in df.

So far my approach was this:

  1. Get a list of all the colours

  2. Get a list of all the individuals

  3. Write two for loops.

    colours <- unique(df$colours)
    ind <- unique(df$individual)
    for (i in ind) {
      for (c in colours) {
        # something here. Probably an if, asking whether the person I'm on in the loop
        # is found with the colour I am on more than once; if so, get rid of them
      }
    }

My expected output is this:

colours   individual value
  black individual 2   1.1
  black individual 3   0.5

Source data

df <- data.frame(colours = c("white", "white", "black", "black"),
                 individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
                 value = c(0.4, 0.7, 1.1, 0.5))

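You could try anti_join() from the dplyr package, joining on both key columns so that only the repeated colour/individual combinations are dropped:

library(dplyr)
# rows whose (colours, individual) pair repeats an earlier row
dups <- df[duplicated(df[1:2]), ]
# drop every row of df that matches one of those pairs
anti_join(df, dups, by = c("colours", "individual"))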
#  colours   individual value
#1   black individual 3   0.5
#2   black individual 2   1.1

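Here is another option using data.table:

library(data.table)
# setDT() converts df to a data.table by reference;
# .N is the group size, so only single-row groups are returned
setDT(df)[, if (.N == 1) .SD, by = .(colours, individual)]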
#   colours   individual value
#1:   black individual 2   1.1
#2:   black individual 3   0.5

This should do it. I created a sample dataset and added an index column to show that only the first occurrence of each colour/user combination is kept. Note that this keeps one row per combination rather than removing repeated combinations entirely, and it works only if your row names are the actual row numbers.

## Data preparation
colours <- sample(c("red", "blue", "green", "yellow"), size = 50, replace = TRUE)
users <- sample(1:10, size = 50, replace = TRUE)
df <- data.frame(colours, users)
df$value <- runif(50)
df$index <- 1:50

## Keep only the first occurrence of each colour/user pair
res <- unique(df[, 1:2])
# unique() keeps the original row names, so they index back into df
res$value <- df$value[as.integer(rownames(res))]
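
If one row per pair is all that is needed, a shorter sketch is to subset with !duplicated(), which keeps every column and does not depend on the row names. It is shown here on the question's four-row sample; sample_df and first_only are names made up for this illustration. Like the code above, it keeps the first row of a repeated pair rather than dropping the pair altogether.

```r
# Illustrative copy of the question's sample data (hypothetical name sample_df)
sample_df <- data.frame(
  colours    = c("white", "white", "black", "black"),
  individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
  value      = c(0.4, 0.7, 1.1, 0.5)
)

# duplicated() flags the second and later copies of each pair,
# so negating it keeps the first row of every pair with all its columns
first_only <- sample_df[!duplicated(sample_df[, c("colours", "individual")]), ]
print(first_only)
```

Here the second "white / individual 1" row is dropped while the first one stays, which is not the same as the result asked for in the question.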

A straightforward dplyr approach would be to group as desired and filter for groups with fewer than 2 observations:

library(dplyr)
df %>%
  group_by(colours, individual) %>%
  filter(n() < 2)

Source: local data frame [2 x 3]
Groups: colours, individual [2]

  colours   individual value
   (fctr)       (fctr) (dbl)
1   black individual 2   1.1
2   black individual 3   0.5
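
The same group-and-count idea can also be sketched in base R with ave(), for cases where dplyr is not available. This is an illustration on the question's sample data, not a benchmarked solution; sample_df, group_size and singletons are hypothetical names.

```r
# Illustrative copy of the question's sample data (hypothetical name sample_df)
sample_df <- data.frame(
  colours    = c("white", "white", "black", "black"),
  individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
  value      = c(0.4, 0.7, 1.1, 0.5)
)

# ave() applies length() within each colours/individual group and
# returns the group size aligned with the original rows
group_size <- ave(seq_len(nrow(sample_df)),
                  sample_df$colours, sample_df$individual,
                  FUN = length)

# keep only the rows whose combination occurs exactly once
singletons <- sample_df[group_size == 1, ]
print(singletons)
```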

On the basis of some suggestions in the comments, this answer worked best:

df[!(duplicated(df[,1:2]) | duplicated(df[,1:2], fromLast = TRUE)), ]

This is slightly different from the suggestions in the comments: it specifies the columns to compare rather than the rows, and so achieves the result I wanted from the question (removing the rows where the individual and colour combination is duplicated). It is also more useful in general, because the real data has millions of rows rather than the four in the question's example.
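
To see why both duplicated() calls are needed, the following sketch breaks the one-liner into steps on the question's sample data. The names sample_df, key, dup_forward, dup_backward and result are made up for the illustration.

```r
# Illustrative copy of the question's sample data (hypothetical name sample_df)
sample_df <- data.frame(
  colours    = c("white", "white", "black", "black"),
  individual = c("individual 1", "individual 1", "individual 2", "individual 3"),
  value      = c(0.4, 0.7, 1.1, 0.5)
)

key <- sample_df[, c("colours", "individual")]

# scanning top-down flags every copy of a pair except the first
dup_forward <- duplicated(key)
# scanning bottom-up flags every copy of a pair except the last
dup_backward <- duplicated(key, fromLast = TRUE)

# a row belongs to a repeated pair if either scan flags it
repeated <- dup_forward | dup_backward
result <- sample_df[!repeated, ]
print(result)
```

On its own, duplicated(key) would leave the first "white / individual 1" row in place; combining it with the fromLast = TRUE scan removes both copies.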
