
Compare multiple boolean columns in R

A little puzzle. As always, I think I'm missing something. I have a data frame like this:

id creator att1 att2 att3 att... att500
a1 person1 TRUE TRUE FALSE ...
a2 person2 TRUE TRUE TRUE ...
a3 person1 TRUE FALSE FALSE ...
a4 person1 TRUE TRUE FALSE ...
a5 person2 TRUE TRUE FALSE ...

And so on. I want to count the occurrences of the same attribute combination (about 500 boolean values) by different creators, and do this for each row, adding the count to the respective row. In the example above I therefore want count = 1 for the first row (a1), because in a5 a different person has made the very same attribute combination. Notice that a4 does not count: it is the same combination, but by the same person. Think of self-mixed cocktails and how often they are mixed by different persons independently of each other. Row a2 shall have a count of 0, and so shall a3 (no matching attribute combination); a4 has count = 1 because of a5. a5 has a count of 1 too. However, if other persons mix the same cocktail several times, this shall be counted. I don't want to simply remove duplicates.

My plan is hence to loop through the rows, exclude all cocktails by the same creator of the row, take the attribute combination and compare it with all the rows in the temporary dataset:

for (row in 1:nrow(data)){ 
# for each row in data
   creator <- row$creator 
# get creator
   attr_tupel <- row[1, 3:500] 
#return the attribute combination of the row
   data[row]$count <- nrow(data[data$creator != creator & data[3:500] == attr_tupel]) 
# into the column $count of the current row write the number of observations that are not from the same creator and match the exact tupel of my ~500 Attributes (equal cocktails by different persons)
}

Unfortunately I can't compare the tuple of the reference row with the other rows, because R reports: '==' only defined for equally-sized data frames

And now I'm stuck. I could of course write out each column separately, but that would take ages. Do I need to cast the data frame into a list or vector or //insert something here//? (Vector and list don't work.) Is it possible at all to compare one row of values with many other rows for equality? I don't think duplicating the row would be the solution; besides, R usually just recycles the entries when it runs out of values to compare. Why not here?
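On the direct question of comparing one row against many: `==` between two data frames is only defined when both have the same dimensions, so a one-row data frame is not recycled against the others. One possible way around this, sketched here under the assumption that all attribute columns are logical, is to convert them to a matrix and compare column by column. The data frame `toy` and the other names below are made up for illustration and mirror the five-row example from the question.

```r
# Toy data mirroring the example table above (all names are illustrative).
toy <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  creator = c("person1", "person2", "person1", "person1", "person2"),
  att1    = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  att3    = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Logical matrix holding only the attribute columns.
m <- as.matrix(toy[, 3:5])

# Reference tuple: the attribute values of row 1.
ref <- m[1, ]

# sweep() applies `==` between every row of m and ref, column by column;
# a row matches when all of its columns are equal to the reference.
same_attrs <- rowSums(sweep(m, 2, ref, `==`)) == ncol(m)

# Count matching rows that were made by a different creator than row 1.
count_row1 <- sum(same_attrs & toy$creator != toy$creator[1])
```

Wrapping the last three statements in a loop or `sapply()` over the row index would give the count for every row, although for many rows a grouping approach is likely to be faster.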

I read several threads about comparing several columns with each other, but did not succeed in transferring the solutions to my problem. E.g. one wants to look up a single value for the boolean value, whereas I have multiple TRUE values; another is the same; a third wants to convert to ac(), which I could do too and compare those, but that seems like the hard way, doesn't it?

Finally (from that last link), I was even thinking of converting the boolean values to numbers and adding up the indices, so that we have

id creator att1 ... index
a1 person1 1 2 0 ... 3 
a2 person2 1 2 3 ... 6

and comparing that index. That should work, but it feels like an ugly workaround. Also, thinking of data other than boolean, such as several strings, in the long run I'd still like to be able to compare a tuple of columns against each other independently of their content.
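One caveat with a summed index: different combinations can collide, since att1 and att2 together give 1 + 2 = 3, the same as att3 alone. A type-independent alternative, sketched here as one possible approach, is to paste each row's values into a single string key and compare the keys. The data frame `toy`, the column names, and the separator are assumptions for illustration; the separator must not occur inside any value.

```r
# Illustrative data with mixed column types (all names are made up).
toy <- data.frame(
  creator = c("p1", "p2", "p1", "p2"),
  att1    = c(TRUE, TRUE, FALSE, TRUE),
  att2    = c("gin", "gin", "rum", "rum"),
  att3    = c(2L, 2L, 1L, 2L),
  stringsAsFactors = FALSE
)

attr_cols <- c("att1", "att2", "att3")

# One string key per row: do.call() hands every attribute column to paste().
# Assumes the separator "|" never appears inside a value.
toy$key <- do.call(paste, c(toy[attr_cols], sep = "|"))

# Rows with equal keys have identical attribute tuples.
same_as_first <- toy$key == toy$key[1]
```

Because the key is an ordinary character column, it can be used for grouping or merging regardless of what the underlying columns contain.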

What am I missing? :)

Thanks for your help!

As asked for in the comments, here is a short script to create a similar data frame. Keep in mind, though, that there are far more columns to compare.

id <- 1:50
names <- paste("creator", rep(1:10, each = 5))
bools1 <- rnorm(n=50, mean = 5, sd = 3)
bools1 <- ifelse(bools1>5, TRUE, FALSE)
bools2 <- rnorm(n=50, mean = 5, sd = 3)
bools2 <- ifelse(bools2>5, TRUE, FALSE)
bools3 <- rnorm(n=50, mean = 5, sd = 3)
bools3 <- ifelse(bools3>5, TRUE, FALSE)
bools4 <- rnorm(n=50, mean = 5, sd = 3)
bools4 <- ifelse(bools4>5, TRUE, FALSE)
bools5 <- rnorm(n=50, mean = 5, sd = 3)
bools5 <- ifelse(bools5>5, TRUE, FALSE)

data <- data.frame(id, names, bools1, bools2, bools3, bools4, bools5)

EDIT: Sorry, my first solution misread the question. Try this instead.

You can do this using data.table:

#Your set up data (with seed)
set.seed(123)
id <- 1:50
names <- paste("creator", rep(1:10, each = 5))
bools1 <- rnorm(n=50, mean = 5, sd = 3)
bools1 <- ifelse(bools1>5, TRUE, FALSE)
bools2 <- rnorm(n=50, mean = 5, sd = 3)
bools2 <- ifelse(bools2>5, TRUE, FALSE)
bools3 <- rnorm(n=50, mean = 5, sd = 3)
bools3 <- ifelse(bools3>5, TRUE, FALSE)
bools4 <- rnorm(n=50, mean = 5, sd = 3)
bools4 <- ifelse(bools4>5, TRUE, FALSE)
bools5 <- rnorm(n=50, mean = 5, sd = 3)
bools5 <- ifelse(bools5>5, TRUE, FALSE)

data <- data.frame(id, names, bools1, bools2, bools3, bools4, bools5)

# Code to run

library(data.table)

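setDT(data)  # convert to a data.table by reference
# Long format: one row per (id, names, attribute) combination
dt_m <- melt(data, id.vars = c("id","names"), variable.factor = TRUE)
# Collapse each row's attribute values into a single "drink" key
dt_m <- dt_m[,.(drink = paste0(value, collapse = "_")), by = .(id, names)]
# Rows per drink overall, minus the rows made by the same creator
dt_m[, times_made := .N, by = drink][, times_made_others := times_made - .N, by = .(drink, names)]
# Attach the key and the count to the original data
dt_out <- merge(data, dt_m[, .(id, drink, times_made_others)], by = "id")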

Essentially what you are doing is creating the "drinks" by collapsing the columns together, counting the number of times that drink was made by others, and then merging that back to your original data set.

dt_out
    id      names bools1 bools2 bools3 bools4 bools5                        drink times_made_others
 1:  1  creator 1  FALSE   TRUE  FALSE   TRUE   TRUE   FALSE_TRUE_FALSE_TRUE_TRUE                 3
 2:  2  creator 1  FALSE  FALSE   TRUE   TRUE   TRUE   FALSE_FALSE_TRUE_TRUE_TRUE                 1
 3:  3  creator 1   TRUE  FALSE  FALSE   TRUE  FALSE  TRUE_FALSE_FALSE_TRUE_FALSE                 2
 4:  4  creator 1   TRUE   TRUE  FALSE  FALSE   TRUE   TRUE_TRUE_FALSE_FALSE_TRUE                 0
 5:  5  creator 1   TRUE  FALSE  FALSE  FALSE  FALSE TRUE_FALSE_FALSE_FALSE_FALSE                 3
 6:  6  creator 2   TRUE   TRUE  FALSE  FALSE  FALSE  TRUE_TRUE_FALSE_FALSE_FALSE                 2
 7:  7  creator 2   TRUE  FALSE  FALSE   TRUE  FALSE  TRUE_FALSE_FALSE_TRUE_FALSE                 2
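For comparison, the same collapse-and-count idea can be sketched in base R on the five-row example from the question. This is not part of the answer above; `toy` and the column names are made up for illustration. Two counting rules are shown because the question's description leaves room for both: variant A counts rows by other creators (which is what subtracting `.N` per creator does), while variant B counts distinct other creators.

```r
# The five example rows from the question (names are illustrative).
toy <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  creator = c("person1", "person2", "person1", "person1", "person2"),
  att1    = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  att3    = c(FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Collapse the attribute columns into one key per row.
toy$drink <- do.call(paste, c(toy[3:5], sep = "_"))

idx <- seq_len(nrow(toy))
# Number of rows sharing the drink, and sharing both drink and creator.
n_drink         <- ave(idx, toy$drink, FUN = length)
n_drink_creator <- ave(idx, paste(toy$drink, toy$creator), FUN = length)

# Variant A: rows with the same drink made by other creators.
toy$rows_by_others <- n_drink - n_drink_creator

# Variant B: distinct other creators who made the same drink.
creators_per_drink <- tapply(toy$creator, toy$drink,
                             function(x) length(unique(x)))
toy$creators_others <- as.integer(creators_per_drink[toy$drink]) - 1L
```

On this toy data, variant A gives 1, 0, 0, 1, 2 for a1 to a5, and variant B gives 1, 0, 0, 1, 1. The two differ only for a5, where person1 made the same drink twice.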
