Compare multiple boolean columns in R

A little puzzle. As always, I think I'm missing something. I have a data frame like this:

id creator att1 att2 att3 att... att500
a1 person1 TRUE TRUE FALSE ...
a2 person2 TRUE TRUE TRUE ...
a3 person1 TRUE FALSE FALSE ...
a4 person1 TRUE TRUE FALSE ...
a5 person2 TRUE TRUE FALSE ...

And so on. For each row, I want to count how often the same attribute combination (about 500 boolean values) occurs in rows by a different creator, and add that count to the respective row. In the example above, the first row (a1) should get count = 1, because in a5 a different person has made the very same attribute combination. Note that a4 does not count: it is the same combination, but by the same person. Think of self-mixed cocktails and how often different people mix the same one independently of each other. Row a2 should have a count of 0, and so should a3 (no matching attribute combination); a4 gets count = 1 because of a5. a5 has a count of 1 too. However, if other people mix the same cocktail several times, this should be counted. I don't want to simply remove duplicates.

My plan is therefore to loop through the rows, exclude all cocktails by the creator of the current row, take the row's attribute combination, and compare it with all rows in the remaining temporary data set:

for (row in 1:nrow(data)){
   # for each row in data: get the creator
   creator <- row$creator
   # get the attribute combination of the row
   attr_tupel <- row[1, 3:500]
   # write into the $count column of the current row the number of observations that are
   # not from the same creator and match the exact tuple of my ~500 attributes
   # (equal cocktails by different persons)
   data[row]$count <- nrow(data[data$creator != creator & data[3:500] == attr_tupel])
}

Unfortunately, I can't compare the tuple of the reference row with the other rows, because R complains that '==' is only defined for equally-sized data frames.
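One way around that error, as a rough sketch: turn the attribute columns into a logical matrix and compare every row against the reference row in one vectorised step. The toy data below is the five-row table from above cut down to three attributes, and `att` and `same_combo` are names introduced only for this illustration. The loop counts rows by other creators, so under this reading a5 gets 2 (a1 and a4).

```r
# Toy data: the five example rows, with only three attribute columns
data <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  creator = c("person1", "person2", "person1", "person1", "person2"),
  att1    = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  att3    = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Logical matrix with one row per cocktail and one column per attribute
att <- as.matrix(data[3:ncol(data)])

data$count <- 0L
for (i in seq_len(nrow(data))) {
  # t(att) has one column per cocktail, so att[i, ] lines up with each column;
  # a cocktail matches row i when all of its attributes are equal
  same_combo <- colSums(t(att) == att[i, ]) == ncol(att)
  # keep only the matches that come from a different creator
  data$count[i] <- sum(same_combo & data$creator != data$creator[i])
}

data[c("id", "creator", "count")]
```

With 500 attribute columns only the column range changes; the loop is still quadratic in the number of rows, so a grouped count is likely the better fit for large data.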

And now I'm stuck. I could certainly write out each column separately, but that would take ages. Do I need to convert the data frame into a list, a vector, or something else? (Vector and list don't work.) Is it possible at all to compare one row of values with many other rows for equality? I don't think duplicating the row is the solution; besides, R usually just recycles the shorter operand when it runs out of values to compare. Why not here?

I read several threads about comparing several columns with each other, but did not manage to transfer their solutions to my problem. For example: one wants to look up a single value for the boolean, whereas I have multiple TRUE values; another is in the same situation; a third wants to convert to a c(), which I could also do and then compare those, but that seems like the hard way, doesn't it?

Finally (prompted by that last link), I was even thinking of converting the boolean values to numbers and adding up the indices, so that we have

id creator att1 ... index
a1 person1 1 2 0 ... 3 
a2 person2 1 2 3 ... 6

and compare that index. That should work, but it feels like an ugly workaround. Also, thinking of data other than booleans, such as several strings, in the long run I would still like to be able to compare tuples of columns against each other regardless of their content.
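A caveat on that index idea, sketched under the assumption that the index is the plain sum of the positions of the TRUE attributes, as in the small table above: different combinations can then share an index (positions 1 and 2 sum to 3, and so does position 3 alone). Pasting the values into one string per row avoids such collisions. The names `att`, `index_sum` and `key` are made up for this illustration.

```r
# Two different attribute combinations (rows) over three attributes
att <- rbind(
  c(TRUE,  TRUE,  FALSE),  # attributes 1 and 2 -> 1 + 2 = 3
  c(FALSE, FALSE, TRUE)    # attribute 3 alone  -> 3
)

# Sum of the column positions of the TRUE values, per row
index_sum <- rowSums(att * col(att))

# One string per row, built from all column values
key <- apply(att, 1, paste, collapse = "_")

index_sum[1] == index_sum[2]  # TRUE: the sums collide
key[1] == key[2]              # FALSE: the string keys differ
```

A pasted key carries over to other column types as well, as long as the separator cannot occur inside the values themselves.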

What am I missing? :)

Thanks for your help!

As requested in the comments, here is a short script to create a similar data frame. Keep in mind, though, that there are many more columns to compare.

id <- 1:50
names <- paste("creator", rep(1:10, each = 5))
bools1 <- rnorm(n=50, mean = 5, sd = 3)
bools1 <- ifelse(bools1>5, TRUE, FALSE)
bools2 <- rnorm(n=50, mean = 5, sd = 3)
bools2 <- ifelse(bools2>5, TRUE, FALSE)
bools3 <- rnorm(n=50, mean = 5, sd = 3)
bools3 <- ifelse(bools3>5, TRUE, FALSE)
bools4 <- rnorm(n=50, mean = 5, sd = 3)
bools4 <- ifelse(bools4>5, TRUE, FALSE)
bools5 <- rnorm(n=50, mean = 5, sd = 3)
bools5 <- ifelse(bools5>5, TRUE, FALSE)

data <- data.frame(id, names, bools1, bools2, bools3, bools4, bools5)

EDIT: Sorry, my first solution misread the question. Try this instead.

You can do this with data.table:

#Your set up data (with seed)
set.seed(123)
id <- 1:50
names <- paste("creator", rep(1:10, each = 5))
bools1 <- rnorm(n=50, mean = 5, sd = 3)
bools1 <- ifelse(bools1>5, TRUE, FALSE)
bools2 <- rnorm(n=50, mean = 5, sd = 3)
bools2 <- ifelse(bools2>5, TRUE, FALSE)
bools3 <- rnorm(n=50, mean = 5, sd = 3)
bools3 <- ifelse(bools3>5, TRUE, FALSE)
bools4 <- rnorm(n=50, mean = 5, sd = 3)
bools4 <- ifelse(bools4>5, TRUE, FALSE)
bools5 <- rnorm(n=50, mean = 5, sd = 3)
bools5 <- ifelse(bools5>5, TRUE, FALSE)

data <- data.frame(id, names, bools1, bools2, bools3, bools4, bools5)

# Code to run

library(data.table)

setDT(data)
dt_m <- melt(data, id.vars = c("id","names"), variable.factor = TRUE)
dt_m <- dt_m[,.(drink = paste0(value, collapse = "_")), by = .(id, names)]
dt_m[, times_made := .N, by = drink][, times_made_others := times_made - .N, by = .(drink, names)]
dt_out <- merge(data, dt_m[, .(id, drink, times_made_others)], by = "id")

Essentially, you create the "drinks" by collapsing the boolean columns into one string per row, count how many times each drink was made by others (total occurrences of the drink minus the occurrences by the same creator), and then merge that count back into your original data set.

dt_out
    id      names bools1 bools2 bools3 bools4 bools5                        drink times_made_others
 1:  1  creator 1  FALSE   TRUE  FALSE   TRUE   TRUE   FALSE_TRUE_FALSE_TRUE_TRUE                 3
 2:  2  creator 1  FALSE  FALSE   TRUE   TRUE   TRUE   FALSE_FALSE_TRUE_TRUE_TRUE                 1
 3:  3  creator 1   TRUE  FALSE  FALSE   TRUE  FALSE  TRUE_FALSE_FALSE_TRUE_FALSE                 2
 4:  4  creator 1   TRUE   TRUE  FALSE  FALSE   TRUE   TRUE_TRUE_FALSE_FALSE_TRUE                 0
 5:  5  creator 1   TRUE  FALSE  FALSE  FALSE  FALSE TRUE_FALSE_FALSE_FALSE_FALSE                 3
 6:  6  creator 2   TRUE   TRUE  FALSE  FALSE  FALSE  TRUE_TRUE_FALSE_FALSE_FALSE                 2
 7:  7  creator 2   TRUE  FALSE  FALSE   TRUE  FALSE  TRUE_FALSE_FALSE_TRUE_FALSE                 2
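
The same collapse-and-count idea can also be sketched in base R. This is not part of the answer above, only an illustration on the five-row table from the question (cut down to three attributes); `drink_key`, `rows_by_others` and `other_creators` are names chosen here. Because the question can be read either as counting rows by other creators or as counting distinct other creators, the sketch computes both.

```r
# Toy data: the five example rows from the question, three attributes only
data <- data.frame(
  id      = c("a1", "a2", "a3", "a4", "a5"),
  creator = c("person1", "person2", "person1", "person1", "person2"),
  att1    = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  att2    = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  att3    = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)
att_cols <- setdiff(names(data), c("id", "creator"))

# One key per row: all attribute values pasted together
drink_key <- do.call(paste, c(data[att_cols], sep = "_"))

row_id <- seq_len(nrow(data))
# Number of rows sharing the key, and sharing both key and creator
times_made      <- ave(row_id, drink_key, FUN = length)
times_made_self <- ave(row_id, paste(drink_key, data$creator, sep = "|"),
                       FUN = length)

# Reading 1: rows with the same key that were made by someone else
data$rows_by_others <- times_made - times_made_self

# Reading 2: distinct other creators who made the same key
creator_id <- as.integer(factor(data$creator))
data$other_creators <- ave(creator_id, drink_key,
                           FUN = function(v) length(unique(v))) - 1L

data[c("id", "creator", "rows_by_others", "other_creators")]
```

On this table the second reading gives 1, 0, 0, 1, 1, which matches the counts listed in the question, while the first reading gives 2 for a5 because both a1 and a4 count.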
