In R, compare one column value to all other columns

I'm very new to R and have a question that is probably simple for the experts here.

Let's say I have a table "sales" with 4 customer IDs (123-126) and 4 products (A, B, C, D).

ID  A   B   C   D
123 0   1   1   0
124 1   1   0   0
125 1   1   0   1
126 0   0   0   1

I want to calculate the overlaps between products. For A, the number of IDs that have both A and B is 2. Similarly, the overlap between A and C is 0 and that between A and D is 1. Here is my code for the A and B overlap:

# count(df, "col") with a quoted column name is assumed to be plyr::count
overlap <- sales[which(sales[, "A"] == 1 & sales[, "B"] == 1), ]
countAB <- count(overlap, "ID")

I want to repeat this calculation for all 4 products, so A overlaps with B, C, D; B overlaps with A, C, D; and so on. How can I change the code to accomplish this?

I want the final output to be the number of IDs for each two-product combination. It's a product affinity exercise: for a given product, I want to find out which other product sold the most with it. For example, for A, the product most often sold with it is B, followed by D, then C. I think some sorting needs to be added to the code to get there.

Thanks for your help!

Here's a possible solution:

sales <- 
read.csv(text=
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")

# get product names
prods <- colnames(sales)[-1]
# generate all products pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))

# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')  

# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
#   N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition) 
#         since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})

> res
  P1 P2 count
1  A  B     2
2  A  C     0
3  A  D     1
4  B  C     1
5  B  D     1
6  C  D     0
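
The rowSums(...) == 2 test above relies on every cell being exactly 0 or 1. As a minimal sketch, assuming the table might instead hold purchase quantities (an assumption, not something the question states), the same pairwise count can be written with an explicit logical AND. The names sales_qty, count_pair and res_qty are hypothetical and used only for illustration:

```r
# Hypothetical data: same layout as `sales`, but cells hold quantities, not 0/1 flags
sales_qty <- data.frame(
  ID = 123:126,
  A  = c(0, 2, 1, 0),
  B  = c(1, 1, 3, 0),
  C  = c(1, 0, 0, 0),
  D  = c(0, 0, 1, 2)
)

prods <- colnames(sales_qty)[-1]
combs <- t(combn(prods, 2))

# Count the IDs that bought both products of a pair, whatever the quantity
count_pair <- function(comb) {
  sum(sales_qty[[comb[1]]] > 0 & sales_qty[[comb[2]]] > 0)
}

res_qty <- data.frame(P1 = combs[, 1], P2 = combs[, 2],
                      count = apply(combs, 1, count_pair))
res_qty
```

With plain 0/1 data the two formulations give the same counts, so this variant only matters if quantities above 1 can appear.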

Another option, using mapply over the column pairs:

# x1 is your data frame
x1 <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L),
                     C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)),
                .Names = c("ID", "A", "B", "C", "D"), class = "data.frame",
                row.names = c(NA, -4L))
# get all pairs of column names except the first ("ID")
k1 <- combn(colnames(x1[, -1]), 2)
# split the pairs into two lists, a1 (first of each pair) and a2 (second of each pair),
# so that we can iterate over them in parallel
a1 <- as.list(k1[seq(1, length(k1), 2)])
a2 <- as.list(k1[seq(2, length(k1), 2)])
# your own condition, with varying columns i and j
mapply(function(i, j) length(x1[which(x1[, i] == 1 & x1[, j] == 1), 1]), a1, a2)
[1] 2 0 1 1 1 0

You can use matrix multiplication:

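library(reshape)  # melt(); the X1/X2 column names below match the reshape package
m <- as.matrix(d[-1])
z <- melt(crossprod(m, m))
# keep each unordered pair once
z[as.integer(z$X1) < as.integer(z$X2), ]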
#    X1 X2 value
# 5   A  B     2
# 9   A  C     0
# 10  B  C     1
# 13  A  D     1
# 14  B  D     1
# 15  C  D     0

where d is your data frame:

d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))
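
Why this works: for a 0/1 matrix m, entry (i, j) of crossprod(m), that is t(m) %*% m, sums the products of columns i and j over all rows, which is exactly the number of IDs that have both products. The diagonal holds each product's own total. Here is a minimal base-R sketch of the same idea without melt, assuming the same data as d above; the names co, idx and pairs are illustrative only:

```r
d <- data.frame(ID = 123:126,
                A = c(0L, 1L, 1L, 0L),
                B = c(1L, 1L, 1L, 0L),
                C = c(1L, 0L, 0L, 0L),
                D = c(0L, 0L, 1L, 1L))

m  <- as.matrix(d[-1])
co <- crossprod(m)   # 4 x 4 co-occurrence matrix, same as t(m) %*% m

# Keep each unordered pair once by taking the strict upper triangle
idx   <- which(upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(P1    = rownames(co)[idx[, "row"]],
                    P2    = colnames(co)[idx[, "col"]],
                    count = co[upper.tri(co)])
pairs
```

Using the upper triangle plays the same role as the X1 < X2 filter: it drops the diagonal and the mirrored duplicates.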

[Update]

To calculate the product affinity, you can do:

z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
#   X1 X2 value
# A  A  B     2
# B  B  A     2
# C  C  B     1
# D  D  A     1
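
Note that which.max keeps only one partner per product and, on ties, the first one it meets. Since the question also asks for the remaining partners in order (for A: B, then D, then C), here is a minimal base-R sketch that ranks every partner, assuming the same 0/1 data. rank_partners is a hypothetical helper name, not part of any package:

```r
d <- data.frame(ID = 123:126,
                A = c(0L, 1L, 1L, 0L),
                B = c(1L, 1L, 1L, 0L),
                C = c(1L, 0L, 0L, 0L),
                D = c(0L, 0L, 1L, 1L))

co <- crossprod(as.matrix(d[-1]))   # pairwise co-occurrence counts

# For one product, list all other products by descending co-occurrence count
rank_partners <- function(co, prod) {
  counts <- co[prod, setdiff(colnames(co), prod)]
  sort(counts, decreasing = TRUE)
}

rank_partners(co, "A")

# The same ranking for every product at once, as a named list
all_ranks <- lapply(setNames(colnames(co), colnames(co)),
                    function(p) rank_partners(co, p))
all_ranks
```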

You might want to take a look at the arules package. It does exactly what you are looking for. According to its description, it provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules), as well as interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
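
As a rough sketch of how that could look, assuming arules is installed and that its crossTable() function with measure = "count" fits this use (the answer itself names no specific function, so treat this as one possible route rather than the recommended one):

```r
d <- data.frame(ID = 123:126,
                A = c(0L, 1L, 1L, 0L),
                B = c(1L, 1L, 1L, 0L),
                C = c(1L, 0L, 0L, 0L),
                D = c(0L, 0L, 1L, 1L))

# Only run when the package is available; nothing is installed here
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  # Turn the 0/1 product columns into a logical matrix, then into transactions
  trans <- as(as.matrix(d[-1]) == 1, "transactions")
  # Item-by-item table of how many transactions contain both items
  ct <- crossTable(trans, measure = "count")
  print(ct)
}
```

For four products this is heavier than the approaches above; the package becomes more useful once the number of products grows and rules over larger item sets are of interest.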


