In R, compare one column value to all other columns
I'm very new to R and I have a question that is probably very simple for the experts here.
Let's say I have a table "sales", which includes 4 customer IDs (123-126) and 4 products (A, B, C, D).
ID A B C D
123 0 1 1 0
124 1 1 0 0
125 1 1 0 1
126 0 0 0 1
I want to calculate the overlaps between products. For A, the number of IDs that have both A and B is 2. Similarly, the overlap between A and C is 0 and the overlap between A and D is 1. Here is my code for the A and B overlap:
overlap <- sales[which(sales[, "A"] == 1 & sales[, "B"] == 1), ]
countAB <- count(overlap, "ID")
I want to repeat this calculation for all 4 products, so that A is compared with B, C, D, then B with A, C, D, and so on. How can I change the code to accomplish this?
I want the final output to be the number of IDs for each two-product combination. It's a product affinity exercise: for a given product, I want to find out which other product was sold with it most often. For example, for A, the product most often sold with it is B, followed by D, then C. I think some sorting needs to be added to the code to get there.
Thanks for your help!
Here's a possible solution:
sales <-
read.csv(text=
"ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")
# get product names
prods <- colnames(sales)[-1]
# generate all products pairs (and transpose the matrix for convenience)
combs <- t(combn(prods,2))
# turn the combs into a data.frame with column P1,P2
res <- as.data.frame(combs)
colnames(res) <- c('P1','P2')
# for each combination row :
# - subset sales selecting only the products in the row
# - count the number of rows summing to 2 (if sum=2 the 2 products have been sold together)
# N.B.: length(which(logical_condition)) can be implemented with sum(logical_condition)
# since TRUE and FALSE are automatically coerced to 1 and 0
# finally add the resulting vector to the newly created data.frame
res$count <- apply(combs,1,function(comb){sum(rowSums(sales[,comb])==2)})
> res
P1 P2 count
1 A B 2
2 A C 0
3 A D 1
4 B C 1
5 B D 1
6 C D 0
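The question also asks which products sell most often alongside a given product. A minimal sketch of that ranking step, assuming the `res` table built above; the helper name `partners_of` is hypothetical and not part of the original answer:

```r
# Rebuild the example data and the pair counts from the answer above
sales <- read.csv(text = "ID,A,B,C,D
123,0,1,1,0
124,1,1,0,0
125,1,1,0,1
126,0,0,0,1")
prods <- colnames(sales)[-1]
combs <- t(combn(prods, 2))
res <- data.frame(P1 = combs[, 1], P2 = combs[, 2], stringsAsFactors = FALSE)
res$count <- apply(combs, 1, function(comb) sum(rowSums(sales[, comb]) == 2))

# partners_of() is a hypothetical helper: it lists every other product
# paired with `p`, most frequently co-sold first
partners_of <- function(res, p) {
  hits <- res[res$P1 == p | res$P2 == p, ]
  out <- data.frame(
    partner = ifelse(hits$P1 == p, hits$P2, hits$P1),
    count = hits$count,
    stringsAsFactors = FALSE
  )
  out <- out[order(-out$count), ]
  rownames(out) <- NULL
  out
}

partners_of(res, "A")
```

For product A this lists B first, then D, then C, which matches the ordering described in the question.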
#x1 is your dataframe
x1<-structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L,
1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID",
"A", "B", "C", "D"), class = "data.frame", row.names = c(NA,
-4L))
#get the combination of all colnames but the first ("ID")
k1<-combn(colnames(x1[,-1]),2)
#create two lists a1 and a2 so that we can iterate over each element
a1<-as.list(k1[seq(1,length(k1),2)])
a2<-as.list(k1[seq(2,length(k1),2)])
# your own functions with varying i and j
mapply(function(i,j) length(x1[which(x1[,i] == 1 & x1 [,j] == 1 ),1]),a1,a2)
[1] 2 0 1 1 1 0
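The same counts can probably be obtained without splitting the pairs into two lists, because `combn()` accepts a `FUN` argument that is applied to each combination. A sketch, assuming the same data as `x1` above (rebuilt here with `data.frame()` so the block stands alone):

```r
x1 <- data.frame(ID = 123:126,
                 A = c(0L, 1L, 1L, 0L),
                 B = c(1L, 1L, 1L, 0L),
                 C = c(1L, 0L, 0L, 0L),
                 D = c(0L, 0L, 1L, 1L))

prods <- colnames(x1)[-1]

# For each pair of product columns, count the rows where both are 1
counts <- combn(prods, 2,
                FUN = function(p) sum(x1[[p[1]]] == 1 & x1[[p[2]]] == 1))

# Label each count with its pair, e.g. "A-B"
names(counts) <- combn(prods, 2, FUN = paste, collapse = "-")
counts
```

The named vector keeps each count next to the pair it belongs to, which the bare `mapply()` output does not.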
You can use matrix multiplication:
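# melt() is assumed to come from the reshape package, which produces the
# X1/X2 column names shown below (reshape2 names them Var1/Var2 instead)
library(reshape)
m <- as.matrix(d[-1])
z <- melt(crossprod(m,m))
z[as.integer(z$X1) < as.integer(z$X2),]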
# X1 X2 value
# 5 A B 2
# 9 A C 0
# 10 B C 1
# 13 A D 1
# 14 B D 1
# 15 C D 0
where d is your data frame:
d <- structure(list(ID = 123:126, A = c(0L, 1L, 1L, 0L), B = c(1L, 1L, 1L, 0L), C = c(1L, 0L, 0L, 0L), D = c(0L, 0L, 1L, 1L)), .Names = c("ID", "A", "B", "C", "D"), class = "data.frame", row.names = c(NA, -4L))
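Why this works: with a 0/1 matrix `m`, entry (i, j) of `crossprod(m)` (that is, `t(m) %*% m`) sums the products of columns i and j over all rows, which equals the number of IDs that bought both products. The diagonal holds each product's own total. A base R sketch that avoids the `melt()` dependency, assuming the same data as `d`:

```r
d <- data.frame(ID = 123:126,
                A = c(0L, 1L, 1L, 0L),
                B = c(1L, 1L, 1L, 0L),
                C = c(1L, 0L, 0L, 0L),
                D = c(0L, 0L, 1L, 1L))

m <- as.matrix(d[-1])
co <- crossprod(m)   # 4 x 4 symmetric co-occurrence matrix
co

# Keep each unordered pair once by taking the upper triangle
idx <- which(upper.tri(co), arr.ind = TRUE)
pairs <- data.frame(P1 = rownames(co)[idx[, "row"]],
                    P2 = colnames(co)[idx[, "col"]],
                    count = co[idx])
pairs
```

The rows of `pairs` come out in the same order as the filtered `melt()` result above.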
[Update]
To calculate the product affinity, you can do:
z2 <- subset(z,X1!=X2)
do.call(rbind,lapply(split(z2,z2$X1),function(d) d[which.max(d$value),]))
# X1 X2 value
# A A B 2
# B B A 2
# C C B 1
# D D A 1
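Note that `which.max()` returns only the first maximum, so ties are dropped: in the output above D is paired with A, although B was sold with D equally often. A base R sketch that keeps every tied partner, working directly on the co-occurrence matrix (the name `top_partners` is illustrative, not from the original answer):

```r
d <- data.frame(ID = 123:126,
                A = c(0L, 1L, 1L, 0L),
                B = c(1L, 1L, 1L, 0L),
                C = c(1L, 0L, 0L, 0L),
                D = c(0L, 0L, 1L, 1L))

co <- crossprod(as.matrix(d[-1]))
diag(co) <- NA   # a product is not its own partner

# For each product, name all partners that reach the row maximum
top_partners <- lapply(rownames(co), function(p) {
  r <- co[p, ]
  names(r)[!is.na(r) & r == max(r, na.rm = TRUE)]
})
names(top_partners) <- rownames(co)
top_partners
```

Whether ties should be reported or broken by some other rule depends on how the affinity result will be used.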
You might want to take a look at the arules package. It does exactly what you are looking for: it provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). It also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.
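As a rough sketch of what that could look like for the example data, assuming the arules package is installed: the 0/1 columns are coerced to a `transactions` object, and `crossTable()` then returns the pairwise co-occurrence counts. Treat this as an illustration under those assumptions rather than a recommended workflow.

```r
d <- data.frame(ID = 123:126,
                A = c(0L, 1L, 1L, 0L),
                B = c(1L, 1L, 1L, 0L),
                C = c(1L, 0L, 0L, 0L),
                D = c(0L, 0L, 1L, 1L))

# Only run if arules is available (assumption: it has been installed)
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  # Logical matrix: rows are customers (transactions), columns are products (items)
  trans <- as(as.matrix(d[-1]) == 1, "transactions")
  # Item-by-item table of how many transactions contain both items
  ct <- crossTable(trans)
  print(ct)
}
```

For four products this is heavier than the base R approaches, but the same `transactions` object can be passed on to `apriori()` or `eclat()` when the product catalogue grows.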