Finding the frequency of pairs in a dataset - R

Question

I have the following data:

Name    Event

John    EventA
Anna    EventA
Dave    EventA
Stew    EventB
John    EventB
Anna    EventB
John    EventC
Stew    EventC
Dave    EventC

I want to find out which pairs of people attend the same events most often. In the example above, it should return the three most similar pairs: John & Anna, John & Dave, and John & Stew.

I assume I'd need to build a frequency matrix like the one below:

Name    John    Anna    Dave     Stew
John     0       2       2        2
Anna     2       0       1        1
Dave     2       1       0        1
Stew     2       1       1        0

And then transform it into something like this:

Pair          Frequency

John Anna         2
John Dave         2
John Stew         2
Anna Dave         1
Anna Stew         1
Dave Stew         1

But I have no idea how to go about that.

I'm working with R, so if anyone knows a way of doing this, it'd be a huge help!

Answer 1

You can use table() from base R and melt() from the reshape2 package.

#DATA
df = structure(list(Name = c("John", "Anna", "Dave", "Stew", "John", 
"Anna", "John", "Stew", "Dave"), Event = c("EventA", "EventA", 
"EventA", "EventB", "EventB", "EventB", "EventC", "EventC", "EventC"
)), .Names = c("Name", "Event"), row.names = c(NA, -9L), class = "data.frame")

#Get Pairwise Frequency
a = table(df) %*% t(table(df))    
a
#      Name
#Name   Anna Dave John Stew
#  Anna    2    1    2    1
#  Dave    1    2    2    1
#  John    2    2    3    2
#  Stew    1    1    2    2

#If you want, set diagonal elements to zero (From Karthik's comment)
#diag(a) <- 0 

library(reshape2)
output = data.frame(melt(a))
colnames(output) = c("Name1", "Name2", "Value")

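#Remove each name's pairing with itself
#(a logical filter also behaves correctly when no row matches, unlike -which())
output = output[as.character(output$Name1) != as.character(output$Name2),]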
output
#   Name1 Name2 Value
#2   Dave  Anna     1
#3   John  Anna     2
#4   Stew  Anna     1
#5   Anna  Dave     1
#7   John  Dave     2
#8   Stew  Dave     1
#9   Anna  John     2
#10  Dave  John     2
#12  Stew  John     2
#13  Anna  Stew     1
#14  Dave  Stew     1
#15  John  Stew     2

#YOU CAN PASTE 'NAME1' and 'NAME2' to a 'PAIR' if necessary
#output$PAIR = apply(output, 1, function(x) paste(sort(x[1:2]), collapse = " "))
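
To see why the matrix product above counts shared events, here is a minimal sketch. It rebuilds the same data with data.frame() and checks one entry by hand; the names m and shared are illustrative and not part of the original answer. It assumes each Name/Event combination appears at most once in the data, otherwise the counts get multiplied.

```r
# Same data as above, rebuilt with data.frame() for readability
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC")
)

# Incidence matrix: one row per name, one column per event, 1 = attended
m <- table(df)

# Entry [i, j] of the product is the sum over events of m[i, e] * m[j, e],
# i.e. the number of events that names i and j both attended
a <- m %*% t(m)

# Hand check for one pair: John and Anna share EventA and EventB
shared <- sum(m["John", ] * m["Anna", ])
```

The diagonal of a holds each person's own event count (John attended three events), which is why it is either zeroed or filtered out afterwards.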

Answer 2

This seems to be a little closer to what you are asking for, and it uses only base R functions. Using the "df" from @db's answer:

x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
na.omit(data.frame(x))
#    Name Name.1 Freq
# 5  Anna   Dave    1
# 9  Anna   John    2
# 10 Dave   John    2
# 13 Anna   Stew    1
# 14 Dave   Stew    1
# 15 John   Stew    2

Setting the diagonal and the lower triangle to NA lets us easily drop the values we are not interested in: each name paired with itself, and the mirrored duplicate of every pair.
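
The question also asks for the top three pairs, which takes one more sorting step. The following is a rough sketch along the same lines, not part of the original answer; the names pairs, top3 and the Pair column are introduced here for illustration only.

```r
# Same data as in the first answer
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC")
)

# Pairwise counts, keeping only the upper triangle (one row per unordered pair)
x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
pairs <- na.omit(data.frame(x))

# Build a readable label from the first two columns (the two names)
pairs$Pair <- paste(pairs[[1]], pairs[[2]])

# Sort by descending frequency and keep the three most frequent pairs
pairs <- pairs[order(-pairs$Freq), ]
top3 <- head(pairs[, c("Pair", "Freq")], 3)
```

With the sample data, top3 holds the three pairs involving John, each with a frequency of 2. When several pairs tie at the cutoff, head() simply keeps whichever come first after sorting, so a tie-breaking rule may be needed for real data.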
