I have the following data:
Name Event
John EventA
Anna EventA
Dave EventA
Stew EventB
John EventB
Anna EventB
John EventC
Stew EventC
Dave EventC
I want to find out which people attend the same events most often. In the example above, it should return that the top 3 most similar pairs are: John & Anna, John & Dave, John & Stew.
I assume I'd need to make a frequency matrix like the one below:
Name John Anna Dave Stew
John 0 2 2 2
Anna 2 0 1 1
Dave 2 1 0 1
Stew 2 1 1 0
And then transform it to something like this:
Pair Frequency
John Anna 2
John Dave 2
John Stew 2
Anna Dave 1
Anna Stew 1
Dave Stew 1
But I have no idea how to go about that.
I'm working with R, so if anyone knows a way of doing this, it'd be a huge help!
You can use table from base R and melt from the reshape2 package.
#DATA
df = structure(list(Name = c("John", "Anna", "Dave", "Stew", "John",
"Anna", "John", "Stew", "Dave"), Event = c("EventA", "EventA",
"EventA", "EventB", "EventB", "EventB", "EventC", "EventC", "EventC"
)), .Names = c("Name", "Event"), row.names = c(NA, -9L), class = "data.frame")
#Get Pairwise Frequency
a = table(df) %*% t(table(df))
a
# Name
#Name Anna Dave John Stew
# Anna 2 1 2 1
# Dave 1 2 2 1
# John 2 2 3 2
# Stew 1 1 2 2
#If you want, set diagonal elements to zero (From Karthik's comment)
#diag(a) <- 0
library(reshape2)
output = data.frame(melt(a))
colnames(output) = c("Name1", "Name2", "Value")
#Remove each name's pairing with itself (logical indexing is also safe when nothing matches)
output = output[output$Name1 != output$Name2, ]
output
# Name1 Name2 Value
#2 Dave Anna 1
#3 John Anna 2
#4 Stew Anna 1
#5 Anna Dave 1
#7 John Dave 2
#8 Stew Dave 1
#9 Anna John 2
#10 Dave John 2
#12 Stew John 2
#13 Anna Stew 1
#14 Dave Stew 1
#15 John Stew 2
#YOU CAN PASTE 'NAME1' and 'NAME2' to a 'PAIR' if necessary
#output$PAIR = apply(output, 1, function(x) paste(sort(x[1:2]), collapse = " "))
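The output above lists every pair twice, once in each order (for example Dave/Anna and Anna/Dave). A minimal sketch of collapsing these into unordered pairs follows. It uses only base R instead of reshape2, and the names n1, n2, Pair and unique_pairs are illustrative, not from the answer above.

```r
# Same data as above
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC")
)

# Pairwise co-occurrence counts, flattened to long format
a <- table(df) %*% t(table(df))
output <- data.frame(as.table(a))
names(output) <- c("Name1", "Name2", "Value")

# Compare as character so the result does not depend on factor levels
n1 <- as.character(output$Name1)
n2 <- as.character(output$Name2)

# Drop self-pairs, then build an order-independent key for each pair
keep <- n1 != n2
output <- output[keep, ]
output$Pair <- paste(pmin(n1[keep], n2[keep]), pmax(n1[keep], n2[keep]))

# Keep the first occurrence of each unordered pair
unique_pairs <- output[!duplicated(output$Pair), c("Pair", "Value")]
unique_pairs
```

With four names this should leave six rows, one per unordered pair.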
This seems to be a little closer to what you are asking for, and uses only functions in base R. Using the "df" from @db's answer:
x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
na.omit(data.frame(x))
# Name Name.1 Freq
# 5 Anna Dave 1
# 9 Anna John 2
# 10 Dave John 2
# 13 Anna Stew 1
# 14 Dave Stew 1
# 15 John Stew 2
Using NA for the diag and the lower.tri allows us to easily remove the values we are not interested in.
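Since the question asks for the top 3 most similar pairs, the result can then be sorted by frequency. This is only a sketch building on the approach above: the column names Name1/Name2 and the variables pairs and top3 are my own, and note that head() cuts arbitrarily among ties if more than three pairs share the highest count.

```r
# Same data as above
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC")
)

# Co-occurrence counts, keeping only the upper triangle (one row per pair)
x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
pairs <- na.omit(data.frame(x))
names(pairs) <- c("Name1", "Name2", "Freq")

# Sort by descending frequency and take the first three rows
pairs <- pairs[order(-pairs$Freq), ]
top3 <- head(pairs, 3)
top3
```

For this data, the three pairs with a frequency of 2 all involve John, matching the result the question expects.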