Finding frequency of pairs within a dataset - R

I have the following data:

Name    Event

John    EventA
Anna    EventA
Dave    EventA
Stew    EventB
John    EventB
Anna    EventB
John    EventC
Stew    EventC
Dave    EventC

I want to find out which pairs of people attend the same events most often. In the example above, it should return that the top 3 most similar pairs are John & Anna, John & Dave, and John & Stew.

I assume I'd need to build a frequency matrix like the one below:

Name    John    Anna    Dave     Stew
John     0       2       2        2
Anna     2       0       1        1
Dave     2       1       0        1
Stew     2       1       1        0

And then transform it to something like this:

Pair          Frequency

John Anna         2
John Dave         2
John Stew         2
Anna Dave         1
Anna Stew         1
Dave Stew         1

But I have no idea how to go about that.

I'm working with R, so if anyone knows a way of doing this, it'd be a huge help!

You can use table() from base R and melt() from the reshape2 package.

#DATA
df = structure(list(Name = c("John", "Anna", "Dave", "Stew", "John", 
"Anna", "John", "Stew", "Dave"), Event = c("EventA", "EventA", 
"EventA", "EventB", "EventB", "EventB", "EventC", "EventC", "EventC"
)), .Names = c("Name", "Event"), row.names = c(NA, -9L), class = "data.frame")

#Get Pairwise Frequency
a = table(df) %*% t(table(df))    
a
#      Name
#Name   Anna Dave John Stew
#  Anna    2    1    2    1
#  Dave    1    2    2    1
#  John    2    2    3    2
#  Stew    1    1    2    2

#If you want, set diagonal elements to zero (From Karthik's comment)
#diag(a) <- 0 

library(reshape2)
output = data.frame(melt(a))
colnames(output) = c("Name1", "Name2", "Value")

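#Remove each name's pairing with itself
#(logical indexing stays correct even if no rows match, unlike -which(...))
output = output[as.character(output$Name1) != as.character(output$Name2),]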
output
#   Name1 Name2 Value
#2   Dave  Anna     1
#3   John  Anna     2
#4   Stew  Anna     1
#5   Anna  Dave     1
#7   John  Dave     2
#8   Stew  Dave     1
#9   Anna  John     2
#10  Dave  John     2
#12  Stew  John     2
#13  Anna  Stew     1
#14  Dave  Stew     1
#15  John  Stew     2

#YOU CAN PASTE 'NAME1' and 'NAME2' to a 'PAIR' if necessary
#output$PAIR = apply(output, 1, function(x) paste(sort(x[1:2]), collapse = " "))
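
As a side note on why the matrix product works: table(df) is a Name-by-Event incidence matrix, so entry (i, j) of the product adds up, over all events, whether both person i and person j were present. The sketch below checks one entry by hand. It assumes each person appears at most once per event (with repeated rows the table is no longer 0/1 and the product would overcount); the names inc and shared are just illustrative.

```r
# Same toy data as in the question
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC"),
  stringsAsFactors = FALSE
)

# Name-by-Event incidence matrix: 1 if the person attended, 0 otherwise
inc <- table(df)

# Entry (i, j) counts the events that person i and person j share;
# the diagonal holds each person's own number of events
shared <- inc %*% t(inc)

# Hand check for one pair: events attended by both John and Anna
john_events <- df$Event[df$Name == "John"]
anna_events <- df$Event[df$Name == "Anna"]
by_hand <- length(intersect(john_events, anna_events))

by_hand                  # count obtained from the raw data
shared["John", "Anna"]   # same count read from the matrix product
```

The diagonal is why the answer removes (or zeroes) self-pairs before ranking: those entries describe one person, not a pair.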

This seems to be a little closer to what you are asking for, and uses only functions in base R. Using the "df" from @db's answer:

x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
na.omit(data.frame(x))
#    Name Name.1 Freq
# 5  Anna   Dave    1
# 9  Anna   John    2
# 10 Dave   John    2
# 13 Anna   Stew    1
# 14 Dave   Stew    1
# 15 John   Stew    2

Setting the diagonal and the lower triangle to NA lets na.omit() drop both the self-pairs and the mirrored duplicates (e.g. John/Anna vs. Anna/John) in one step, so each pair appears exactly once.
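
To get from that table to the "top 3 pairs" asked for in the question, one option is to sort by frequency and take the first rows. This is only a sketch in base R; the column names Name1, Name2 and Pair and the object names pairs and top3 are made up for illustration, and ties are broken simply by the existing row order.

```r
# Same toy data as in the question
df <- data.frame(
  Name  = c("John", "Anna", "Dave", "Stew", "John", "Anna", "John", "Stew", "Dave"),
  Event = c("EventA", "EventA", "EventA", "EventB", "EventB", "EventB",
            "EventC", "EventC", "EventC"),
  stringsAsFactors = FALSE
)

# Pairwise counts, keeping only the upper triangle (one row per pair)
x <- as.table(tcrossprod(table(df)))
x[lower.tri(x, diag = TRUE)] <- NA
pairs <- na.omit(data.frame(x))
names(pairs) <- c("Name1", "Name2", "Freq")

# Build a single label per pair and sort by descending frequency
pairs$Pair <- paste(pairs$Name1, pairs$Name2)
pairs <- pairs[order(pairs$Freq, decreasing = TRUE), ]

# The three most frequent pairs
top3 <- head(pairs[, c("Pair", "Freq")], 3)
top3
```

With real data there may be more than three pairs tied at the top, so it can be safer to filter on pairs$Freq == max(pairs$Freq) than to rely on a fixed head() size.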
