How to calculate the Jaccard similarity between groups in R?

I have the data set shown below. It records which books each shop sells.

library(tibble)

df <- tribble(
  ~shop, ~book_id,
  "A",   1,
  "B",   1,
  "C",   2,
  "D",   3,
  "E",   3,
  "A",   3,
  "B",   4,
  "C",   5,
  "D",   1
)

In the data set,

  • shop A sells 1, 3
  • shop B sells 1, 4
  • shop C sells 2, 5
  • shop D sells 3, 1
  • shop E sells only 3

Now I want to calculate the Jaccard index for each pair of shops. For instance, take shop A and shop B. Together they sell three different books (book 1, book 3, book 4), but only one book is sold by both shops (book 1). So the Jaccard index here should be 33.3% (1/3).

Here is a sample of the desired output:

desired <- tribble(
  ~shop_1, ~shop_2, ~similarity,
  "A",     "B",     33.3,
  "B",     "A",     33.3,
  "A",     "C",     0,
  "C",     "A",     0,
  "A",     "D",     100,
  "D",     "A",     100,
  "A",     "E",     50,
  "E",     "A",     50
)

Any comments or assistance would be really appreciated. Thanks in advance.

I don't know of a package for this, but you can write your own function. I assume that by similarity you mean something like this:

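# Jaccard index: items in common divided by distinct items overall
similarity <- function(x, y) {
  k <- length(intersect(x, y))  # books sold by both shops
  n <- length(union(x, y))      # distinct books sold by either shop
  k / n                         # a fraction in [0, 1]; multiply by 100 for percent
}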
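As a quick sanity check, the function can be applied by hand to the book lists from the question. This is only a sketch: the vectors shop_a, shop_b and shop_d are typed in manually for illustration and are not part of the original data frame.

```r
# Jaccard index: size of the intersection divided by size of the union.
similarity <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Book ids copied by hand from the question (illustrative names).
shop_a <- c(1, 3)
shop_b <- c(1, 4)
shop_d <- c(3, 1)

sim_ab <- similarity(shop_a, shop_b)  # one shared book out of three distinct books
sim_ad <- similarity(shop_a, shop_d)  # same set of books, order does not matter
```

Note that two empty vectors would give 0 / 0, which is NaN in R, so shops with no books may need separate handling.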

Then you can use tidyr::crossing to pair every row of the per-shop data frame with every row of a copy of itself:

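library(tidyverse)

# One row per shop, with its book ids collected in a list column
dfg <- df %>% group_by(shop) %>% summarise(books = list(book_id))
# Cross the table with itself (columns suffixed _A / _B), drop self-pairs, score each pair
crossing(dfg %>% set_names(paste0, "_A"), dfg %>% set_names(paste0, "_B")) %>% 
  filter(shop_A != shop_B) %>% 
  mutate(similarity = map2_dbl(books_A, books_B, similarity))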
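If you would rather avoid extra packages, roughly the same result can be sketched in base R. This is an alternative under my own assumptions, not the approach above: the helper name jaccard, the use of split() and expand.grid(), and the rounding to one decimal place are choices made here to match the percentage format in the question.

```r
# The data from the question, as a plain data frame.
df <- data.frame(
  shop    = c("A", "B", "C", "D", "E", "A", "B", "C", "D"),
  book_id = c(1, 1, 2, 3, 3, 3, 4, 5, 1)
)

# Hypothetical helper name; same intersection / union calculation as above.
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

# One vector of book ids per shop, named by shop.
books <- split(df$book_id, df$shop)

# All ordered pairs of distinct shops (A-B and B-A both appear).
pairs <- expand.grid(shop_1 = names(books), shop_2 = names(books),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$shop_1 != pairs$shop_2, ]

# Similarity as a percentage, rounded to one decimal place.
pairs$similarity <- round(100 * mapply(
  function(a, b) jaccard(books[[a]], books[[b]]),
  pairs$shop_1, pairs$shop_2,
  USE.NAMES = FALSE
), 1)
```

With five shops this yields 20 rows, one for each ordered pair, so every pair is listed in both directions as in the desired output.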
