How to find the similarity in R?
I have a data set, shown below. It shows which book is sold by which shop.
library(tidyverse)
df <- tribble(
~shop, ~book_id,
"A", 1,
"B", 1,
"C", 2,
"D", 3,
"E", 3,
"A", 3,
"B", 4,
"C", 5,
"D", 1,
)
In this data set, I want to calculate the Jaccard index between shops. For instance, take shop A and shop B. Between them, A and B sell three distinct books (book 1, book 3, book 4). However, only one book is sold by both shops (book 1). So the Jaccard index here should be 33.3% (1/3).
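For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. Writing $S_A$ and $S_B$ for the sets of books sold by shops A and B (notation introduced here for illustration, not part of the data), the example works out as:

```latex
J(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
            = \frac{|\{1\}|}{|\{1, 3, 4\}|}
            = \frac{1}{3} \approx 33.3\%
```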
Here is a sample of the desired output:
df <- tribble(
~shop_1, ~shop_2, ~similarity,
"A", "B", 33.3,
"B", "A", 33.3,
"A", "C", 0,
"C", "A", 0,
"A", "D", 100,
"D", "A", 100,
"A", "E", 50,
"E", "A", 50,
)
Any comments or assistance would be really appreciated. Thanks in advance.
I don't know of a package for this, but you can write your own function. I guess by similarity you mean something like this:
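similarity <- function(x, y) {
  # number of items the two sets have in common
  k <- length(intersect(x, y))
  # number of distinct items across both sets
  n <- length(union(x, y))
  # Jaccard index as a proportion between 0 and 1
  k / n
}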
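As a quick check, the function can be tried on the two shops from the question. This is only a minimal sketch: the vectors `books_a` and `books_b` are typed in by hand from the question's data, and the variable names are made up for illustration.

```r
# Same definition as above, repeated so the snippet runs on its own
similarity <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

books_a <- c(1, 3)  # books sold by shop A
books_b <- c(1, 4)  # books sold by shop B

res_ab <- similarity(books_a, books_b)  # 1 shared book out of 3 distinct books
res_ab * 100                            # as a percentage, roughly 33.3

# Two empty inputs give 0 / 0, which is NaN in R
res_empty <- similarity(numeric(0), numeric(0))
```

Note that the function returns a proportion, so it needs to be multiplied by 100 to match the percentages in the desired output.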
Then you can use tidyr::crossing to merge the same data frame with itself:
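# one row per shop, with its book ids collected in a list column
dfg <- df %>% group_by(shop) %>% summarise(books = list(book_id))
# pair every shop with every other shop, then score each pair
crossing(dfg %>% set_names(paste0, "_A"), dfg %>% set_names(paste0, "_B")) %>%
  filter(shop_A != shop_B) %>%
  mutate(similarity = map2_dbl(books_A, books_B, similarity))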
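The pipeline above gives the similarity as a proportion (for example 0.333), while the desired output in the question uses percentages. If the tidyverse is not available, the same idea can be sketched in base R. This is an alternative sketch rather than the approach above; the column names `shop_1` and `shop_2` are taken from the desired output, and rounding to one decimal place is an assumption.

```r
# Shop/book data from the question, rebuilt as a base R data frame
df <- data.frame(
  shop    = c("A", "B", "C", "D", "E", "A", "B", "C", "D"),
  book_id = c(1, 1, 2, 3, 3, 3, 4, 5, 1)
)

similarity <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# One vector of book ids per shop
books <- split(df$book_id, df$shop)

# All ordered pairs of distinct shops
pairs <- expand.grid(shop_1 = names(books), shop_2 = names(books),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$shop_1 != pairs$shop_2, ]

# Jaccard index per pair, as a percentage rounded to one decimal place
pairs$similarity <- round(100 * mapply(
  function(a, b) similarity(books[[a]], books[[b]]),
  pairs$shop_1, pairs$shop_2,
  USE.NAMES = FALSE
), 1)

pairs
```

With five shops this yields 20 rows, one for each ordered pair, so both A/B and B/A appear as in the desired output.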