pairwise_similarity in R: not all rows compared to all others
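My understanding of the pairwise_similarity function in R is that it compares every item with every other item.
For example, if you have 3 text items:
Item 1 is compared with items 2 and 3
Item 2 is compared with items 1 and 3
Item 3 is compared with items 1 and 2
However, that does not seem to be what happens here.
Here is my data: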
d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "green and black"))
d
column_id description
1 red and yellow
2 yellow and blue
3 green and black # notice how item 3 has no common words with the other two
# unnest the words and remove stop words
library(dplyr)  # provides the %>% pipe used below
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
# complete pairwise similarity
d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)
d_similarity
# A tibble: 2 × 3
item1 item2 similarity
2 1 0.120
1 2 0.120
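Notice how item 3 is not compared with items 1 and 2 at all? Why is that? If I change item 3 so that it shares a word with item 2, some comparisons do appear, but still not all of them: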
d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "blue and black"))
d
column_id description
1 red and yellow
2 yellow and blue
3 blue and black
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)
d_similarity
# A tibble: 4 × 3
item1 item2 similarity
2 1 0.245
1 2 0.245
3 2 0.245 # 3 not compared to 1 at any point - why?
2 3 0.245
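Am I misunderstanding pairwise similarity? Or is it that, by default, two chunks of text with no words in common have a similarity of zero, and that row is then omitted? Does anyone know whether that could be the answer?
I have not been able to find any documentation on this.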
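It is not "similarity == 0" that makes the rows disappear. A word that appears in every item has idf = 0, so its tf-idf is zero as well. So if we add a "common" word, for example pink, to all three items: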
######################################################
######################################################
d <- data.frame(column_id = 1:3,
                description = c("red and yellow pink",
                                "yellow and blue pink",
                                "green and black pink")) ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))
gives:
# A tibble: 6 × 3
item1 item2 similarity
<int> <int> <dbl>
1 2 1 0.120
2 3 1 0
3 1 2 0.120
4 3 2 0
5 1 3 0
6 2 3 0
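The idf = 0 claim can be checked by hand. The following is a minimal base R sketch, not how tidytext or widyr are implemented internally; it assumes the stop word "and" has already been removed and that idf is the natural log of (number of documents / number of documents containing the word). The names docs, vocab, counts and cosine are made up for this illustration.

```r
# Tokenised items after stop-word removal (assumed; "and" dropped by hand)
docs <- list(
  `1` = c("red", "yellow", "pink"),
  `2` = c("yellow", "blue", "pink"),
  `3` = c("green", "black", "pink")
)
vocab <- sort(unique(unlist(docs)))

# Term counts: one row per item, one column per word
counts <- t(sapply(docs, function(w) tabulate(match(w, vocab), nbins = length(vocab))))
colnames(counts) <- vocab

tf  <- counts / rowSums(counts)                  # term frequency within each item
idf <- log(length(docs) / colSums(counts > 0))   # pink is in all 3 items, so log(3/3) = 0
tf_idf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of items
norms  <- sqrt(rowSums(tf_idf^2))
cosine <- (tf_idf %*% t(tf_idf)) / (norms %o% norms)

idf["pink"]        # 0
round(cosine, 3)   # items 1 and 2: about 0.12; items 1 and 3: 0
```

Because pink carries a tf-idf weight of zero, it contributes nothing to any similarity value, which matches the 0.120 and 0 values in the output above.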
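Item 3 still shows up above, with a similarity of 0, because it shares the word pink with the other two items. If we replace the "common" pink in item 3 with a "unique" brown, so that item 3 has no word in common with item 1 or 2: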
######################################################
######################################################
d <- data.frame(column_id = 1:3,
                description = c("red and yellow pink",
                                "yellow and blue pink",
                                "green and black brown")) ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))
gives:
# A tibble: 2 × 3
item1 item2 similarity
<int> <int> <dbl>
1 2 1 0.214
2 1 2 0.214
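If you want one row for every ordered pair of items, one possible workaround is to build the full set of pairs yourself and fill the missing ones with 0. This is only a sketch in base R: it assumes that a pair missing from the result means the two items share no word, so that a cosine similarity of 0 is the right fill value. The d_similarity data frame below is typed in by hand from the output above, and ids, all_pairs and full are names made up for this illustration.

```r
# Result of the last example, copied from the printed output
d_similarity <- data.frame(item1 = c(2L, 1L),
                           item2 = c(1L, 2L),
                           similarity = c(0.214, 0.214))

ids <- 1:3  # every column_id in the original data

# All ordered pairs of distinct items
all_pairs <- expand.grid(item1 = ids, item2 = ids)
all_pairs <- all_pairs[all_pairs$item1 != all_pairs$item2, ]

# Left join onto the similarity table; pairs that were dropped come back as NA
full <- merge(all_pairs, d_similarity, by = c("item1", "item2"), all.x = TRUE)

# Assumption: a missing pair has no shared words, hence similarity 0
full$similarity[is.na(full$similarity)] <- 0
full
```

The result has six rows, with the four pairs involving item 3 set to 0.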