pairwise_similarity in R: not every row is compared with every other row

Question

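My understanding of the pairwise_similarity function in R (from the widyr package) is that it compares every item with every other item.

For example, if you have 3 text items:

Item 1 is compared with items 2 and 3
Item 2 is compared with items 1 and 3
Item 3 is compared with items 1 and 2

However, that does not seem to be happening here:

This is my data: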

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "green and black"))

d

 column_id     description
         1    red and yellow
         2    yellow and blue
         3    green and black   # notice how item 3 has no common words with the other two


# unnest the words and remove stop words

library(dplyr)  # provides the %>% pipe

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

# complete pairwise similarity

d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 2 × 3

  item1 item2 similarity
    2     1      0.120
    1     2      0.120

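Notice how item 3 is not compared with item 1 or item 2. Why is that? If I add a word to item 3 that also appears in item 2 ("blue"), some comparisons are added, but still not all of them: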

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "blue and black"))

d


column_id     description
        1     red and yellow
        2     yellow and blue
        3     blue and black

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)



d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 4 × 3
  item1 item2 similarity
    2     1      0.245
    1     2      0.245
    3     2      0.245   # 3 not compared to 1 at any point - why?
    2     3      0.245

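Am I misunderstanding pairwise similarity? Or is it that, by default, two pieces of text with zero words in common have a similarity of zero, and that row is therefore omitted? Does anyone know whether that is the explanation?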

Answer 1

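I could not find any documentation on this.

It is not "similarity == 0" that makes the rows disappear. A word that appears in every item has idf = 0, so its tf-idf is zero as well. The examples below suggest that a pair is returned only when the two items share at least one word, even if that word carries a tf-idf weight of zero. So if we add a "common" word, e.g. pink, to all three items: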

library(dplyr)  # provides the %>% pipe
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black pink"))   ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

This gives:

# A tibble: 6 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.120
2     3     1      0    
3     1     2      0.120
4     3     2      0    
5     1     3      0    
6     2     3      0
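
The zeros above can be checked by hand. The following is a minimal base R sketch, assuming the usual definitions (tf = word count / words in the item, idf = natural log of the number of items divided by the number of items containing the word) and assuming "and" has already been removed as a stop word. The names `docs`, `vocab`, `tf`, `idf`, `tfidf` and `sim` are only illustrative.

```r
# Tokens per item after stop-word removal ("and" is dropped)
docs <- list(`1` = c("red", "yellow", "pink"),
             `2` = c("yellow", "blue", "pink"),
             `3` = c("green", "black", "pink"))
vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: rows = items, columns = words
tf <- t(sapply(docs, function(w) table(factor(w, levels = vocab)) / length(w)))

# idf = ln(number of items / number of items containing the word)
idf <- log(length(docs) / colSums(tf > 0))

# tf-idf: multiply each column of tf by that word's idf
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of items
normed <- tfidf / sqrt(rowSums(tfidf^2))
sim <- normed %*% t(normed)

idf[["pink"]]   # 0, because pink occurs in all three items
round(sim, 3)   # full 3 x 3 similarity matrix
```

Items 1 and 2 come out at about 0.120, matching the tibble above, while item 3 shares only the zero-weighted "pink" with the others, so its similarities are exactly 0.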

If we instead replace the "common" pink in item 3 with a "unique" brown, so that item 3 has no words in common with item 1 or item 2:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black brown")) ### here

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

This gives:

# A tibble: 2 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.214
2     1     2      0.214
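
If every pair is needed in the result, one option is to build the full grid of item pairs and fill in the missing ones with 0. This is a minimal sketch rather than a documented widyr feature: `d_similarity` is re-created by hand from the output above so that the block runs on its own, `ids` and `all_pairs` are hypothetical names, and treating a missing pair as similarity 0 is an assumption based on the examples above.

```r
library(dplyr)
library(tidyr)

# Stand-in for the result of widyr::pairwise_similarity() shown above
d_similarity <- tibble::tibble(item1 = c(2L, 1L),
                               item2 = c(1L, 2L),
                               similarity = c(0.214, 0.214))

ids <- 1:3  # every column_id in the original data

all_pairs <- tidyr::expand_grid(item1 = ids, item2 = ids) %>%
  dplyr::filter(item1 != item2) %>%                              # drop self-pairs
  dplyr::left_join(d_similarity, by = c("item1", "item2")) %>%   # attach known similarities
  dplyr::mutate(similarity = dplyr::coalesce(similarity, 0))     # missing pair -> 0

all_pairs
```

The result has one row for each of the six ordered pairs, with the four pairs involving item 3 set to 0.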
