pairwise_similarity in R: not all rows compared to all others

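My understanding of the pairwise_similarity function in R (from the widyr package) is that it compares every item with every other item.

For example, if you have 3 text items:

  • item 1 is compared to items 2 and 3

  • item 2 is compared to items 1 and 3

  • item 3 is compared to items 1 and 2

However, that does not seem to be what happens here.

Here is my data: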

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "green and black"))

d

column_id     description
        1     red and yellow
        2     yellow and blue
        3     green and black   # notice how item 3 has no common words with the other two


# unnest the words and remove stop words

library(dplyr)   # provides the %>% pipe used below

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words, by = "word") %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

# complete pairwise similarity

d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 2 × 3

  item1 item2 similarity
    2     1      0.120
    1     2      0.120

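Notice how item 3 is not compared to items 1 and 2 at all? Why is that? If I change item 3 so that it shares a word with item 2 ("blue"), some comparisons are added, but still not all of them: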

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "blue and black"))

d


column_id     description
        1     red and yellow
        2     yellow and blue
        3     blue and black

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words, by = "word") %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)



d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 4 × 3
  item1 item2 similarity
    2     1      0.245
    1     2      0.245
    3     2      0.245   # 3 not compared to 1 at any point - why?
    2     3      0.245

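Am I misunderstanding pairwise similarity? Or is it that, by default, two pieces of text with no words in common have a similarity of zero, and such rows are simply omitted? Does anyone know whether that is the explanation?

I could not find any documentation on this.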

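It is not "similarity == 0" that makes the rows disappear. A word that occurs in every item has idf = 0, so its tf-idf is zero as well. So if we add a "common" word, e.g. pink, to all three items, every pair shares a term and is returned, even where the similarity is 0: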

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black pink"))   ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

gives:

# A tibble: 6 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.120
2     3     1      0    
3     1     2      0.120
4     3     2      0    
5     1     3      0    
6     2     3      0   
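
The zero weight of the shared word can be checked by hand. The following is a minimal base-R sketch, assuming the usual definitions tf = count / words in the item and idf = ln(number of items / number of items containing the word), which is what `bind_tf_idf` is documented to compute. The token lists are typed in by hand (with "and" already removed as a stop word), and the variable names are illustrative only.

```r
# Tokens per item after stop-word removal, typed by hand for illustration.
docs <- list(
  `1` = c("red", "yellow", "pink"),
  `2` = c("yellow", "blue", "pink"),
  `3` = c("green", "black", "pink")
)

vocab <- sort(unique(unlist(docs)))

# Term counts: one row per item, one column per word.
counts <- t(vapply(
  docs,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab

tf  <- counts / rowSums(counts)                  # term frequency within each item
idf <- log(nrow(counts) / colSums(counts > 0))   # natural log, as assumed above
tf_idf <- sweep(tf, 2, idf, `*`)                 # multiply each column by its idf

idf[["pink"]]       # ln(3/3): pink occurs in all three items
tf_idf[, "pink"]    # so pink carries no weight in any item
round(tf_idf, 3)
```

Under these assumptions pink is present in every item but contributes nothing to any dot product, which is consistent with the zero similarities in the output above.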

If we replace the "common" pink in item 3 with a "unique" brown, so that item 3 has no words in common with item 1 or item 2:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black brown")) ### here

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

gives:

# A tibble: 2 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.214
2     1     2      0.214
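
The two results above can be compared side by side with a small base-R sketch. `pairwise_dense` is a hypothetical helper written for this illustration, not part of widyr; it assumes the same tf-idf definitions as `bind_tf_idf` (natural-log idf) and takes hand-typed token lists with stop words already removed. It returns a dense cosine similarity matrix plus a count of words shared by each pair, so pairs that the tidy output omits remain visible as explicit zeros.

```r
# Hypothetical helper: dense cosine similarity and shared-word counts
# for a named list of token vectors (stop words already removed).
pairwise_dense <- function(docs) {
  vocab <- sort(unique(unlist(docs)))
  counts <- t(vapply(
    docs,
    function(words) tabulate(match(words, vocab), nbins = length(vocab)),
    integer(length(vocab))
  ))
  colnames(counts) <- vocab

  tf     <- counts / rowSums(counts)
  idf    <- log(nrow(counts) / colSums(counts > 0))
  tf_idf <- sweep(tf, 2, idf, `*`)

  # Cosine similarity: dot products divided by the product of vector norms.
  # An item whose weights are all zero would give NaN here.
  norms  <- sqrt(rowSums(tf_idf^2))
  cosine <- tcrossprod(tf_idf) / outer(norms, norms)

  # Number of distinct words each pair of items has in common.
  shared <- tcrossprod((counts > 0) * 1)

  list(cosine = cosine, shared = shared)
}

pink <- pairwise_dense(list(
  `1` = c("red", "yellow", "pink"),
  `2` = c("yellow", "blue", "pink"),
  `3` = c("green", "black", "pink")
))

brown <- pairwise_dense(list(
  `1` = c("red", "yellow", "pink"),
  `2` = c("yellow", "blue", "pink"),
  `3` = c("green", "black", "brown")
))

pink$shared             # every pair shares at least "pink"
round(pink$cosine, 3)   # items 1 and 3 share a word, yet their similarity is 0

brown$shared            # item 3 shares nothing with items 1 and 2
round(brown$cosine, 3)  # the zero entries are the pairs missing from the tibble
```

In both cases the pairs that appear in the tibbles above are exactly the pairs with a non-zero `shared` count. That matches the explanation that a pair is dropped when the two items have no term in common, not when its similarity happens to be zero. If every combination is needed, a dense matrix like this one is one way to get it.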
