Determining the (dis)similarity of multi-word strings word by word

Question

I am working on string distances in multi-word strings, as in this toy data:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)

I want to determine, word by word, the (dis)similarity between each row and the next one. I use this code:

library(dplyr)
library(tidyr)
library(stringdist)
library(stringr)  # provides str_c(), used in summarise() below
df %>%
  mutate(col2 = lead(col1, 1),
         id = row_number()) %>%
  pivot_longer(
    # select columns:
    cols = c(col1, col2),
    # determine name of new column:
    names_to = c(".value", "Col_N"), 
    # define capture groups (...) for new column:
    names_pattern = "^([a-z]+)([0-9])$") %>%
  # separate each word into its own row:
  separate_rows(col, sep = "\\s") %>%
  # recast into wider format:
  pivot_wider(id_cols = c(id, Col_N), 
              names_from = Col_N, 
              values_from = col) %>%
  # unnest lists:
  unnest(.) %>%
  # calculate string distance:
  mutate(distance = stringdist(`1`, `2`)) %>%
  group_by(id) %>%
  # reconnect same-string words and distance values:
  summarise(col1 = str_c(unique(`1`), collapse = " "),
            col2 = str_c(unique(`2`), collapse = " "),
            distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
     id col1         col2         distance
* <int> <chr>        <chr>        <chr>   
1     1 ab           ab bc        0, 2    
2     2 ab bc        yyyy         4, 4    
3     3 yyyy         yyyy pw hhhh 0, 4, 4 
4     4 yyyy pw hhhh wstjz        5, 5, 5 
5     5 wstjz        NA           NA

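Although the result seems fine, it has three problems: a) there are a lot of warnings, b) the code looks overly complicated, and c) distance is of type character. So I wonder whether there is a better way to determine the (dis)similarity of strings word by word?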

Answer 1

One approach:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors=FALSE
)

comps = function(a.row){
  paste(stringdist(unlist(strsplit(as.character(a.row[1]), ' ')), 
                   unlist(strsplit(as.character(a.row[2]), ' '))), 
        collapse = ' ')
  
}
df %>%
  mutate(col2 = lead(col1, 1)) %>%
         mutate(distance = apply(., 1, comps))

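There is probably a way to avoid the as.character call inside strsplit.
I am not sure whether you can have a column of vectors in a data frame, which may be the reason for all the warnings and for distance being of type character. Here I convert the distances to a string so that all values stay in the same column.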
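On the doubt raised above: a base R data frame can hold a list column, so the distances can stay numeric. The following is only a minimal sketch under stated assumptions. The helper word_dist is a hypothetical name introduced here, and base R's adist (generalized Levenshtein distance) stands in for stringdist so the example runs without extra packages; its results can differ from stringdist's default method when transpositions are involved.

```r
# Hypothetical helper: word-by-word distances between two strings,
# recycling the shorter word vector (as stringdist does).
# Assumption: utils::adist (Levenshtein) replaces stringdist here.
word_dist <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_real_)
  wa <- strsplit(a, " ", fixed = TRUE)[[1]]
  wb <- strsplit(b, " ", fixed = TRUE)[[1]]
  n <- max(length(wa), length(wb))
  wa <- rep_len(wa, n)
  wb <- rep_len(wb, n)
  vapply(seq_len(n),
         function(i) as.numeric(adist(wa[i], wb[i])),
         numeric(1))
}

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors = FALSE
)

# Next row's string; the last row has no successor
next_col1 <- c(df$col1[-1], NA_character_)

# Assigning a list of length nrow(df) creates a list column
df$distance <- unname(Map(word_dist, df$col1, next_col1))

str(df$distance)
```

Each element of df$distance is then a numeric vector, so something like sapply(df$distance, max) works without parsing strings.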

Answer 2

How about something like this:

mydf <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
mydf


library(dplyr)
library(stringdist)
mydf %>% 
  mutate(col1_lead = lead(col1)) %>% 
  apply(1, function(x){
    stringdist(
      unlist(strsplit(x["col1"], " ")), 
      unlist(strsplit(x["col1_lead"], " "))
    )}
  ) %>% 
  cbind() %>% 
  `colnames<-`("distance") %>% 
  cbind(mydf)

Answer 3

Here is my simple, honest idea.
I made list-cols containing the words and computed dist row by row using unlist (because stringdist needs vectors), and kept dist as a list column.

ans <- df %>%
  as_tibble() %>% 
  mutate(id = row_number(),   # not used afterwards
         col2 = lead(col1, 1),
         sep_col1 = str_split(col1, " "),
         sep_col2 = str_split(col2, " ")) %>%    # or str_split(lead(col1, 1))
  rowwise() %>% 
  mutate(dist = list(stringdist(unlist(sep_col1), unlist(sep_col2))),
         for_just_look = paste(dist, collapse = ", ")) %>% 
  ungroup()

ans

#  col1            id col2         sep_col1  sep_col2  dist     for_just_look
#  <chr>        <int> <chr>        <list>    <list>    <list>    <chr>   
# 1 ab               1 ab bc        <chr [1]> <chr [2]> <dbl [2]> 0, 2    
# 2 ab bc            2 yyyy         <chr [2]> <chr [1]> <dbl [2]> 4, 4    
# 3 yyyy             3 yyyy pw hhhh <chr [1]> <chr [3]> <dbl [3]> 0, 4, 4 
# 4 yyyy pw hhhh     4 wstjz        <chr [3]> <chr [1]> <dbl [3]> 5, 5, 5 
# 5 wstjz            5 NA           <chr [1]> <chr [1]> <dbl [1]> NA

Answer 4

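Leaving aside my remarks below, this is straightforward.

library(data.table)
library(dplyr)       # lead()
library(stringr)     # str_split()
library(stringdist)  # stringdist()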
setDT(df)

df[, col1 := list(str_split(col1, " "))]
df[, col2 := lead(col1, 1)]
df[, distance := lapply(.I, function(x) { stringdist(col1[x][[1]], col2[x][[1]]) })]

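Be careful with any stringdist-like function: running all comparisons on a huge dataset is very expensive. Also keep in mind what you are going to do with the distance values. Are you really interested in the distance itself? Or are you only interested in all distances < x? If a short string is compared with axxxxxxxxxxxxxxx, you most likely would not consider it a close match, but you can already see that difference from the lengths of the strings, for example, which takes far fewer resources to compute than the actual distance.

Blindly computing row by row is also a waste of computation. Let's make a slightly longer sample set.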
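The length check described above can be sketched as follows. This is an illustration only: the word vectors and the threshold max_dist are hypothetical, and base R's adist (Levenshtein) stands in for stringdist so the block has no package dependency. The sketch relies on the fact that an edit distance built from single-character insertions, deletions and substitutions can never be smaller than the difference in string lengths.

```r
# Hypothetical word pairs to compare element-wise
words_a <- c("a", "yyyy", "ab")
words_b <- c("axxxxxxxxxxxxxxx", "hhhh", "bc")

# Hypothetical threshold: only distances <= max_dist are of interest
max_dist <- 2

# Cheap lower bound on the edit distance: difference in length
len_gap <- abs(nchar(words_a) - nchar(words_b))

# Only pairs that pass the cheap check get the expensive computation
candidate <- len_gap <= max_dist

dist <- rep(NA_real_, length(words_a))
dist[candidate] <- vapply(
  which(candidate),
  function(i) as.numeric(adist(words_a[i], words_b[i])),
  numeric(1)
)

# Pairs rejected by the length check are never close matches
close_match <- candidate & dist <= max_dist
close_match
```

Pairs that fail the length check keep NA as their distance, which makes it explicit that the exact value was never computed.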

c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")

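Here you would compute the distance between yyyy and yyyy, which should be done only once (in fact you should catch such pairs first by testing for equality), and 3x the distance between yyyy and hhhh / hhhh and yyyy.

For small datasets you probably do not need to worry, but with large datasets and longer strings it can get messy/slow quickly.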
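One way to avoid the repeated work on the longer sample set is sketched below, under assumptions: identical words are caught by an equality test, each remaining unordered word pair is computed once, and the result is looked up for every occurrence. The names pairs, key and lookup are hypothetical, and base R's adist (Levenshtein) again stands in for stringdist.

```r
col1 <- c("ab", "ab bc", "yyyy", "yyyy pw hhhh",
          "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")
words <- strsplit(col1, " ", fixed = TRUE)

# One row per word pair (row i vs. row i + 1), recycling the shorter side
pairs <- do.call(rbind, lapply(seq_len(length(words) - 1), function(i) {
  n <- max(length(words[[i]]), length(words[[i + 1]]))
  data.frame(id = i,
             w1 = rep_len(words[[i]], n),
             w2 = rep_len(words[[i + 1]], n),
             stringsAsFactors = FALSE)
}))

# The distance is symmetric, so (x, y) and (y, x) share one key
lo <- pmin(pairs$w1, pairs$w2)
hi <- pmax(pairs$w1, pairs$w2)
key <- paste(lo, hi)  # words contain no spaces, so the key is unambiguous

# Compute each distinct, non-identical pair exactly once
todo <- !duplicated(key) & lo != hi
lookup <- setNames(
  vapply(which(todo),
         function(i) as.numeric(adist(lo[i], hi[i])),
         numeric(1)),
  key[todo]
)

# Identical words get 0 without any distance call; the rest are looked up
pairs$distance <- ifelse(lo == hi, 0, unname(lookup[key]))

nrow(pairs)     # number of comparisons needed row by row
length(lookup)  # number of distance computations actually performed
```

On this sample the lookup table is noticeably smaller than the number of row-wise comparisons; how much this saves on real data depends on how often words repeat.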

4 solutions

Solution 1
2 2021-10-22 08:52:05

Solution 2
1 2021-10-22 09:31:44

Solution 3
0 2021-10-22 09:06:09

Solution 4
0 2021-10-22 09:12:15