
Determine (dis)similarity of multi-word strings on a word-by-word basis

I'm working on string distance in multi-word strings, as in this toy data:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)

I'd like to determine the (dis)similarity of each row compared to the next row on a word-by-word basis. I use this code:

library(dplyr)
library(tidyr)
library(stringr)   # provides str_c(), used in summarise() below
library(stringdist)
df %>%
  mutate(col2 = lead(col1, 1),
         id = row_number()) %>%
  pivot_longer(
    # select columns:
    cols = c(col1, col2),
    # determine name of new column:
    names_to = c(".value", "Col_N"), 
    # define capture groups (...) for new column:
    names_pattern = "^([a-z]+)([0-9])$") %>%
  # separate each word into its own row:
  separate_rows(col, sep = "\\s") %>%
  # recast into wider format:
  pivot_wider(id_cols = c(id, Col_N), 
              names_from = Col_N, 
              values_from = col) %>%
  # unnest lists:
  unnest(.) %>%
  # calculate string distance:
  mutate(distance = stringdist(`1`, `2`)) %>%
  group_by(id) %>%
  # reconnect same-string words and distance values:
  summarise(col1 = str_c(unique(`1`), collapse = " "),
            col2 = str_c(unique(`2`), collapse = " "),
            distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
     id col1         col2         distance
* <int> <chr>        <chr>        <chr>   
1     1 ab           ab bc        0, 2    
2     2 ab bc        yyyy         4, 4    
3     3 yyyy         yyyy pw hhhh 0, 4, 4 
4     4 yyyy pw hhhh wstjz        5, 5, 5 
5     5 wstjz        NA           NA   

While the result seems to be okay, there are three problems with it: a) there are a number of warnings, b) the code seems quite convoluted, and c) distance is of type character. So I'm wondering if there's a better way to determine the (dis)similarity of strings word by word?

A solution:

library(dplyr)
library(stringdist)

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors = FALSE
)

# split both strings of a row into words and paste the word-by-word distances
comps <- function(a.row) {
  paste(stringdist(unlist(strsplit(as.character(a.row[1]), ' ')),
                   unlist(strsplit(as.character(a.row[2]), ' '))),
        collapse = ' ')
}

df %>%
  mutate(col2 = lead(col1, 1)) %>%
  mutate(distance = apply(., 1, comps))
  1. There should be a way to avoid the as.character call inside strsplit.
  2. I'm not sure that you can have a column of vectors in a data frame; this might be the reason for all the warnings and for the character type of the distance. Here I cast the distances into a string to keep all the values in the same column.
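
On both points, here is a minimal base R sketch. `Map()` iterates over the two character columns directly, so no `as.character()` is needed, and a data frame does accept a list column, which keeps the distances numeric. Note the assumptions: `utils::adist()` (Levenshtein distance) is swapped in for `stringdist()` so the sketch runs without extra packages, and `word_dist` is a hypothetical helper name.

```r
# Toy data from the question; the next row's string goes into col2.
df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors = FALSE
)
df$col2 <- c(df$col1[-1], NA)

# Hypothetical helper: word-by-word edit distance between two strings.
# The shorter word vector is recycled, mirroring what stringdist() does.
word_dist <- function(a, b) {
  if (is.na(a) || is.na(b)) return(NA_real_)   # no next row to compare with
  wa <- strsplit(a, " ", fixed = TRUE)[[1]]
  wb <- strsplit(b, " ", fixed = TRUE)[[1]]
  n <- max(length(wa), length(wb))
  wa <- rep_len(wa, n)
  wb <- rep_len(wb, n)
  # adist() returns a 1 x 1 matrix per pair; as.vector() drops the dims.
  mapply(function(x, y) as.vector(adist(x, y)), wa, wb, USE.NAMES = FALSE)
}

# Map() returns a list, which a data frame accepts as a list column.
df$distance <- unname(Map(word_dist, df$col1, df$col2))
```

Each element of `df$distance` is a numeric vector with one value per word pair, so it can be summarised or filtered without parsing a string.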

How about something like this:

mydf <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
mydf


library(dplyr)
library(stringdist)
mydf %>% 
  mutate(col1_lead = lead(col1)) %>% 
  apply(1, function(x){
    stringdist(
      unlist(strsplit(x["col1"], " ")), 
      unlist(strsplit(x["col1_lead"], " "))
    )}
  ) %>% 
  cbind() %>% 
  `colnames<-`("distance") %>% 
  cbind(mydf)

Below is my simple, honest idea.
I make list-columns holding the words and calculate the distance row by row with unlist (because stringdist needs vectors), and I keep the distance as a list-column.

library(stringr)   # provides str_split()

ans <- df %>%
  as_tibble() %>% 
  mutate(id = row_number(),   # not used below
         col2 = lead(col1, 1),
         sep_col1 = str_split(col1, " "),
         sep_col2 = str_split(col2, " ")) %>%    # or str_split(lead(col1, 1), " ")
  rowwise() %>% 
  mutate(dist = list(stringdist(unlist(sep_col1), unlist(sep_col2))),
         for_just_look = paste(dist, collapse = ", ")) %>% 
  ungroup()

ans

#  col1            id col2         sep_col1  sep_col2  dist     for_just_look
#  <chr>        <int> <chr>        <list>    <list>    <list>    <chr>   
# 1 ab               1 ab bc        <chr [1]> <chr [2]> <dbl [2]> 0, 2    
# 2 ab bc            2 yyyy         <chr [2]> <chr [1]> <dbl [2]> 4, 4    
# 3 yyyy             3 yyyy pw hhhh <chr [1]> <chr [3]> <dbl [3]> 0, 4, 4 
# 4 yyyy pw hhhh     4 wstjz        <chr [3]> <chr [1]> <dbl [3]> 5, 5, 5 
# 5 wstjz            5 NA           <chr [1]> <chr [1]> <dbl [1]> NA      

Setting aside my comments below, the straightforward version would be this.

library(data.table)
library(dplyr)       # provides lead()
library(stringr)     # provides str_split()
library(stringdist)
setDT(df)

df[, col1 := list(str_split(col1, " "))]
df[, col2 := lead(col1, 1)]
df[, distance := lapply(.I, function(x) { stringdist(col1[x][[1]], col2[x][[1]]) })]

Be careful with any stringdist-like function: on a huge dataset, making all the comparisons is quite expensive. Also keep in mind what you are going to use the distance values for. Are you truly interested in the distance itself? Or are you interested in, say, all pairs with a distance < x? If so, you most likely would not consider a compared to axxxxxxxxxxxxxxx a close match, and you can already see that from the difference in string length, which takes far fewer resources to calculate than the actual distance.
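
The length shortcut can be sketched as follows. An edit distance is never smaller than the difference in length of the two strings, so any pair whose lengths differ by at least x can be rejected without computing a distance at all. This is only an illustration under stated assumptions: `within_dist` and `max_dist` are hypothetical names, the inputs are equal-length character vectors without NAs, and base R's `adist()` (Levenshtein) stands in for `stringdist()`.

```r
# Hypothetical helper: is the edit distance between a[i] and b[i] below max_dist?
within_dist <- function(a, b, max_dist) {
  # Cheap lower bound: the distance is at least the difference in length.
  len_gap <- abs(nchar(a) - nchar(b))
  result <- len_gap < max_dist          # a FALSE here is already final
  # Only pairs that pass the length check need the expensive computation.
  todo <- which(result)
  if (length(todo) > 0) {
    result[todo] <- mapply(
      function(x, y) as.vector(adist(x, y)) < max_dist,
      a[todo], b[todo], USE.NAMES = FALSE
    )
  }
  result
}

a <- c("a", "yyyy", "ab")
b <- c("axxxxxxxxxxxxxxx", "yyyx", "bc")
close_enough <- within_dist(a, b, max_dist = 2)
```

In this example the first pair is rejected on length alone, and only the remaining two pairs reach `adist()`.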

It would also be a waste of computation to blindly compute row by row. Let's make a slightly longer sample set.

c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")

Here you would calculate the distance between yyyy and yyyy three times, although it should be done only once (well, actually you should catch those with an "is equal" check first), and likewise three times for yyyy and hhhh / hhhh and yyyy.
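
One way to avoid this repeated work is to compute each distinct word pair once and look the result up afterwards. The sketch below is an illustration rather than a tuned implementation. It assumes the distance is symmetric (true for Levenshtein with default weights), uses base R's `adist()` in place of `stringdist()`, and all variable names as well as the small `pairs` table are made up for the example.

```r
# Word pairs as they might come out of a row-by-row comparison,
# including repeats, swapped pairs and identical words.
pairs <- data.frame(
  w1 = c("yyyy", "yyyy", "yyyy", "hhhh", "yyyy", "wstjz"),
  w2 = c("yyyy", "hhhh", "yyyy", "yyyy", "hhhh", "wstjz"),
  stringsAsFactors = FALSE
)

# Treat (a, b) and (b, a) as the same pair by ordering within each pair.
lo <- pmin(pairs$w1, pairs$w2)
hi <- pmax(pairs$w1, pairs$w2)
key <- paste(lo, hi, sep = "\r")   # assumes words never contain "\r"

# Keep distinct pairs only; identical words get distance 0 without any call.
uniq <- !duplicated(key)
u_lo <- lo[uniq]
u_hi <- hi[uniq]
u_dist <- numeric(length(u_lo))
differ <- u_lo != u_hi
if (any(differ)) {
  u_dist[differ] <- mapply(function(x, y) as.vector(adist(x, y)),
                           u_lo[differ], u_hi[differ], USE.NAMES = FALSE)
}

# Look each original pair up in the table of distinct results.
pairs$distance <- u_dist[match(key, key[uniq])]
```

Six comparisons shrink to three distinct pairs here, and only one of them (hhhh vs. yyyy) needs an actual distance computation.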

With a small dataset you probably do not have to worry, but with large sets and longer strings it can become messy and slow pretty fast.
