Find the distance between groups of strings in R

I have a very large dataset, which looks like this.

I have two types of data frames

  1. my reference data.frame
ref=c("cake","brownies")

and my experimental data.frame

expr=c("cak","cakee","cake", "rownies","browwnies")

I want to match the ref and expr data.frames and find the Levenshtein distance between them. The output could look like this...

ref   expr      distance 
cake  cak         1
cake  cakee       1
cake  cake        0
cake  rownies    ...

After I have measured their Levenshtein distance, I want to put any strings with a distance less than 3 into one cluster, so that my data might look like

ref        expr      distance  cluster
cake       cak         1         1
cake       cakee       1         1
cake       cake        0         1
brownies   rownies     1         2 
brownies   browwnies   1         2

Any help or advice on how to move on is appreciated. At the moment I am trying a lot of R packages to find the distance between data.frames, such as

library("DescTools")

but they do not seem to work well.

Here are 2 ways I'd approach it: one that's strictly supervised and more manual, and another that takes a less supervised route. The package stringdist has a bunch of different distance metrics, where "lv" is Levenshtein. I added an additional observation, "poundcake", to test with a word that's too far from the reference words.

Option 1

Get a matrix of the distances between each experimental string and each of the reference strings. This could have issues if you have 2 similar reference strings, or if an experimental word is equally close to 2 references, but it works for this simple case. Then reshape the matrix into a data frame, and count along reference words to get cluster numbers. Filter for cases where the distance is within your threshold.
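
As a quick sanity check of what the Levenshtein distance measures, here is a minimal sketch that uses base R's utils::adist() as a stand-in for stringdist (an assumption on my part, chosen so it runs without extra packages). With its default costs, adist() counts single-character insertions, deletions and substitutions. The name pair_dist is only illustrative.

```r
# Base-R stand-in for stringdist(method = "lv"): with default costs,
# adist() counts insertions, deletions and substitutions, each as 1.
pair_dist <- utils::adist("cake", c("cak", "cakee", "cake", "poundcake"))

# One row (for "cake") and one column per comparison string.
pair_dist
```

"cak" and "cakee" are each one edit away from "cake", while "poundcake" needs five insertions, which is why it makes a useful too-far test case.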

library(dplyr)
library(stringdist)

max_dist <- 3

ref <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies","browwnies")

mtx <- stringdistmatrix(ref, expr, method = "lv", useNames = "strings")

mtx
#>          cak cakee cake poundcake rownies browwnies
#> cake       1     1    0         5       6         8
#> brownies   8     7    7         8       1         1

df1 <- as.data.frame(mtx) %>%
  tibble::rownames_to_column("ref") %>%
  tidyr::pivot_longer(-ref, names_to = "expr", values_to = "dist") %>%
  mutate(clust = as.numeric(forcats::as_factor(ref))) # could also use data.table::rleid

df1 %>%
  filter(dist <= max_dist)
#> # A tibble: 5 × 4
#>   ref      expr       dist clust
#>   <chr>    <chr>     <dbl> <dbl>
#> 1 cake     cak           1     1
#> 2 cake     cakee         1     1
#> 3 cake     cake          0     1
#> 4 brownies rownies       1     2
#> 5 brownies browwnies     1     2
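
If the caveat above matters for your data (an experimental word within the threshold of more than one reference), one possible variation is to keep only the nearest reference for each experimental string. The following is a sketch in base R using utils::adist() rather than stringdist, and the names nearest, best and nearest_df are made up for illustration. Ties go to whichever reference comes first in ref.

```r
ref  <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies", "browwnies")
max_dist <- 3

# Rows are experimental strings, columns are reference strings.
d <- utils::adist(expr, ref)

# Index of the closest reference for each row (which.min keeps the first on ties).
nearest <- apply(d, 1, which.min)
# Distance from each experimental string to that closest reference.
best <- d[cbind(seq_along(expr), nearest)]

nearest_df <- data.frame(
  ref     = ref[nearest],
  expr    = expr,
  dist    = best,
  cluster = nearest
)

# Drop strings that are too far from every reference, e.g. "poundcake".
nearest_df <- nearest_df[nearest_df$dist <= max_dist, ]
nearest_df
```

Each experimental string then appears at most once, so no separate deduplication step is needed.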

Option 2

This might work for more complex cases. I've used it for correcting the spelling of people's names, where I have an incomplete set of correct labels to work from. Combine all the words into 1 vector, get a distance matrix (this time it will be square), then create clusters from hierarchical clustering, using the threshold as the height at which to cut the tree. You can then look up the reference word for each cluster to get labels for the clusters.

The downside here is that you get rows for reference words that never appeared in the experimental data. For example, "brownies" was never spelled correctly in the experimental strings, but it now shows up as an observation.

all_words <- c(ref, expr)
hc <- hclust(stringdistmatrix(all_words, method = "lv", useNames = "strings"))

df2 <- data.frame(word = c(ref, expr), 
                  clust = cutree(hc, h = max_dist)) %>%
  mutate(r = ref[clust])

df2 %>%
  filter(!is.na(r))
#>        word clust        r
#> 1      cake     1     cake
#> 2  brownies     2 brownies
#> 3       cak     1     cake
#> 4     cakee     1     cake
#> 5      cake     1     cake
#> 6   rownies     2 brownies
#> 7 browwnies     2 brownies
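
If the extra reference rows are a problem, one way around it is to flag which positions in the combined vector came from ref and drop them after labelling. This is a sketch under a few assumptions: it uses base R's utils::adist() in place of stringdist, the names is_ref, label, res and out are invented here, and clusters are labelled with match() instead of relying on cluster numbers lining up with the order of ref. If two reference words ever landed in the same cluster, the first one would win.

```r
ref  <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies", "browwnies")
max_dist <- 3

all_words <- c(ref, expr)
is_ref <- seq_along(all_words) <= length(ref)  # TRUE for the reference words

# Square Levenshtein matrix, converted to a dist object for hclust().
hc <- hclust(as.dist(utils::adist(all_words)))  # complete linkage by default
clust <- cutree(hc, h = max_dist)

# Label each cluster with the reference word it contains (NA if it has none).
label <- ref[match(clust, clust[is_ref])]

# Keep experimental rows only, then drop words with no reference in their cluster.
res <- data.frame(word = all_words, clust = clust, ref = label)[!is_ref, ]
out <- res[!is.na(res$ref), ]
out
```

Here "poundcake" ends up in a cluster of its own with no reference word, so it is dropped along with the two reference rows.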
