Determine (dis)similarity of multi-word strings on a word-by-word basis
I'm working on string distance in multi-word strings, as in this toy data:
df <- data.frame(
col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
I'd like to determine the (dis)similarity of each row compared to the next row on a word-by-word basis. I use this code:
library(dplyr)
library(tidyr)
library(stringdist)
library(stringr) # provides str_c(), used in summarise() below
df %>%
mutate(col2 = lead(col1, 1),
id = row_number()) %>%
pivot_longer(
# select columns:
cols = c(col1, col2),
# determine name of new column:
names_to = c(".value", "Col_N"),
# define capture groups (...) for new column:
names_pattern = "^([a-z]+)([0-9])$") %>%
# separate each word into its own row:
separate_rows(col, sep = "\\s") %>%
# recast into wider format:
pivot_wider(id_cols = c(id, Col_N),
names_from = Col_N,
values_from = col) %>%
# unnest lists:
unnest(.) %>%
# calculate string distance:
mutate(distance = stringdist(`1`, `2`)) %>%
group_by(id) %>%
# reconnect same-string words and distance values:
summarise(col1 = str_c(unique(`1`), collapse = " "),
col2 = str_c(unique(`2`), collapse = " "),
distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
id col1 col2 distance
* <int> <chr> <chr> <chr>
1 1 ab ab bc 0, 2
2 2 ab bc yyyy 4, 4
3 3 yyyy yyyy pw hhhh 0, 4, 4
4 4 yyyy pw hhhh wstjz 5, 5, 5
5 5 wstjz NA NA
While the result seems to be okay, there are three problems with it: a) there are a number of warnings, b) the code seems quite convoluted, and c) distance is of type character. So I'm wondering if there's a better way to determine the (dis)similarity of strings word by word?
A solution:
df <- data.frame(
col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
stringsAsFactors=FALSE
)
comps = function(a.row){
paste(stringdist(unlist(strsplit(as.character(a.row[1]), ' ')),
unlist(strsplit(as.character(a.row[2]), ' '))),
collapse = ' ')
}
df %>%
mutate(col2 = lead(col1, 1)) %>%
mutate(distance = apply(., 1, comps))
There should be a way to avoid using as.character in the strsplit function.
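As a hedged sketch of the same idea without apply() or as.character: strsplit is vectorised, so the whole column can be split in one call as long as col1 is character. The names word_dist and next_words are made up for illustration, and base R's adist (plain Levenshtein) stands in for stringdist (whose default is OSA) so that the snippet runs without extra packages. On this toy data the values match the output shown in the question.

```r
# Toy data from the question; col1 is kept as character.
df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: element-wise edit distance between two word vectors,
# recycling the shorter one (mirroring how stringdist() recycles its inputs).
word_dist <- function(a, b) {
  n <- max(length(a), length(b))
  a <- rep_len(a, n)
  b <- rep_len(b, n)
  one_pair <- function(x, y) {
    if (is.na(x) || is.na(y)) NA_real_ else adist(x, y)[1, 1]
  }
  as.numeric(mapply(one_pair, a, b, USE.NAMES = FALSE))
}

# strsplit() works on the whole column at once.
words <- strsplit(df$col1, " ", fixed = TRUE)
# Base-R stand-in for lead(): shift by one and pad with NA.
next_words <- c(words[-1], list(NA_character_))

# Numeric list-column, plus an optional character label for display.
df$distance <- Map(word_dist, words, next_words)
df$distance_label <- vapply(df$distance, paste, character(1), collapse = ", ")
```

Keeping distance as a numeric list-column addresses point c) of the question; the label column is only there for reading the result.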
How about something like this:
mydf <- data.frame(
col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
mydf
library(dplyr)
library(stringdist)
mydf %>%
mutate(col1_lead = lead(col1)) %>%
apply(1, function(x){
stringdist(
unlist(strsplit(x["col1"], " ")),
unlist(strsplit(x["col1_lead"], " "))
)}
) %>%
cbind() %>%
`colnames<-`("distance") %>%
cbind(mydf)
Below is my simple, honest idea.
I make list-columns holding the words and calculate the distance row by row with unlist (because stringdist needs a vector), and keep the distance as a list-column.
ans <- df %>%
as_tibble() %>%
mutate(id = row_number(), # not used
col2 = lead(col1, 1),
sep_col1 = str_split(col1, " "),
sep_col2 = str_split(col2, " ")) %>% # or str_split(lead(col1, 1))
rowwise() %>%
mutate(dist = list(stringdist(unlist(sep_col1), unlist(sep_col2))),
for_just_look = paste(dist, collapse = ", ")) %>%
ungroup()
ans
# col1 id col2 sep_col1 sep_col2 dist for_just_look
# <chr> <int> <chr> <list> <list> <list> <chr>
# 1 ab 1 ab bc <chr [1]> <chr [2]> <dbl [2]> 0, 2
# 2 ab bc 2 yyyy <chr [2]> <chr [1]> <dbl [2]> 4, 4
# 3 yyyy 3 yyyy pw hhhh <chr [1]> <chr [3]> <dbl [3]> 0, 4, 4
# 4 yyyy pw hhhh 4 wstjz <chr [3]> <chr [1]> <dbl [3]> 5, 5, 5
# 5 wstjz 5 NA <chr [1]> <chr [1]> <dbl [1]> NA
Setting aside my comments below, the straightforward version would be this:
library(data.table)
setDT(df)
df[, col1 := list(str_split(col1, " "))]
df[, col2 := lead(col1, 1)]
df[, distance := lapply(.I, function(x) { stringdist(col1[x][[1]], col2[x][[1]]) })]
Be careful with any stringdist-like function: on a huge dataset, making all the comparisons is quite expensive. Also keep in mind what you are going to use the distance values for. Are you truly interested in the distance? Or are you interested in, say, all pairs with a distance < x? If so, you most likely would not consider a compared to axxxxxxxxxxxxxxx a close match, but you could detect that difference from the lengths of the strings, for example, which takes far fewer resources to compute than the actual distance.
Also, it would be a waste of computation to blindly compute row by row. Let's just make a slightly longer sample set:
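A minimal sketch of that length check, under some assumptions: is_close and max_dist are hypothetical names, and base R's adist (Levenshtein) is used instead of stringdist to keep the snippet dependency-free. It relies on the fact that an edit distance built from insertions, deletions and substitutions can never be smaller than the difference in string lengths.

```r
# Hypothetical helper: is the edit distance between a and b at most max_dist?
is_close <- function(a, b, max_dist) {
  # Cheap lower bound: each edit changes the length by at most one character.
  len_gap <- abs(nchar(a) - nchar(b))
  if (len_gap > max_dist) {
    return(FALSE)  # rejected without computing the actual distance
  }
  # Only pairs that pass the length check pay for the real computation.
  adist(a, b)[1, 1] <= max_dist
}

is_close("a", "axxxxxxxxxxxxxxx", max_dist = 2)  # rejected by the length check
is_close("ab", "bc", max_dist = 2)               # needs the real distance
```

The length check can only rule pairs out; pairs that pass it still need the real distance.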
c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")
Here you would calculate the distance between yyyy and yyyy three times, although it should be done only once (well, actually you should catch those with an equality check first), and likewise three times for yyyy and hhhh / hhhh and yyyy.
With a small dataset you probably do not have to worry, but with large sets and longer strings it can become messy and slow pretty fast.
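One possible way to avoid the repeated work, sketched in base R under stated assumptions: the variable names (pairs, uniq, key) are illustrative, adist (Levenshtein) replaces stringdist so the code runs without extra packages, and the distance is assumed to be symmetric so that (yyyy, hhhh) and (hhhh, yyyy) can share one entry.

```r
rows <- c("ab", "ab bc", "yyyy", "yyyy pw hhhh",
          "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")
words <- strsplit(rows, " ", fixed = TRUE)

# Word pairs for each row versus the next row, recycling the shorter side.
pair_list <- lapply(seq_len(length(words) - 1), function(i) {
  a <- words[[i]]
  b <- words[[i + 1]]
  n <- max(length(a), length(b))
  data.frame(row = i, a = rep_len(a, n), b = rep_len(b, n),
             stringsAsFactors = FALSE)
})
pairs <- do.call(rbind, pair_list)

# Order-independent key (assumes words never contain a tab character).
pairs$key <- paste(pmin(pairs$a, pairs$b), pmax(pairs$a, pairs$b), sep = "\t")

# One entry per distinct pair; identical words get 0 without a distance call.
uniq <- pairs[!duplicated(pairs$key), c("a", "b", "key")]
uniq$distance <- 0
differs <- uniq$a != uniq$b
uniq$distance[differs] <- mapply(function(x, y) adist(x, y)[1, 1],
                                 uniq$a[differs], uniq$b[differs],
                                 USE.NAMES = FALSE)

# Map the cached distances back onto every row-level pair.
pairs$distance <- uniq$distance[match(pairs$key, uniq$key)]
```

On this sample set the 17 row-level word pairs collapse to 11 distinct keys, of which only 8 need an actual distance computation. The saving grows with how often words repeat in the data.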