Extract strings using fuzzy LR patterns in R
I have been struggling with this for a long time.
I have managed to extract everything between my left and right patterns in a string, as you can see in the following example.
library(tidyverse)
data=c("everything will be ok one day")
str_extract(string = data, pattern = "(?<=thing).*(?=ok one)")
#> [1] " will be "
Created on 2022-01-26 by the reprex package (v2.0.1)
As you can see in the code, I extract everything between "thing" and "ok one".
I need to allow for mismatches inside these patterns. I want to allow a maximum of two mismatches, including insertions and deletions (indels).
Example
For example, one mismatch that I want to account for is the insertion of the letter "s" in "everything":
dat.1=c("everythings will be ok one day")
In this case, I would like to be able to extract the phrase
will be
PS: This is just a simplified example. My actual data does not contain gaps, and it is more complicated.
I am looking forward to receiving your help and guidance.
One way is to use fuzzy string matching, relying, for instance, on the stringdist package: split the string into words and compute, for each delimiter string (thing and ok, in your example), the position of the word with the highest similarity score (that is what the function maxsim does below).
library(tidyverse)
library(stringdist)

dat.1 <- c("everythings will be ok one day")

# Index of the word in `string` that is most similar to `delim`
maxsim <- function(string, delim)
{
  string %>%
    str_split(" ") %>% unlist %>%
    map_dbl(~ stringsim(delim, .x)) %>%
    which.max
}

# Keep the words strictly between the two best-matching delimiter words
dat.1 %>%
  str_split(" ") %>% unlist %>%
  .[ (maxsim(dat.1, "thing") + 1) : (maxsim(dat.1, "ok") - 1) ] %>%
  str_c(collapse = " ")
#> [1] "will be"
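The approach above splits the string on spaces, so it needs word boundaries, and the question says the real data has none. As a possible alternative, here is a minimal character-level sketch in base R. It assumes that "mismatch" means Levenshtein edit distance (substitutions, insertions and deletions) and that the best-scoring match of each delimiter is the one wanted. The helper names fuzzy_end_costs and extract_between_fuzzy are hypothetical and do not come from any package.

```r
# Hypothetical helper: for every position j in `text`, the smallest edit
# distance between `pattern` and any substring of `text` that ends at j.
# This is the Levenshtein recurrence with a zero-cost top row, so a match
# may start anywhere in `text`. Assumes a non-empty `pattern`.
fuzzy_end_costs <- function(pattern, text) {
  p <- utf8ToInt(pattern)
  txt <- utf8ToInt(text)
  prev <- seq_along(p)              # cost of matching p[1..i] to the empty string
  costs <- integer(length(txt))
  for (j in seq_along(txt)) {
    cur <- integer(length(p))
    for (i in seq_along(p)) {
      d_diag <- if (i == 1) 0L else prev[i - 1]
      d_up   <- if (i == 1) 0L else cur[i - 1]
      cur[i] <- min(d_diag + (p[i] != txt[j]),  # match or substitution
                    prev[i] + 1L,               # extra character in text
                    d_up + 1L)                  # pattern character missing in text
    }
    costs[j] <- cur[length(p)]
    prev <- cur
  }
  costs
}

# Hypothetical helper: text between the best fuzzy match of `left` and the
# best fuzzy match of `right` that follows it; NA if either delimiter needs
# more than `max_edits` edits.
extract_between_fuzzy <- function(text, left, right, max_edits = 2L) {
  rev_str <- function(s) intToUtf8(rev(utf8ToInt(s)))

  # End of the left delimiter: leftmost position with the lowest cost.
  left_costs <- fuzzy_end_costs(left, text)
  if (length(left_costs) == 0L) return(NA_character_)
  left_end <- which.min(left_costs)
  if (left_costs[left_end] > max_edits) return(NA_character_)

  rest <- substr(text, left_end + 1L, nchar(text))
  if (nchar(rest) == 0L) return(NA_character_)

  # Start of the right delimiter: scanning the reversed strings turns
  # "end positions" into start positions.
  right_costs <- rev(fuzzy_end_costs(rev_str(right), rev_str(rest)))
  right_start <- which.min(right_costs)
  if (right_costs[right_start] > max_edits) return(NA_character_)

  substr(rest, 1L, right_start - 1L)
}

# A substitution inside the left delimiter ("thxng" instead of "thing")
extract_between_fuzzy("everythxng will be ok one day", "thing", "ok one")
#> [1] " will be "
```

Note that for dat.1 this sketch returns "s will be ": the exact "thing" inside "everythings" is the best match, so the trailing "s" is treated as part of the extracted text rather than part of the delimiter. Whether that is the desired behaviour depends on how the delimiters are defined in the real data.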