Extract strings using fuzzy LR patterns in R

I have been struggling with this for a long time.

I can extract everything between my left and right patterns in a string, as the following example shows.

library(tidyverse)

data=c("everything will be ok one day")

str_extract(string = data, pattern = "(?<=thing).*(?=ok one)")
#> [1] " will be "

Created on 2022-01-26 by the reprex package (v2.0.1)

As the code shows, I extract everything between "thing" and "ok one".

I need to allow for mismatches inside these patterns. I want to allow a maximum of two mismatches, counting insertions and deletions (indels) as mismatches.


Example

For example, one mismatch that I want to account for is the insertion of the letter "s" in "everything":

dat.1=c("everythings will be ok one day")

In this case I would like to be able to extract the phrase

will be 

PS: This is just a simplified example. My actual data does not contain gaps, and it is more complicated. I look forward to your help and guidance.

One way is to use fuzzy string matching, relying, for instance, on the stringdist package: split the string into words and, for each delimiter string (thing and ok, in your example), find the word with the highest similarity score (that is what the function maxsim does below). The words between the two best-matching positions are then joined back together.

library(tidyverse)
library(stringdist)

dat.1=c("everythings will be ok one day")

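# Return the index of the space-separated word most similar to `delim`
maxsim <- function(text, delim)
{
  text %>% 
    str_split(" ") %>% unlist %>% 
    # stringsim returns a similarity in [0, 1]; map_dbl yields a numeric vector
    map_dbl(~ stringsim(delim, .x)) %>% 
    which.max
} 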

dat.1 %>% 
  str_split(" ") %>% unlist %>% 
  .[ (maxsim(dat.1,"thing") + 1) : (maxsim(dat.1,"ok") - 1) ] %>% 
  str_c(collapse = " ")

#> [1] "will be"
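
The question mentions that the real data contains no gaps, in which case splitting on spaces has nothing to split on. As a rough sketch only, and not part of the answer above, one alternative is to slide a fixed-width window over the string and accept the window closest to each delimiter if it is within two edits. The helper names `fuzzy_locate` and `fuzzy_between` are made up for illustration; the only library call is base R's `utils::adist`. Because the window has the same width as the delimiter, this handles substitutions best, and an indel inside a delimiter can shift the extracted boundary by a character.

```r
# Hypothetical helper (not from the original answer): find the fixed-width
# window of `text` closest to `delim`, accepted only within `max_dist` edits.
fuzzy_locate <- function(text, delim, max_dist = 2) {
  n <- nchar(delim)
  if (nchar(text) < n) return(NULL)
  starts  <- seq_len(nchar(text) - n + 1)
  windows <- substring(text, starts, starts + n - 1)
  # Generalized Levenshtein distance between the delimiter and every window
  d <- as.vector(utils::adist(delim, windows))
  best <- which.min(d)  # first window with the smallest distance
  if (d[best] > max_dist) return(NULL)
  c(start = starts[best], end = starts[best] + n - 1)
}

# Hypothetical helper: text between the fuzzy left and right delimiters,
# or NA if either delimiter has no window within `max_dist` edits.
fuzzy_between <- function(text, left, right, max_dist = 2) {
  l <- fuzzy_locate(text, left, max_dist)
  if (is.null(l)) return(NA_character_)
  # Search for the right delimiter only after the end of the left one
  rest <- substring(text, l[["end"]] + 1)
  r <- fuzzy_locate(rest, right, max_dist)
  if (is.null(r)) return(NA_character_)
  substring(rest, 1, r[["start"]] - 1)
}

fuzzy_between("everythang will be ok one day", "thing", "ok one")
#> [1] " will be "
```

Note that for dat.1 this sketch returns "s will be ", because "thing" still occurs exactly inside "everythings" and the extra "s" falls after the delimiter. Whether that "s" belongs to the delimiter or to the extracted text depends on how the patterns are defined in the real data.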
