Extracting a string between fuzzy left and right patterns in R

Question

I have been struggling with this for a long time.

I can extract everything between my left and right patterns in a string, as the following example shows.

library(tidyverse)

data=c("everything will be ok one day")

str_extract(string = data, pattern = "(?<=thing).*(?=ok one)")
#> [1] " will be "

Created on 2022-01-26 by the reprex package (v2.0.1)

As you can see in the code, I extract everything between "thing" and "ok one".

I need to allow for mismatches inside these patterns. I want to allow a maximum of two mismatches, including insertions and deletions (indels).

Example

For example, one mismatch that I want to account for is the insertion of the letter "s" in "everything":

dat.1=c("everythings will be ok one day")

In this case, I would like to be able to extract the phrase

will be

PS: This is just a simplified example. My actual data does not contain gaps, and it's complicated. I look forward to your help and guidance.

Answer 1

One way is to use fuzzy string matching, relying, for instance, on the stringdist package: for each delimiter string (thing and ok, in your example), compute the similarity score against every word and take the best-matching one (this is what the function maxsim does below).

library(tidyverse)
library(stringdist)

dat.1=c("everythings will be ok one day")

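# Return the index of the space-separated word most similar to `delim`
maxsim <- function(x, delim)
{
  x %>% 
    str_split(" ") %>% unlist %>% 
    map_dbl(~ stringsim(delim, .x)) %>%  # one similarity score per word
    which.max
} 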

dat.1 %>% 
  str_split(" ") %>% unlist %>% 
  .[ (maxsim(dat.1,"thing") + 1) : (maxsim(dat.1,"ok") - 1) ] %>% 
  str_c(collapse = " ")

#> [1] "will be"
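
Since the question mentions that the real data contains no gaps, splitting on spaces would not work there. Below is a minimal base R sketch of one possible alternative, not a tested solution for your data: it slides windows over the string and uses adist() (Levenshtein distance, so insertions, deletions and substitutions all count) to find the closest occurrence of each delimiter within two edits. The helper names fuzzy_locate and extract_between_fuzzy, and the misspelled example string, are made up for illustration.

```r
# Hypothetical helper (not from the answer above): locate the substring of `x`
# closest to `pat` in Levenshtein distance, allowing at most `max_dist` edits.
fuzzy_locate <- function(x, pat, max_dist = 2) {
  n <- nchar(x)
  m <- nchar(pat)
  lo <- max(1, m - max_dist)   # shortest window worth testing
  hi <- min(n, m + max_dist)   # longest window worth testing
  if (lo > hi) return(NULL)
  # Enumerate every window whose length is within max_dist of the pattern's
  cand <- do.call(rbind, lapply(seq(lo, hi), function(len) {
    start <- seq_len(n - len + 1)
    data.frame(start = start, end = start + len - 1)
  }))
  # adist() counts insertions, deletions and substitutions (cost 1 each)
  cand$dist <- drop(adist(pat, substring(x, cand$start, cand$end)))
  cand <- cand[cand$dist <= max_dist, ]
  if (nrow(cand) == 0) return(NULL)
  # Smallest distance first; ties go to the leftmost window
  cand[order(cand$dist, cand$start), ][1, ]
}

# Hypothetical wrapper: text between the fuzzy left and right delimiters,
# or NA if either delimiter has no match within `max_dist` edits.
extract_between_fuzzy <- function(x, left, right, max_dist = 2) {
  l <- fuzzy_locate(x, left, max_dist)
  if (is.null(l)) return(NA_character_)
  rest <- substring(x, l$end + 1)          # only search to the right of `left`
  r <- fuzzy_locate(rest, right, max_dist)
  if (is.null(r)) return(NA_character_)
  substring(rest, 1, r$start - 1)
}

# Made-up input with one substitution in each delimiter ("thimg", "ok ome")
extract_between_fuzzy("everythimg will be ok ome day", "thing", "ok one")
#> [1] " will be "
```

Note that an exact occurrence always wins over a fuzzy one, so with "everythings" and the delimiter "thing" the trailing "s" would stay in the extracted text. The scan is brute force, so it may be slow on very long strings.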
