R: Regex madness (stringi)

Question

I have a vector of strings that look like this:

G30(H).G3(M).G0(L).Replicate(1)

Iterating over c("H", "M", "L"), I would like to extract G30 (for "H"), G3 (for "M") and G0 (for "L").

My various attempts have left me confused. The regex101.com debugger, for example, indicates that (\\w*)\\(M\\) works just fine, but transferring it to R fails ...

Answer 1

Using the stringi package and the outer() function:

library(stringi)

strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(M).G11(L).G6(H).Replicate(9)",
  "G10(M).G6(H).G8(M).Replicate(200)"  # No "L", repeated "M"
)
targets  <- c("H", "M", "L")
patterns <- paste0("\\w+(?=\\(", targets, "\\))")
matches  <- outer(strings, patterns, FUN = stri_extract_first_regex)
colnames(matches) <- targets
matches
#      H     M    L    
# [1,] "G30" "G3" "G0" 
# [2,] "G6"  "G5" "G11"
# [3,] "G6"  "G10" NA

This ignores any instances of a target letter after the first, gives you an NA when the target is not found, and returns everything in a simple matrix. The regular expressions stored in patterns match substrings like XX(Y), where Y is the target letter and XX is one or more word characters; the lookahead (?=...) keeps the parenthesised letter out of the match.
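
To see what one of these patterns does on its own, here is a minimal base-R sketch (not part of the original answer) that applies the lookahead to a single string. It assumes perl = TRUE, because R's default regex engine does not support lookaheads; the names x, pattern_m and code_m are illustrative only.

```r
x <- "G30(H).G3(M).G0(L).Replicate(1)"

# One or more word characters that are followed by, but do not include, "(M)"
pattern_m <- "\\w+(?=\\(M\\))"

# perl = TRUE switches to the PCRE engine, which understands (?=...)
code_m <- regmatches(x, regexpr(pattern_m, x, perl = TRUE))
code_m
```

The doubled backslashes are needed because the pattern is written as an R string literal; in a tester such as regex101.com the same pattern would be typed with single backslashes.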

Answer 2

I am pretty sure there are better solutions, but this works...

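jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
# One capture group per code, in the fixed order H, M, L
pattern <- '([^\\(]+)\\(H\\)\\.([^\\(]+)\\(M\\)\\.([^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
# Replace the whole match with the group of interest
H <- sub(pattern, '\\1', jnk)
M <- sub(pattern, '\\2', jnk)
L <- sub(pattern, '\\3', jnk)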

EDIT:

Actually, I once found a very nice function, parse.one, which makes it possible to work with named capture groups, much as you would with regular expressions in Python.

Have a look at this:

parse.one <- function(res, result) {
  m <- do.call(rbind, lapply(seq_along(res), function(i) {
    if(result[i] == -1) return("")
    st <- attr(result, "capture.start")[i, ]
    substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
  }))
  colnames(m) <- attr(result, "capture.names")
  m
}
jnk <- 'G30(H).G3(M).G0(L).Replicate(1)'
pattern <- '(?<H>[^\\(]+)\\(H\\)\\.(?<M>[^\\(]+)\\(M\\)\\.(?<L>[^\\(]+)\\(L\\)\\.Replicate\\(\\d+\\)'
parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))

The result looks like this:

> parse.one(jnk, regexpr(pattern, jnk, perl=TRUE))
     H     M    L   
[1,] "G30" "G3" "G0"
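
As a rough alternative sketch (not from the original answer), base R's regexec() together with regmatches() also returns the capture groups directly, without a helper function. It assumes the same fixed H, M, L order; the names x, pat, m and codes are illustrative, and the groups are unnamed here, so the names are attached by hand.

```r
x <- "G30(H).G3(M).G0(L).Replicate(1)"

# Three unnamed capture groups, one per code, in the fixed order H, M, L
pat <- "([^(]+)\\(H\\)\\.([^(]+)\\(M\\)\\.([^(]+)\\(L\\)\\.Replicate\\([0-9]+\\)"

# regexec() records the positions of the full match and of every group;
# regmatches() turns those positions into substrings
m <- regmatches(x, regexec(pat, x))[[1]]

# The first element is the full match, so drop it and label the groups
codes <- setNames(m[-1], c("H", "M", "L"))
codes
```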

Answer 3

If the order is always the same, an alternative might be to split the strings. For instance:

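library(stringr)  # provides str_split()

string <- "G30(H).G3(M).G0(L).Replicate(1)"
# Split on literal dots, then keep the part before "(" in the first three pieces
tmp <- str_split(string, "\\.")[[1]]
lapply(tmp[1:3], function(x) str_split(x, "\\(")[[1]][1])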
[[1]]
[1] "G30"

[[2]]
[1] "G3"

[[3]]
[1] "G0"
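
The same splitting idea can be sketched for a whole vector of strings using base R only. This is an illustration under the answer's assumption that the tags always appear in the order H, M, L; the second input string and the names strings, parts and codes are made up for the example.

```r
strings <- c(
  "G30(H).G3(M).G0(L).Replicate(1)",
  "G5(H).G11(M).G6(L).Replicate(9)"
)

# Split each string on literal dots
parts <- strsplit(strings, ".", fixed = TRUE)

# Keep the first three pieces and drop everything from "(" onwards;
# vapply() returns one column per string, so transpose to one row per string
codes <- t(vapply(parts, function(p) sub("\\(.*$", "", p[1:3]), character(3)))
colnames(codes) <- c("H", "M", "L")
codes
```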

Answer 4

If the codes (e.g., 'G30') preceding the tags (e.g., '(H).') or the order of the tags in the string are allowed to change (different letters or lengths), you may want to try a more flexible solution based on regexpr().

aa <-paste("G30(H).G3(M).G0(L).Replicate(",1:10,")", sep="")
my.tags <- c("H","M", "L")

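extr.data <- lapply(my.tags, function(tag) {
  # Literal "(tag)." that follows each code
  pat <- paste("\\(", tag, "\\)\\.", sep = "")
  # Match the code plus its tag, optionally preceded by a "."
  pos <- regexpr(paste("(^|\\.)([[:alnum:]])*", pat, sep = ""), aa)
  # Drop the trailing "(tag)." (3 + nchar(tag) characters) from the match
  out <- substr(aa, pos, pos + attr(pos, "match.length") - 4 - nchar(tag))
  # Remove the leading "." that is matched for non-initial codes
  gsub("(^\\.)", "", out)
})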
names(extr.data) <- my.tags
extr.data

Answer 5

I'm going to assume that both the functions (G...) and the inputs vary. This does assume that your functions start with a G and that your input is always a single uppercase letter.

parse = function(arb){
  tmp = stringi::stri_extract_all_regex(arb,"G.*?\\([A-Z]\\)")[[1]]
  unlist(lapply(lapply(tmp,strsplit,"\\)|\\("),function(x){
    output = x[[1]][1]
    names(output) = x[[1]][2]
    return(output)
  }))
}

This first parses out all the G functions with their inputs. Then, each of those is split into its function part and its input part. These are then put into a character vector of function names, each named for its input.
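
For comparison, the same extract-then-split steps can be sketched in base R without stringi. This is only an illustration, not the answer's own code: it assumes codes made of alphanumeric characters and single uppercase-letter inputs, and the names x, tokens, codes, tags and result are hypothetical.

```r
x <- "G30(H).G3(M).G0(L).Replicate(1)"

# Step 1: pull out every "code(LETTER)" token; "Replicate(1)" is skipped
# because its argument is a digit, not an uppercase letter
tokens <- regmatches(x, gregexpr("[[:alnum:]]+\\([A-Z]\\)", x))[[1]]

# Step 2: split each token into its code and its letter
codes <- sub("\\(.*$", "", tokens)
tags  <- sub("^.*\\(([A-Z])\\)$", "\\1", tokens)

# Step 3: return the codes as a character vector named by the letters
result <- setNames(codes, tags)
result
```

With the original parse() function defined above, the calls look like this: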

> parse("G30(H).G3(M).G0(L).Replicate(1)")
    H     M     L 
"G30"  "G3"  "G0" 

Or:

> parse("G35(L).G31(P).G02(K).Replicate(1)")
    L     P     K 
"G35" "G31" "G02" 
