Extract substring in R

Suppose I have a list of strings like "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR" and need to get a vector of strings that contain only the bracketed numbers, e.g. [+229][+57].

Is there a convenient way in R to do this?

Using base R, with s holding the string, try

> unlist(regmatches(s,gregexpr("\\[\\+\\d+\\]",s)))
[1] "[+229]" "[+57]"  "[+229]"
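
To see what each function contributes, here is a minimal sketch that runs the two steps separately. It assumes s holds the example string from the question; the names pattern, m and hits are only illustrative.

```r
# Assumption: `s` is the example string from the question.
s <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"

# \\[ and \\] match literal brackets, \\+ a literal plus sign,
# and \\d+ one or more digits.
pattern <- "\\[\\+\\d+\\]"

# gregexpr() returns one element per input string, holding the start
# position of every match (match lengths sit in the "match.length" attribute).
m <- gregexpr(pattern, s)

# regmatches() uses those positions to pull out the matched substrings;
# unlist() flattens the per-string list into a single character vector.
hits <- unlist(regmatches(s, m))
hits
```

Note that unlist() merges the matches from all input strings, so with several strings you lose track of which match came from which string; indexing the list instead keeps them apart.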

Or you can use

> gsub(".*?(\\[.*\\]).*","\\1",gsub("\\].*?\\[","] | [",s))
[1] "[+229] | [+57] | [+229]"

We can use str_extract_all from stringr

stringr::str_extract_all(x, "\\[\\+\\d+\\]")[[1]]
#[1] "[+229]" "[+57]"  "[+229]"

Wrap it in unique() if you need only unique values.


Similarly, in base R using regmatches and gregexpr

regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]

Data

x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"
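
As a sketch of the "wrap it in unique" suggestion, the base R variant could look like the following. The names all_hits, unique_hits and joined are illustrative, and collapsing to one string is an assumption about the output format the question asks for.

```r
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"

# All bracketed numbers, in order of appearance (duplicates included).
all_hits <- regmatches(x, gregexpr("\\[\\+\\d+\\]", x))[[1]]

# unique() keeps the first occurrence of each value and preserves order.
unique_hits <- unique(all_hits)
unique_hits

# Assumption: if a single string is wanted, collapse the unique values.
joined <- paste(unique_hits, collapse = "")
joined
```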

Seems like you want to remove the alphabetical characters, so

gsub("[[:alpha:]]", "", x)

where [:alpha:] is the class of alphabetical (lower-case and upper-case) characters, [[:alpha:]] says 'match any single alphabetical character', and gsub() says substitute, globally, any alphabetical character with the empty string "". This seems simpler than trying to match bracketed numbers, which requires figuring out which characters need to be escaped with a (double!) \\.
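
A short sketch of what this returns for the example string, plus one made-up input (hypothetical, not from the question) showing the limits of the removal approach: everything that is not a letter is kept, whether or not it sits inside brackets.

```r
x <- "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR"

# Remove every letter; brackets, plus signs and digits remain untouched.
stripped <- gsub("[[:alpha:]]", "", x)
stripped

# Hypothetical input with characters outside brackets: the "-12 " survives
# because only letters are removed.
other <- gsub("[[:alpha:]]", "", "AB-12 C[+57]D")
other
```

The duplicates are still present in the first result, so this covers removal only, not uniqueness.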

If the intention is to return the unique bracketed numbers, then the approach is to extract the matches (rather than remove the unwanted characters). Instead of using gsub() to substitute matches of a regular expression with another value, I'll use gregexpr() to identify the matches, and regmatches() to extract them. Since numbers always occur in [], I'll simplify the regular expression to match one or more (+) characters from the collection +[:digit:].

> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> xx
[[1]]
[1] "+229" "+57"  "+229"

xx is a list of length equal to the length of x. I'll write a function that, for any element of this list, makes the values unique, surrounds each value with [ and ], and concatenates them

fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")

This needs to be applied to each element of the list, and simplified to a vector, a task for sapply().

> sapply(xx, fun)
[1] "[+229][+57]"
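
Putting the steps together, a minimal sketch on a vector of several strings might look like this. The helper name unique_bracketed and the second and third inputs are made up for illustration. One behaviour worth knowing: a string with no matches yields "[]", because paste0() treats the zero-length match vector as an empty string.

```r
# Hypothetical helper combining the steps above; the name is illustrative.
unique_bracketed <- function(x) {
    # One list element per input string, each holding its "+digits" matches.
    xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
    # Deduplicate, re-add the brackets, and collapse to one string per input.
    sapply(xx, function(v) paste0("[", unique(v), "]", collapse = ""))
}

inputs <- c(
    "S[+229]EC[+57]VDSTDNSSK[+229]PSSEPTSHVAR",  # from the question
    "AK[+1]C[+1]T",                              # made-up: repeated value
    "PEPTIDE"                                    # made-up: no matches
)
res <- unique_bracketed(inputs)
res
```

The simplified pattern also matches digits or plus signs that appear outside brackets, so it relies on the stated assumption that numbers always occur in [].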

A minor improvement is to use vapply(), so that the result is robust to zero-length inputs (always returning a character vector with length equal to that of x)

> x = character()
> xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
> sapply(xx, fun)               # Hey, this returns a list :(
list()
> vapply(xx, fun, "character")  # vapply() deals with 0-length inputs
character(0)
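
The third argument of vapply() is a template for what each call must return, not a type name: "character" works above only because it happens to be a length-1 character vector. A sketch with the more conventional character(1), which also shows vapply() rejecting a result of the wrong type (the names res, bad and the integer-returning function are illustrative):

```r
fun <- function(x)
    paste0("[", unique(x), "]", collapse = "")

# Zero-length input: vapply() still returns a character vector.
x <- character()
xx <- regmatches(x, gregexpr("[+[:digit:]]+", x))
res <- vapply(xx, fun, character(1))
res

# A function returning the wrong type makes vapply() fail instead of
# silently changing the result type, as sapply() would.
bad <- tryCatch(
    vapply(list("+1"), function(v) 1L, character(1)),
    error = function(e) "error"
)
bad
```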
