
R: regex from first character to the end of the string

I have strings like these:

a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"

I would like to use regular expressions in R to extract the text from each "-" up to the next delimiter (a space or comma), to get:

en et 'en 'et
ten ter terne

I have found a solution, but it does not feel very satisfying or elegant:

a <- unlist(strsplit(a, " |,"))
a <- a[grep("-", a)]
a <- gsub("-", "", a)

b <- unlist(strsplit(b, " |,"))
b <- b[grep("-", b)]
b <- gsub("-", "", b)

Do you have a suggestion for a more elegant one-liner that extracts all the endings I want?

I think you need to match a - that is not preceded by a word char (so it does not match when it is part of a compound word), optionally followed by a ', and then followed by 1+ word chars. You can use

a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"
pat <- "\\B-\\K'?\\w+"
res_a <- regmatches(a, gregexpr(pat, a, perl=TRUE))
unlist(res_a)
## [1] "en"  "et"  "'en" "'et"
res_b <- regmatches(b, gregexpr(pat, b, perl=TRUE))
unlist(res_b)
## [1] "ten"   "ter"   "terne"

See the online R demo.

Pattern details:

  • \\B - a non-word boundary (the hyphen must not directly follow a word char)
  • - - a hyphen
  • \\K - match reset operator (drops the text matched so far from the returned match)
  • '? - an optional '
  • \\w+ - 1 or more letters, digits or _
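To see what the non-word boundary contributes, here is a minimal sketch that runs the same pattern on two made-up strings (they are illustrative, not from the question). The hyphen inside "e-mail" follows a word char, so it should be skipped, while a hyphen after a space or at the start of the string is matched.

```r
# Same pattern as in the answer; the test strings are hypothetical.
pat <- "\\B-\\K'?\\w+"
x <- c("e-mail -en", "-'et")

# gregexpr() finds all matches per string; regmatches() extracts them.
res <- regmatches(x, gregexpr(pat, x, perl = TRUE))
print(res)
```

Only the ending after the free-standing hyphen is returned for the first string; the "mail" part of the compound word is not picked up.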

We can use str_extract_all from stringr:

library(stringr)
str_extract_all(a, '(?<=-)[^, ]+')[[1]]
#[1] "en"  "et"  "'en" "'et"


str_extract_all(b, '(?<=-)[^, ]+')[[1]]
#[1] "ten"   "ter"   "terne"
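If loading stringr is not an option, the same lookbehind pattern should also work in base R, since gregexpr supports lookbehind when perl = TRUE. This is a sketch rather than part of the original answer; the helper name extract_endings is made up for illustration.

```r
a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"

# Hypothetical helper: everything after a hyphen up to the next comma or space.
extract_endings <- function(x) {
  regmatches(x, gregexpr("(?<=-)[^, ]+", x, perl = TRUE))[[1]]
}

res_a <- extract_endings(a)
res_b <- extract_endings(b)
print(res_a)
print(res_b)
```

Unlike the \\B-based pattern, this lookbehind also matches after a hyphen inside a compound word, so it suits input where every hyphen introduces an ending.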

If you want to keep it in base R, I do not think you will get it much more elegant than what you have (and you can always make that a one-liner). The value argument of grep might help you a bit, as below.

Maybe

substring(grep("-'?\\w", strsplit(a, " ")[[1]], value = TRUE), 2)

or

gsub("-", "", grep("-'?\\w", strsplit(a, " ")[[1]], value = TRUE))

can be considered slightly more elegant.
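Both one-liners split on spaces only, so with b the trailing commas would stay attached (giving "ten," rather than "ten"). A possible variant, offered as a sketch and not as part of the original answer, splits on runs of spaces or commas and strips only the leading hyphen; the helper name endings is hypothetical.

```r
a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"

# Hypothetical helper: split on spaces/commas, keep tokens that start
# with a hyphen, then drop that leading hyphen.
endings <- function(x) {
  tokens <- strsplit(x, "[ ,]+")[[1]]
  sub("^-", "", grep("^-", tokens, value = TRUE))
}

out_a <- endings(a)
out_b <- endings(b)
print(out_a)
print(out_b)
```

Using sub("^-", ...) instead of gsub("-", ...) leaves any hyphen inside a token untouched.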
