R: regular expression from the first character to the end of the string

Question

I have strings like these:

a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"

And I would like to use regular expressions in R to extract the text from the "-" to the first non-character, and thus get:

en et 'en 'et
ten ter terne

I have found a solution, but it does not feel very satisfying or elegant:

a <- unlist(strsplit(a, " |,"))
a <- a[grep("-", a)]
a <- gsub("-", "", a)

b <- unlist(strsplit(b, " |,"))
b <- b[grep("-", b)]
b <- gsub("-", "", b)

Do you have a suggestion for a more elegant one-liner that extracts all the endings I want?

Answer 1

I think you need to match a - that is not preceded by a word char (that is, no match when it is part of a compound word), followed by an optional ' and then by 1+ word chars. Then, you can use

a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"
pat <- "\\B-\\K'?\\w+"
res_a <- regmatches(a, gregexpr(pat, a, perl=TRUE))
unlist(res_a)
## [1] "en"  "et"  "'en" "'et"
res_b <- regmatches(b, gregexpr(pat, b, perl=TRUE))
unlist(res_b)
## [1] "ten"   "ter"   "terne"

See the online R demo.

Pattern details:

\\B - a non-word boundary (the hyphen must not directly follow a word char)
- - a hyphen
\\K - match reset operator (drops the text matched so far from the returned match)
'? - an optional '
\\w+ - 1 or more letters, digits or _
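
To see what the \\B part contributes, here is a minimal sketch comparing the pattern with and without it. The test string x is made up for illustration and is not from the question; "e-mail" stands in for a compound word.

```r
# Hypothetical input: "e-mail" is a compound word, "-en" and "-et" are endings
x <- "e-mail -en, -et"

# Pattern from the answer: the hyphen must not directly follow a word char
with_b <- unlist(regmatches(x, gregexpr("\\B-\\K'?\\w+", x, perl = TRUE)))

# Same pattern without \\B: the hyphen inside "e-mail" also matches
without_b <- unlist(regmatches(x, gregexpr("-\\K'?\\w+", x, perl = TRUE)))

print(with_b)
print(without_b)
```

With \\B only the two endings should come back, while the variant without it should also pick up "mail".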

Answer 2

We can use str_extract_all from stringr

library(stringr)
str_extract_all(a, '(?<=-)[^, ]+')[[1]]
#[1] "en"  "et"  "'en" "'et"


str_extract_all(b, '(?<=-)[^, ]+')[[1]]
#[1] "ten"   "ter"   "terne"
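
If stringr is not available, the same lookbehind pattern should also work in base R through gregexpr with perl = TRUE. This is only a sketch of an equivalent, not part of the answer above; res_a and res_b are illustrative names.

```r
a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"

# (?<=-) requires a hyphen just before the match; [^, ]+ then takes
# everything up to the next comma or space
pat <- "(?<=-)[^, ]+"

res_a <- unlist(regmatches(a, gregexpr(pat, a, perl = TRUE)))
res_b <- unlist(regmatches(b, gregexpr(pat, b, perl = TRUE)))

print(res_a)
print(res_b)
```

Note that this pattern accepts any hyphen, including one inside a compound word, and keeps any character other than a comma or space, so it is looser than a \\w-based pattern.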

Answer 3

If you want to keep it in base R, I do not think you will get it much more elegant than what you have (and you can always make that a one-liner). The value argument of grep might help you a bit, as below.

Maybe

substring(grep("-'?\\w", strsplit(a, " ")[[1]], value = TRUE), 2)

or

gsub("-", "", grep("-'?\\w", strsplit(a, " ")[[1]], value = TRUE))

can be considered slightly more elegant.
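
Both variants split on a single space only, so for b the tokens would keep their trailing commas. A possible adjustment, sketched here as an assumption rather than as part of the answer, is to split on runs of spaces and commas. The helper name endings is hypothetical, and the ^ anchor is added so that only tokens starting with a hyphen are kept.

```r
a <- "-en eller -et eller (uofficielt) -'en eller (uofficielt) -'et"
b <- "-ten, -ter, -terne"

# Hypothetical helper: split on spaces and commas, keep tokens that start
# with "-" (optionally followed by '), then drop the leading hyphen
endings <- function(x) {
  tokens <- strsplit(x, "[ ,]+")[[1]]
  substring(grep("^-'?\\w", tokens, value = TRUE), 2)
}

print(endings(a))
print(endings(b))
```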

3 answers

Answer 1: 2 votes, accepted, 2017-07-12 11:42:24

Answer 2: 1 vote, 2017-07-12 11:32:25

Answer 3: 1 vote, 2017-07-12 11:43:27
