
Positive lookahead in R

Novice on regular expressions here ...

Assume the following names:

names <- c("Jackson, Michael", "Lennon, John", "Obama, Barack")

I want to truncate each name so that it retains all the characters up to and including the first letter of the first name. The results would look like this:

Jackson, M
Lennon, J
Obama, B

I know this has a simple solution, but I am stuck on specifying what seems to be a reasonable approach: a positive lookahead regex. I am trying to match on the comma, the space, and the capitalized first letter. This is what I have, but it is obviously wrong:

names.reduced <- gsub("(?=\\,\\s[A-Z]).*", "", names)

(?= ... ) is a zero-width assertion: it does not consume any characters of the string.

It only matches a position in the string. A zero-width assertion checks whether a pattern can be matched looking ahead from the current position, without adding anything to the overall match. In your pattern the lookahead therefore succeeds just before the comma, and the following .* then consumes the comma and everything after it. Note also that lookarounds in R require perl=TRUE. In this case, a lookahead assertion is not necessary at all.
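A minimal sketch of what "zero-width" means here, using the asker's own data. The variable names `m` and `attempt` are only illustrative, and `perl = TRUE` is added because R's default regex engine does not support lookaround.

```r
names <- c("Jackson, Michael", "Lennon, John", "Obama, Barack")

# The lookahead matches a position, not characters.
m <- regexpr("(?=,\\s[A-Z])", names[1], perl = TRUE)
as.integer(m)            # 8: the position of the comma
attr(m, "match.length")  # 0: nothing was consumed

# In the attempted pattern, .* therefore starts at the comma and removes it too.
attempt <- gsub("(?=,\\s[A-Z]).*", "", names, perl = TRUE)
attempt
# [1] "Jackson" "Lennon"  "Obama"
```

The comma, space and initial are only checked by the lookahead, so they remain available for `.*` to delete, which is why the attempt drops more than intended.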

You can do this using a capture group, backreferencing the group inside the replacement call.

sub('(.*[A-Z]).*', '\\1', names)
# [1] "Jackson, M" "Lennon, J"  "Obama, B"

Or better yet, you can use a negated character class to remove the trailing run of characters that are not uppercase letters (A to Z) at the end of the string.

sub('[^A-Z]*$', '', names)
# [1] "Jackson, M" "Lennon, J"  "Obama, B"
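Both patterns above rely on the first-name initial being the last uppercase letter in the string. As a hedged sketch, assuming input of the form "Last, First ..." (the middle name in `x` is made up for illustration), anchoring on the first comma avoids that dependency:

```r
# Hypothetical input with a middle name, not from the original question.
x <- c("Jackson, Michael Joseph", "Lennon, John", "Obama, Barack")

# Greedy .* runs to the LAST uppercase letter, here the middle initial.
greedy <- sub("(.*[A-Z]).*", "\\1", x)
greedy
# [1] "Jackson, Michael J" "Lennon, J"          "Obama, B"

# Anchor at the start, stop at the first comma, keep one uppercase letter.
anchored <- sub("^([^,]*,\\s*[A-Z]).*", "\\1", x)
anchored
# [1] "Jackson, M" "Lennon, J"  "Obama, B"
```

For the three names in the question both versions give the same result, so the anchored form only matters if the data can contain further capital letters after the initial.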

You can use a lookbehind instead of the lookahead assertion:

sub('(?<=, [A-Z]).*$', '', names, perl=TRUE)
#[1] "Jackson, M" "Lennon, J"  "Obama, B"  

You could also use the regmatches function.

> names <- c("Jackson, Michael", "Lennon, John", "Obama, Barack")
> regmatches(names, regexpr(".*,\\s*[A-Z]", names))
[1] "Jackson, M" "Lennon, J"  "Obama, B"

OR

> library(stringi)
> stri_extract(names, regex=".*,\\s*[A-Z]")
[1] "Jackson, M" "Lennon, J"  "Obama, B"  

OR

Just match all the characters up to the last uppercase letter.

> stri_extract(names, regex=".*[A-Z]")
[1] "Jackson, M" "Lennon, J"  "Obama, B"  
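One thing to keep in mind with the extraction approaches: regmatches() with regexpr() returns only the elements that matched, so the result can be shorter than the input, whereas sub() leaves non-matching strings unchanged. A small sketch, assuming a hypothetical vector `x` that contains a name without a comma:

```r
# "Prince" is a made-up non-matching entry for illustration.
x <- c("Jackson, Michael", "Prince")

m <- regexpr(".*,\\s*[A-Z]", x)
dropped <- regmatches(x, m)
dropped
# [1] "Jackson, M"

# Keep the result aligned with the input by filling non-matches with NA.
aligned <- rep(NA_character_, length(x))
aligned[m != -1] <- regmatches(x, m)
aligned
# [1] "Jackson, M" NA
```

Which behavior is preferable depends on whether the result has to stay aligned with the original vector, for example as a column in a data frame.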

