
What R function to use for regex capture groups?

I am doing some text wrangling in R, and for a specific extraction I need to use a capture group. For some reason the base/stringr functions I am familiar with don't seem to support capture groups:

str_extract("abcd123asdc", pattern = "([0-9]{3}).+$") 
# Returns: "123asdc"

stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"

grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"

The usual googling for "R capture group regex" doesn't give any useful hits for solutions to this problem. Am I missing something, or are capture groups not implemented in R?

EDIT: After trying the solution suggested in the comments, which works on a small example, I found that it fails in my situation.

Note that this text comes from the Enron emails dataset, so it doesn't contain sensitive information.

txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: phillip.allen@enron.com
To: leah.arsdall@enron.com
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\sent mail   
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!"

sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group

Since we only have a single capture group, shouldn't the "\\1" capture it? I tested the regex with an online regex tester and it should be working. I also tried both \n and \\n for the newlines. Any ideas?

Getting the job done

You can always extract capture groups with stringr using str_match or str_match_all:

> result <- str_match(txt, "X-FileName:.+\n\n(?s)(.+)$")
> result[,2]
[1] "test successful.  way to go!!!"

Pattern details:

  • X-FileName: - a literal substring
  • .+ - any 1+ chars other than line break (since in ICU regex, a dot does not match a line break char by default)
  • \n\n - 2 newline symbols
  • (?s) - an inline DOTALL modifier (now, . that occurs to the right will match a line break char)
  • (.+) - Group 1 capturing any 1+ chars (incl. line breaks) up to
  • $ - the end of string.
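
As a minimal sketch of how the result is laid out, the short string from the question can be used: str_match returns a character matrix whose first column holds the whole match and whose following columns hold the capture groups. This assumes the stringr package is installed; the variable name m is only for illustration.

```r
library(stringr)

# Column 1 is the full match, column 2 is capture group 1
m <- str_match("abcd123asdc", "([0-9]{3}).+$")
m[, 1]  # "123asdc"
m[, 2]  # "123"
```

With several capture groups, each group gets its own column, in the order the opening parentheses appear in the pattern.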

Or you may use base R regmatches with regexec :

> result <- regmatches(txt, regexec("X-FileName:[^\n]+\n\n(.+)$", txt))
> result[[1]][2]
[1] "test successful.  way to go!!!"

See the online R demo. Here, a TRE regex is used (the default engine for regexec; recent R versions also accept perl = TRUE there), so . will match any character including a line break char. Thus, the pattern will look like X-FileName:[^\n]+\n\n(.+)$:

  • X-FileName: - a literal string
  • [^\n]+ - 1+ chars other than newline
  • \n\n - 2 newlines
  • (.+) - any 1+ chars (including line break chars), as many as possible, up to
  • $ - the end of string.
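
The same idea on the short string from the question, as a sketch in base R only: regexec records the positions of the whole match and of each group, and regmatches turns them into one character vector per input string. The variable names x and parts are illustrative.

```r
x <- "abcd123asdc"

# One list element per input string: element 1 is the full match,
# the following elements are the capture groups
parts <- regmatches(x, regexec("([0-9]{3}).+$", x))
parts[[1]][1]  # "123asdc"
parts[[1]][2]  # "123"
```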

A sub option can also be considered:

sub(".*X-FileName:[^\n]+\n\n", "", txt)
[1] "test successful.  way to go!!!"

See this R demo. Here, .* matches any 0+ chars, as many as possible (all the string), then backtracks to find the X-FileName: substring, [^\n]+ matches 1+ chars other than a newline, and then \n\n matches 2 newlines.
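
For completeness, a hedged sketch of the backreference form the question was aiming for. sub only replaces the part of the string that the pattern matches, and it returns the input unchanged when nothing matches, so the pattern has to consume the whole string for "\\1" to be all that remains. The shortened string msg below is a made-up stand-in for txt, and the X-Missing pattern is a deliberately non-matching example.

```r
# Hypothetical, shortened stand-in for the email text
msg <- "X-Origin: Allen-P\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# No match: sub() hands back its input untouched
no_match <- sub("X-Missing:(.+)$", "\\1", msg)
identical(no_match, msg)  # TRUE

# The leading .* consumes the headers, so only group 1 survives the replacement
msg_body <- sub(".*X-FileName:[^\n]+\n\n(.+)$", "\\1", msg)
msg_body  # "test successful.  way to go!!!"
```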

Comparing performance

Taking into account hwnd's comment, I added a TRE regex based sub option above, and it seems to be the fastest of all 4 options suggested, with str_match being almost as fast as my sub code above:

library(microbenchmark)
library(stringr)  # provides str_match

# Each function works on its argument `text` rather than on the global `txt`
f1 <- function(text) { return(str_match(text, "X-FileName:.+\n\n(?s)(.+)$")[,2]) }                     # stringr (ICU)
f2 <- function(text) { return(regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]) } # base R, TRE
f3 <- function(text) { return(sub('(?s).*X-FileName:[^\n]+\\R+', '', text, perl=TRUE)) }               # base R, PCRE
f4 <- function(text) { return(sub('.*X-FileName:[^\n]+\n\n', '', text)) }                              # base R, TRE

> test <- microbenchmark( f1(txt), f2(txt), f3(txt), f4(txt), times = 500000 )
> test
Unit: microseconds
    expr    min     lq     mean median     uq       max neval  cld
 f1(txt) 21.130 24.451 28.08150 27.168 28.677 53796.565 5e+05  b  
 f2(txt) 29.280 32.903 37.46800 35.318 37.431 54556.635 5e+05   c 
 f3(txt) 57.655 59.466 63.36906 60.674 61.881  1651.448 5e+05    d
 f4(txt) 22.036 23.545 25.56820 24.451 25.356  1660.504 5e+05 a   
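
Before reading much into the timings, it may be worth confirming that the candidates return the same value. A small base R check follows, using a made-up shortened message (msg) in place of txt; the stringr variant is left out only to keep the sketch dependency-free.

```r
# Hypothetical, shortened stand-in for the email text
msg <- "X-Origin: Allen-P\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# Same patterns as f2, f3 and f4, applied to the stand-in message
r2 <- regmatches(msg, regexec("X-FileName:[^\n]+\n\n(.+)$", msg))[[1]][2]
r3 <- sub("(?s).*X-FileName:[^\n]+\\R+", "", msg, perl = TRUE)
r4 <- sub(".*X-FileName:[^\n]+\n\n", "", msg)

# TRUE when all three approaches agree
all_agree <- identical(r2, r3) && identical(r3, r4)
all_agree
```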
