Extract subset of a string following specific text in R

Question

I am trying to extract all of the words in the string below that are contained within the brackets following the word 'tokens', but only if 'tokens' occurs after 'tag(noun)'.

For example, I have the string:

m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."

I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.

Therefore, I want my output to be a vector of the following:

[1] new, york, state, department

How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.

Thanks!

Answer 1

Remove the newlines and then extract the portion of each match that corresponds to the parenthesized part of the pattern pat. Then split the resulting strings on commas and simplify the result into a character vector:

library(gsubfn)

pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)

giving:

[1] "new"        "york"       "state"      "department"

Visualization: Here is the Debuggex representation of the regular expression in pat (note that the backslash must be doubled when the pattern is written inside R's double quotes):

 tag.noun.,tokens..(.*?)\]

[Image: regular expression visualization (Debuggex Demo)]

Note that .*? means match the shortest string of any characters such that the entire pattern still matches. Without the ? it would try to match the longest such string.
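To make the lazy versus greedy difference concrete, here is a small base R sketch. The input string s is a made-up toy example, not the string from the question, and the variable names are only for illustration.

```r
# Hypothetical toy input with two bracketed groups
s <- "tokens([a,b]) tokens([c])"

# Lazy quantifier: stops at the first closing bracket
lazy <- regmatches(s, regexpr("\\[(.*?)\\]", s, perl = TRUE))

# Greedy quantifier: runs on to the last closing bracket
greedy <- regmatches(s, regexpr("\\[(.*)\\]", s, perl = TRUE))

print(lazy)    # "[a,b]"
print(greedy)  # "[a,b]) tokens([c]"
```

With the greedy form, the two token lists would be swallowed into a single match, which is why the pattern above uses .*? instead.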

Answer 2

How about something like this? Here I'll use the regcapturedmatches helper function to make it easier to extract the captured matches (it is not part of base R, so it has to be defined or sourced first).

m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."

rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl = TRUE)
lapply(regcapturedmatches(m,rx), function(x) {
    unlist(strsplit(c(x),","))
})

# [[1]]
# [1] "new"        "york"       "state"      "department"

The regular expression is a bit messy because the text you want to match contains many characters that are special in regular expressions (parentheses and square brackets), so each of them has to be escaped.
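If the helper function is not available, roughly the same result can be sketched with base R only. This is an alternative under that assumption, not the method used above: it finds the whole matches with gregexpr and regmatches, then uses sub with a backreference to keep just the captured group. The names hits, inner and words are illustrative.

```r
m <- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."

# Same pattern as above; the parentheses capture the token list
rx <- "tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)"

# Whole matches, e.g. "tag(noun),tokens([department])"
hits <- regmatches(m, gregexpr(rx, m, perl = TRUE))[[1]]

# Keep only the captured group from each whole match
inner <- sub(rx, "\\1", hits, perl = TRUE)

# Split the comma-separated lists into one character vector
words <- unlist(strsplit(inner, ",", fixed = TRUE))
print(words)
```

Applying the pattern twice is slightly wasteful, but it avoids any dependency beyond base R.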

Answer 3

Here is a one-liner if you like:

paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl = TRUE))), collapse = ",")
[1] "new,york,state,department"

Broken down:

# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl = TRUE)

# Extract the matches
matches <- regmatches(m, indices)

# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"
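Two caveats may be worth a sketch. First, the result above is a single comma-joined string, while the question asks for a vector. Second, if m is entered with the line breaks shown in the question, a newline can sit between 'tag(noun),' and 'tokens', and the lookbehind will then miss that entry. The example below uses a shortened, hypothetical string m2 (not the full input) to show one way of handling both.

```r
# Shortened, hypothetical input with a line break after "tag(noun),"
m2 <- "mod([tag(noun),tokens([new,york,state])]),head([tag(noun),\ntokens([department])])"

# Drop the newlines so that "tag(noun),tokens([" is contiguous again
flat <- gsub("\n", "", m2, fixed = TRUE)

# Same lookbehind pattern as in the one-liner
pat <- "(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*"
hits <- unlist(regmatches(flat, gregexpr(pat, flat, perl = TRUE)))

# Split on commas instead of pasting, to get one word per element
words <- unlist(strsplit(hits, ",", fixed = TRUE))
print(words)
```

Without the gsub step, only the first token list in m2 would be found.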

3 answers

solution1 1 ACCEPTED 2014-12-04 19:23:50

solution2 0 2014-12-04 18:58:52

solution3 0 2014-12-04 19:23:18
