
Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below that are contained within the brackets following the word 'tokens', but only if 'tokens' occurs after 'tag(noun)'.

For example, I have the string:

m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."

I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.

Therefore, I want my output to be a vector of the following:

[1] new, york, state, department

How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.

Thanks!

Remove the newlines, then extract the portion matched by the parenthesized part of the pattern pat. Then split each extracted string on commas and simplify the result into a character vector:

library(gsubfn)

pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)

giving:

[1] "new"        "york"       "state"      "department"

Visualization: Here is the regular expression in pat as it was given to Debuggex. (Note that the backslash must be doubled when the pattern is written inside R's double quotes):

 tag.noun.,tokens..(.*?)\]


Note that .*? matches the shortest string of any characters such that the entire pattern still matches; without the ? it would match the longest such string.
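To see the difference between the two forms, here is a minimal base R sketch. The shortened string x is made up for illustration, and perl = TRUE is an assumption so that both patterns run under the same engine:

```r
# Hypothetical shortened input containing two bracketed token lists.
x <- "tokens([new,york,state])]),tokens([department])"

# Lazy: stops at the first "]" that lets the whole pattern match.
lazy <- regmatches(x, regexpr("tokens..(.*?)\\]", x, perl = TRUE))

# Greedy: runs on to the last "]" in the string.
greedy <- regmatches(x, regexpr("tokens..(.*)\\]", x, perl = TRUE))

print(lazy)    # "tokens([new,york,state]"
print(greedy)  # "tokens([new,york,state])]),tokens([department]"
```

The greedy version swallows both token lists in one match, which is why the lazy form is needed here.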

How about something like this? Here I'll use the regcapturedmatches helper function to make it easier to extract the captured matches. (It is not part of base R, so it has to be defined or sourced first.)

m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."

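# Find every tag(noun),tokens([...]) and capture the text inside the brackets
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl = TRUE)
# Split each captured group on commas and flatten into one character vector
lapply(regcapturedmatches(m, rx), function(x) {
    unlist(strsplit(c(x), ","))
})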

# [[1]]
# [1] "new"        "york"       "state"      "department"
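If the helper function is not available, the captured text can also be read from the attributes that gregexpr attaches to its result when perl = TRUE. This is only a sketch of that idea, not the helper's actual implementation; the shortened string s and the variable names are made up for illustration:

```r
# Hypothetical shortened input with two tag(noun) token lists.
s <- "tag(det),tokens([the])]),tag(noun),tokens([new,york,state])]),tag(noun),tokens([department])"

rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", s, perl = TRUE)[[1]]

# With perl = TRUE the result carries capture-group positions as attributes
# (one row per match, one column per capture group).
starts <- attr(rx, "capture.start")[, 1]
lens   <- attr(rx, "capture.length")[, 1]

# Cut out each captured group, then split on commas.
captured <- substring(s, starts, starts + lens - 1)
words <- unlist(strsplit(captured, ",", fixed = TRUE))
print(words)
```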

The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
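As a small illustration of why the escaping matters, compare how the same text behaves as an unescaped pattern, an escaped pattern, and a fixed (literal) pattern. The test string s is made up for this sketch:

```r
s <- "tag(noun),tokens([department])"

# Unescaped: "(" and ")" form a group, so the pattern looks for "tagnoun".
unescaped <- grepl("tag(noun)", s, perl = TRUE)

# Escaped: the parentheses are matched literally.
escaped <- grepl("tag\\(noun\\)", s, perl = TRUE)

# fixed = TRUE treats the whole pattern as a literal string,
# so no escaping is needed (but no capture groups are possible either).
literal <- grepl("tag(noun)", s, fixed = TRUE)

print(c(unescaped, escaped, literal))  # FALSE TRUE TRUE
```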

Here is a one liner if you like:

paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl = TRUE))), collapse = ",")
# [1] "new,york,state,department"

Broken down:

# Get match indices (the lookbehind keeps the "tag(noun),tokens([" prefix out of the match)
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl = TRUE)

# Extract the matches
matches <- regmatches(m, indices)

# Unlist and paste together
paste(unlist(matches), collapse = ",")
# [1] "new,york,state,department"
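The question asks for a vector of words rather than one comma-separated string, and its original string contains line breaks. A minimal variation on the same lookbehind approach that covers both might look like this (the shortened string s and the names flat and words are made up for illustration):

```r
# Hypothetical shortened input with a line break inside, as in the question.
s <- "mod([tag(noun),tokens([new,york,state])]),head([tag(noun),\ntokens([department])])"

# Drop the line breaks so "tag(noun)," and "tokens(" are adjacent again.
flat <- gsub("\n", "", s, fixed = TRUE)

# Same lookbehind pattern as above.
matches <- regmatches(flat, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", flat, perl = TRUE))

# Split on commas instead of pasting, to get one element per word.
words <- unlist(strsplit(unlist(matches), ",", fixed = TRUE))
print(words)
```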
