
Find matching one, two or three word phrases in a string in R

I am trying to identify, match and extract two-word phrases within a char column of strings in a dataframe in R.

I have a sample list of terms such as:

phrases <- as.list(c("Business","Business Process", "Processes", "Business Processes"))

and a string like:

string <- "brings seamless integration among the business processes and financials."

I am using str_extract_all (from the stringr package) and sapply like this:

sapply(str_extract_all(tolower(string), paste(tolower(phrases), collapse = "|")), function(s) paste(s, collapse=', '))

This identifies only the single-word terms, not the two-word phrase "business processes", which I need as well.

The current output is: [1] "business, processes"

But I want to be able to get "business, processes, business processes"

I have tried adding \\b word boundaries to the patterns and a \\s between the words of the two-word phrase, but it didn't help.

How should I go about extracting both the one and two-word phrases?

EDIT: I need to retain the matches as a column within the dataframe - I tried the below three suggestions and am getting the following error:

Error in `$<-.data.frame`(`*tmp*`, phrases, value = c("business", "process", : replacement has 267 rows, data has 495

My data frame has multiple columns, one of which contains the strings to match against the phrases list. I need to pull all the matches in as comma-separated values within the same row as the string. Desired output:

Row  String                                      Phrases
1    Businesses are great                        business
2    Great thing are great
3    Processes are great                         processes
4    Business Processes are great for business   business processes, processes, business

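This seems to work:

# Search for each phrase separately, with word boundaries on both sides
tmp <- sapply(phrases, function(x) regmatches(string, gregexpr(paste0("\\b", x, "\\b"), string, ignore.case = TRUE)))
unlist(tmp)
# [1] "business"           "processes"          "business processes"

Alternatively, using str_extract from stringr with mapply (phrases that do not match return NA):

# requires library(stringr)
unname(mapply(function(x, y) str_extract(x, paste0(tolower(y), "\\b")), string, phrases))
# [1] "business"           NA                   "processes"          "business processes"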
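As a rough illustration of why the single alternation pattern in the question misses the two-word phrase: a regex scan does not reuse characters it has already matched, so once "business" is consumed, the longer "business processes" can no longer match at the same position. The sketch below uses base R only and the question's sample data. perl = TRUE is an assumption made here so that alternatives are tried in order, which appears to be how str_extract_all behaved in the question.

```r
# Sample data copied from the question
string  <- "brings seamless integration among the business processes and financials."
phrases <- c("Business", "Business Process", "Processes", "Business Processes")
lower_string <- tolower(string)

# One combined pattern: matches cannot overlap, so the first alternative
# that fits ("business") wins and the longer phrase is never reported.
combined <- paste(tolower(phrases), collapse = "|")
alternation_hits <- regmatches(lower_string,
                               gregexpr(combined, lower_string, perl = TRUE))[[1]]
print(alternation_hits)
# [1] "business"  "processes"

# One search per phrase: every phrase is tested independently against the
# full string, so overlapping phrases are all found.
per_phrase_hits <- unlist(lapply(tolower(phrases), function(p) {
  regmatches(lower_string,
             gregexpr(paste0("\\b", p, "\\b"), lower_string, perl = TRUE))[[1]]
}))
print(per_phrase_hits)
# [1] "business"           "processes"          "business processes"
```

"Business Process" does not appear in the second result because the trailing \\b requires the phrase to end at a word boundary, and here it is followed by "es".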

Using grepl:

unlist(phrases[sapply(phrases, function(x) grepl(paste0("\\<", x, "\\>"), string, ignore.case = TRUE))])
# [1] "Business"           "Processes"          "Business Processes"

or for all lower case:

unlist(tolower(phrases)[sapply(tolower(phrases), function(x) grepl(paste0("\\<", x, "\\>"), tolower(string)))])
# [1] "business"           "processes"          "business processes"
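Regarding the EDIT in the question: the "replacement has 267 rows, data has 495" error is probably caused by flattening the matches with unlist, which drops rows that have no match. A minimal sketch of one way around it is to build exactly one comma-separated string per row. The data frame df and the helper match_phrases below are hypothetical, made up for illustration from the question's desired-output table.

```r
# Hypothetical data frame mirroring the question's desired-output table
df <- data.frame(
  String = c("Businesses are great",
             "Great thing are great",
             "Processes are great",
             "Business Processes are great for business"),
  stringsAsFactors = FALSE
)
phrases <- c("Business", "Business Process", "Processes", "Business Processes")

# Hypothetical helper: returns all phrases found in one string,
# lower-cased and collapsed into a single comma-separated value.
match_phrases <- function(text, phrases) {
  found <- vapply(phrases, function(p) {
    grepl(paste0("\\b", p, "\\b"), text, ignore.case = TRUE, perl = TRUE)
  }, logical(1))
  paste(tolower(phrases[found]), collapse = ", ")
}

# vapply returns exactly one string per row (an empty string when nothing
# matches), so the new column always has nrow(df) entries.
df$Phrases <- vapply(df$String, match_phrases, character(1),
                     phrases = phrases, USE.NAMES = FALSE)
print(df)
```

Note that with word boundaries on both sides, "Businesses" in row 1 does not count as a match for "Business", so that row comes back empty. If prefix matches like that are wanted, dropping the trailing \\b should allow them, at the cost of "Business Process" also matching inside "Business Processes".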
