I have 3 vectors. One contains text or the actual words/sentences(text), One vector contains the words I want to search for (xreg) and the 3rd vector (categories) contains the categories each text should belong to if a match is found. Here are the 3 vectors:
text <- c("Sole Service here", "Freedom to Include","Freedom to Incl","Premier Reg",
"Bankhall","Bankhall","Premier Regiona","St James Play",
"Premier Regional","Health online","Premier Regional",
"Tenet","Health on line","Tenet","Nations","Woolwich",
"Premier Regional","Lifesearch","Nations","Bankhall",
"Premier Regional","Sole Service her","Lifesearch",
"Premier Regional","Sole Service","Nations",
"Sole Service","First Money service","Sole Service",
"Nations wide","Sole Service","Premier Region")
text <- tolower(text)
xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
"nations","woolwich","life","money")
categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")
I want to search through the ' text ' vector based on the search words present in the ' xreg ' vector. And then upon finding a match I want to put those words into the category mentioned in the ' categories ' vector.
So something like, look for the word 'sole' and where there is a match note down the index of that word or just simply create a data frame with the words and then a separate column to state the category it should belong too. In the case of 'sole' put it in the 'SS' category. 'freedom' put it in the 'FD' category and so on.
Solution so far: I can search one by one for each keyword and it will tell me the indexes where it finds a match.
reg_func <- function(x){grep(x,text)
}
reg_func("sole")
reg_func("freedom")
This will give me the indexes for each matched word, which I can then use to update the categories. Is there a way I can do it quicker? rather then searching one word at a time? Thanks
You can do it like this:
data: (modified to have a double match in 1. entry und no match last entry)
text <- c("Sole Service here, premier", "Freedom to Include","Freedom to Incl","Premier Reg",
"Bankhall","Bankhall","Premier Regiona","St James Play",
"Premier Regional","Health online","Premier Regional",
"Tenet","Health on line","Tenet","Nations","Woolwich",
"Premier Regional","Lifesearch","Nations","Bankhall",
"Premier Regional","Sole Service her","Lifesearch",
"Premier Regional","Sole Service","Nations",
"Sole Service","First Money service","Sole Service",
"Nations wide","Sole Service","Premier Region", "no match in here!!!")
#text <- tolower(text) # not needed, use ignore.case = T later
xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
"nations","woolwich","life","money")
categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")
code:
names(categories) = xreg # create named vector
ans <- data.frame(text = I(text)) # create a data.frame where you store it all.
ans$xreg_m<-
apply(
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)
ans$xreg_m[!lengths(ans$xreg_m)] <- NA # if no match is found. character(0) is returned. I want to have NA instead. character(0) has a length of 0. I'm using this knowledge to find them.
ans$categories_m<-
sapply(ans$xreg_m, function(x) unique(unname( categories[x] )))
result:
# text xreg_m categories_m
#1 Sole Service here, premier sole, premier SS, PR
#2 Freedom to Include freedom FD
#3 Freedom to Incl freedom FD
#4 Premier Reg premier PR
#5 Bankhall bankhall BK
#6 Bankhall bankhall BK
#7 Premier Regiona premier PR
#8 St James Play james JM
#9 Premier Regional premier PR
#10 Health online health HT
#11 Premier Regional premier PR
#12 Tenet NA NA
#13 Health on line health HT
#14 Tenet NA NA
#15 Nations nations NT
#16 Woolwich woolwich WW
#17 Premier Regional premier PR
#18 Lifesearch life LF
#19 Nations nations NT
#20 Bankhall bankhall BK
#21 Premier Regional premier PR
#22 Sole Service her sole SS
#23 Lifesearch life LF
#24 Premier Regional premier PR
#25 Sole Service sole SS
#26 Nations nations NT
#27 Sole Service sole SS
#28 First Money service money MY
#29 Sole Service sole SS
#30 Nations wide nations NT
#31 Sole Service sole SS
#32 Premier Region premier PR
#33 no match in here!!! NA NA
Explaining the functions used in @Andre Elrico answer
apply(
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)
# Apply each xreg pattern to the text vector and see if there's a match
# result is TRUE or FALSE gives each index where there is a match
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)})
sole freedom premier bankhall james health tennet nations woolwich life money
[1,] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# Now apply each xreg element to the TRUE's from the previous result
# and see which element of xreg it matches with
apply(
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)
Result
[[1]]
[1] "sole" "premier"
[[2]]
[1] "freedom"
[[3]]
[1] "freedom"
[[4]]
[1] "premier"
[[5]]
[1] "bankhall"
[[6]]
[1] "bankhall"
Now to get the categories for each of our matched term (Regex)
sapply(ans$xreg_m, function(x) unique(unname( categories[x] )))
which says:
# Take each element of xreg_m (our matched terms) and
# see which element in the categories vector it matches with
# Then unname the result so you only get the category
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.