
Count Pattern Matching in R

How would one efficiently count the number of instances of one character string which occur within another character string?

Below is my code to date. It successfully identifies if any instance of the one string occurs in the other string. However, I do not know how to extend it from a TRUE/FALSE relationship to a counting relationship.

x <- ("Hello my name is Christopher. Some people call me Chris")
y <- ("Chris is an interesting person to be around")
z <- ("Because he plays sports and likes statistics")

lll <- tolower(list(x,y,z))
dict <- tolower(c("Chris", "Hell"))

mmm <- matrix(nrow=length(lll), ncol=length(dict), NA)

for (i in seq_along(lll)) {
  for (j in seq_along(dict)) {
    # grepl() gives one TRUE/FALSE per string, so this sum is at most 1
    mmm[i,j] <- sum(grepl(dict[j],lll[i]))
  }
}

It yields:

       [,1] [,2]
 [1,]    1    1
 [2,]    1    0
 [3,]    0    0

Since the lower-case string "chris" appears twice in lll[1], I would like mmm[1,1] to be 2 instead of 1.

The real example has much higher dimensions, so I would love for the code to be vectorized instead of using my brute-force for loops.

Two quick tips:

  1. avoid the double for loop, you don't need it ;)
  2. use the stringr package


library(stringr)
dict <- setNames(nm=dict)  # name each term after itself so the output is labelled
lapply(dict, str_count, string=lll)
# $chris
# [1] 2 1 0
# $hell
# [1] 1 0 0

Or as a matrix:

sapply(dict, str_count, string=lll)
#      chris hell
# [1,]     2    1
# [2,]     1    0
# [3,]     0    0

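Instead of sum(grepl(dict[j],lll[i])), try sum(gregexpr(dict[j],lll[i])[[1]] > 0). gregexpr() returns the position of every match (or -1 if there is none), so counting the positive positions counts the matches.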
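The gregexpr() idea also works without the two loops. The following is a minimal base R sketch, not code from the original answer; count_one is a hypothetical helper name, and the texts are rebuilt with c() rather than list() so the block is self-contained.

```r
x <- "Hello my name is Christopher. Some people call me Chris"
y <- "Chris is an interesting person to be around"
z <- "Because he plays sports and likes statistics"

lll  <- tolower(c(x, y, z))
dict <- tolower(c("Chris", "Hell"))

# gregexpr() returns one vector of match positions per string,
# or -1 when nothing matches, so counting positions > 0 counts matches.
# count_one is a hypothetical helper, not part of any package.
count_one <- function(pattern, texts) {
  vapply(gregexpr(pattern, texts), function(pos) sum(pos > 0), integer(1))
}

# One row per text, one column per dictionary term.
mmm <- sapply(dict, count_one, texts = lll)
mmm
```

Here sapply() loops only over the dictionary terms, while gregexpr() handles all texts at once for each term.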

You can also do something like this:

count.matches <- function(pat, vec) sapply(regmatches(vec, gregexpr(pat, vec)), length)
mapply(count.matches, c('chris', 'hell'), list(lll))
#      chris hell
# [1,]     2    1
# [2,]     1    0
# [3,]     0    0
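One caveat with the regex-based counts above: each dictionary term is interpreted as a regular expression, so a term containing characters such as "." or "+" will not be matched literally. A small sketch, assuming literal matching is what you want, passes fixed = TRUE to gregexpr(); count_fixed and the sample texts are made up for illustration.

```r
# count_fixed is a hypothetical variant of count.matches that treats
# the pattern as a literal string instead of a regular expression.
count_fixed <- function(pat, vec) {
  lengths(regmatches(vec, gregexpr(pat, vec, fixed = TRUE)))
}

texts <- c("version 3.5 and 3x5", "no numbers here")

# As a regex, "3.5" would also match "3x5"; as a fixed string it does not.
literal_counts <- count_fixed("3.5", texts)
regex_counts   <- lengths(regmatches(texts, gregexpr("3.5", texts)))
```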

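# Note: llll and dict1 are not defined in the post; the output below suggests every text paired with every term.
do.call(rbind,Map(function(x,y)list(y,sum(gregexpr(y,x)[[1]] > 0)), llll,dict1))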
                                                        [,1]    [,2]
hello my name is christopher. some people call me chris "chris" 2   
chris is an interesting person to be around             "chris" 1   
because he plays sports and likes statistics            "chris" 0   
hello my name is christopher. some people call me chris "hell"  1   
chris is an interesting person to be around             "hell"  0   
because he plays sports and likes statistics            "hell"  0  

You can then use reshape to get what you want.
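A sketch of that reshape step, in base R. The answer does not show how its inputs were built, so this assumes every text is paired with every term via expand.grid(); the column names doc, term and count are made up for the example.

```r
lll <- tolower(c(
  "Hello my name is Christopher. Some people call me Chris",
  "Chris is an interesting person to be around",
  "Because he plays sports and likes statistics"
))
dict <- c("chris", "hell")

# Assumption: pair every text (by index) with every dictionary term.
long <- expand.grid(doc = seq_along(lll), term = dict,
                    stringsAsFactors = FALSE)

# Count matches for each (text, term) pair, as in the answer above.
long$count <- mapply(
  function(txt, pat) sum(gregexpr(pat, txt)[[1]] > 0),
  lll[long$doc], long$term,
  USE.NAMES = FALSE
)

# Long to wide: one row per text, one count column per term.
wide <- reshape(long, idvar = "doc", timevar = "term", direction = "wide")
wide
```

The wide result has the columns doc, count.chris and count.hell, which is the same layout as the matrix in the question.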

This uses the qdap package. The CRAN version should work fine, but you may want the development version.


library(qdap)
termco(c(x, y, z), 1:3, c('chris', 'hell'))

##   3 word.count     chris      hell
## 1 1         10 2(20.00%) 1(10.00%)
## 2 2          8 1(12.50%)         0
## 3 3          7         0         0

termco(c(x, y, z), 1:3, c('chris', 'hell'))$raw

##   3 word.count chris hell
## 1 1         10     2    1
## 2 2          8     1    0
## 3 3          7     0    0

