Term Document Matrix for Letters in R

Question

I would like to build a n-gram 'letter document matrix', which basically uses letter sequences of up to n letters instead of the typical words. Here's a simplified example of what I'd like to achieve:

> letterDocumentMatrix(c('ea','ab','ca'), c('sea','abs','cab'))
    [,sea] [,abs] [,cab]
[ea,] TRUE   FALSE  FALSE  
[ab,] FALSE  TRUE   TRUE   
[ca,] FALSE  FALSE  TRUE

Is there a name for this type of operation? And are there any prebuilt functions that handle this?

Finally, I tried outer with grepl but to no avail:

> outer(c('ea','ab','ca'), c('sea','abs','cab'), grepl)
          [,1]  [,2]  [,3]
     [1,] TRUE  FALSE FALSE  
     [2,] TRUE  FALSE FALSE
     [3,] TRUE  FALSE FALSE  
     Warning message:
     In FUN(X, Y, ...) :
       argument 'pattern' has length > 1 and only the first element will be used

It seems that outer passes the whole of the first argument to grepl instead of one entry at a time, causing grepl to search only for the first term, which is 'ea' in this case.

Answer 1

grepl() is not vectorized over its pattern argument, which is why you are not getting the correct result from outer(). Here is a possible solution using vapply().

vec <- c("sea", "abs", "cab") ## vector to search
pat <- c("ea", "ab", "ca")    ## patterns we are searching for
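## each grepl() call returns one logical per element of vec, so FUN.VALUE must have length(vec)
res <- vapply(pat, grepl, logical(length(vec)), x = vec, fixed = TRUE)
rownames(res) <- vec
res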
#        ea    ab    ca
# sea  TRUE FALSE FALSE
# abs FALSE  TRUE FALSE
# cab FALSE  TRUE  TRUE

This results in a transposed version of what you want. To get the matrix exactly as you asked for, we can use lapply(), rbind() the results, then set the dimnames.

xx <- do.call(rbind, lapply(pat, grepl, x = vec, fixed = TRUE))
dimnames(xx) <- list(pat, vec)
xx
#      sea   abs   cab
# ea  TRUE FALSE FALSE
# ab FALSE  TRUE  TRUE
# ca FALSE FALSE  TRUE

I would suggest simply calling t() on the vapply() result to transpose it, but that can be slow on large matrices.
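For reuse, the lapply() and rbind() steps can be wrapped in a small helper that takes its arguments in the order used in the question. This is only a sketch: the name letter_document_matrix and its argument names are made up for illustration, and it assumes the patterns should be matched literally (fixed = TRUE) rather than as regular expressions.

```r
# Hypothetical helper (name and arguments are illustrative, not from any package).
# Rows are patterns, columns are texts; TRUE means the pattern occurs in the text.
letter_document_matrix <- function(pats, txts) {
  # one logical vector per pattern, each of length(txts)
  rows <- lapply(pats, grepl, x = txts, fixed = TRUE)
  # stack the vectors as rows, then label rows and columns
  m <- do.call(rbind, rows)
  dimnames(m) <- list(pats, txts)
  m
}

m <- letter_document_matrix(c("ea", "ab", "ca"), c("sea", "abs", "cab"))
m
```

Building the matrix row by row this way avoids the separate t() call, at the cost of assuming at least one pattern is supplied.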

Answer 2

We could Vectorize the FUN passed to outer:

outer(c('ea','ab','ca'), c('sea','abs','cab'), Vectorize(grepl))
#     [,1]  [,2]  [,3]
#[1,]  TRUE FALSE FALSE
#[2,] FALSE  TRUE  TRUE
#[3,] FALSE FALSE  TRUE
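Why this works: outer() calls FUN only once, with both inputs already expanded to full length, so FUN has to handle the element-wise pairs itself. Vectorize() wraps grepl() in mapply() to do that. A rough sketch that makes the single call visible and adds the dimnames from the question (the variable names pats, txts and calls are just for illustration):

```r
pats <- c("ea", "ab", "ca")
txts <- c("sea", "abs", "cab")

# Record the argument lengths that outer() hands to FUN on each call.
calls <- list()
invisible(outer(pats, txts, function(p, x) {
  calls[[length(calls) + 1L]] <<- c(length(p), length(x))
  rep(NA, length(p))  # outer() needs a result as long as its expanded inputs
}))
length(calls)  # a single call
calls[[1]]     # both arguments have length(pats) * length(txts) elements

# Element-wise version: one grepl() call per (pattern, text) pair.
m <- outer(pats, txts, Vectorize(grepl, vectorize.args = c("pattern", "x")))
dimnames(m) <- list(pats, txts)
m
```

Because Vectorize() ends up calling grepl() once per cell, this is convenient rather than fast; for many patterns and texts the per-pattern approaches are likely to scale better.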

Answer 3

There is a prebuilt function for this in the quanteda package for text analysis. It involves treating your letter sequences as a "dictionary" of regular expressions and building a document-feature matrix that records where those regular expressions are found in each "document". By tidying up the result of a call to dfm() with a dictionary applied, you get exactly the object you asked for. Here I have transposed it to match your question.

letterDocumentMatrix <- function(txts, pats) {
    # create a dictionary in which the key is the same as the entry
    pats <- quanteda::dictionary(sapply(pats, list))
    # name each "document" which is the text string to be searched
    names(txts) <- txts
    # interpret dictionary entries as regular expressions
    ret <- quanteda::dfm(txts, dictionary = pats, valuetype = "regex", verbose = FALSE)
    # transpose the matrix, coerce to dense logical matrix, remove dimnames
    ret <- t(as.matrix(ret > 0))
    names(dimnames(ret)) <- NULL
    ret
}

texts <- c('sea','abs','cab')
patterns <- c('ea','ab','ca')

letterDocumentMatrix(texts, patterns)
##      sea   abs   cab
## ea  TRUE FALSE FALSE
## ab FALSE  TRUE  TRUE
## ca FALSE FALSE  TRUE

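If you want this to work quickly and on large datasets, I suggest removing the third- and second-to-last lines of the function body (the transpose with dense coercion, and the dimnames cleanup), so that the result stays the sparse object returned by dfm().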

3 answers

solution1 3 ACCEPTED 2015-10-26 02:01:38

solution2 1 2015-10-26 02:12:09

solution3 0 2015-10-26 07:22:02