
Term Document Matrix for Letters in R

I would like to build a n-gram 'letter document matrix', which basically uses letter sequences of up to n letters instead of the typical words. Here's a simplified example of what I'd like to achieve:

> letterDocumentMatrix(c('ea','ab','ca'), c('sea','abs','cab'))
    [,sea] [,abs] [,cab]
[ea,] TRUE   FALSE  FALSE  
[ab,] FALSE  TRUE   TRUE   
[ca,] FALSE  FALSE  TRUE

Is there a name for this type of operation? And are there any prebuilt functions that handle this?

Finally, I tried outer with grepl but to no avail:

> outer(c('ea','ab','ca'), c('sea','abs','cab'), grepl)
          [,1]  [,2]  [,3]
     [1,] TRUE  FALSE FALSE  
     [2,] TRUE  FALSE FALSE
     [3,] TRUE  FALSE FALSE  
     Warning message:
     In FUN(X, Y, ...) :
       argument 'pattern' has length > 1 and only the first element will be used

It seems that outer passes its whole first argument to grepl in a single call, instead of one entry at a time, so grepl searches only for the first term, which is 'ea' in this case.
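One way to check this behaviour is to hand outer a function that records how it was called. The snippet below is only a sketch: the names n_calls, arg_len and pairs are made up for illustration.

```r
pat <- c("ea", "ab", "ca")
vec <- c("sea", "abs", "cab")

n_calls <- 0   # how many times outer() invokes the function
arg_len <- NA  # length of the first argument in that call

pairs <- outer(pat, vec, function(p, x) {
  n_calls <<- n_calls + 1
  arg_len <<- length(p)
  paste0(p, "|", x)  # vectorized, so it handles all pairs at once
})

n_calls  # 1: a single call covers every pattern/string pair
arg_len  # 9: both arguments arrive already expanded to length 3 * 3
pairs    # each cell shows the pattern and string that were paired
```

Because the function is called once with two length-9 vectors, it has to be vectorized over both arguments, and grepl is not vectorized over its pattern.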

grepl() is not vectorized over its pattern argument, which is why you are not getting the correct result from outer(). Here is a possible solution using vapply().

vec <- c("sea", "abs", "cab") ## vector to search
pat <- c("ea", "ab", "ca")    ## patterns we are searching for
## FUN.VALUE must have one entry per element of vec, since grepl() returns one result per string
"rownames<-"(vapply(pat, grepl, logical(length(vec)), vec, fixed = TRUE), vec)
#        ea    ab    ca
# sea  TRUE FALSE FALSE
# abs FALSE  TRUE FALSE
# cab FALSE  TRUE  TRUE

This gives a transposed version of what you want. To get the matrix exactly as you described, we can use lapply(), rbind() the results, then set the dimnames.

xx <- do.call(rbind, lapply(pat, grepl, x = vec, fixed = TRUE))
dimnames(xx) <- list(pat, vec)
xx
#      sea   abs   cab
# ea  TRUE FALSE FALSE
# ab FALSE  TRUE  TRUE
# ca FALSE FALSE  TRUE

You could also call t() on the vapply() result to transpose it, but that can be slow on large matrices.
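For small inputs the t() route is compact enough to be worth showing. This is a minimal sketch under the same assumptions as above (literal matching via fixed = TRUE); the name m is arbitrary.

```r
vec <- c("sea", "abs", "cab")  # strings to search
pat <- c("ea", "ab", "ca")     # patterns to look for

# vapply() returns one column per pattern; t() turns patterns into rows
m <- t(vapply(pat, grepl, logical(length(vec)), x = vec, fixed = TRUE))
colnames(m) <- vec
m
```

The row names come from vapply() naming its result after the patterns, so only the column names need to be set by hand.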

We could also Vectorize() the function passed to outer():

outer(c('ea','ab','ca'), c('sea','abs','cab'), Vectorize(grepl))
#     [,1]  [,2]  [,3]
#[1,]  TRUE FALSE FALSE
#[2,] FALSE  TRUE  TRUE
#[3,] FALSE FALSE  TRUE
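If you also want the labelled layout from the question and literal (non-regex) matching, one option is to vectorize a small wrapper instead of grepl itself. This is a sketch, and match_one is a made-up helper name.

```r
pat <- c("ea", "ab", "ca")
vec <- c("sea", "abs", "cab")

# Hypothetical wrapper: tests one pattern against one string, literally
match_one <- function(p, x) grepl(p, x, fixed = TRUE)

m <- outer(pat, vec, Vectorize(match_one))
dimnames(m) <- list(pat, vec)  # label rows by pattern, columns by string
m
```

Vectorize() calls the wrapper once per pattern/string pair through mapply(), so this is convenient rather than fast.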

The quanteda package for text analysis has a prebuilt function that handles this. The idea is to treat your letter sequences as a "dictionary" of regular expressions and build a document-feature matrix that records where those regular expressions occur in each "document". Calling dfm() with the dictionary applied and tidying up the result gives you exactly the return object you asked for. Here I have transposed it to match the layout in your question.

letterDocumentMatrix <- function(txts, pats) {
    # create a dictionary in which the key is the same as the entry
    pats <- quanteda::dictionary(sapply(pats, list))
    # name each "document" which is the text string to be searched
    names(txts) <- txts
    # interpret dictionary entries as regular expressions
    ret <- quanteda::dfm(txts, dictionary = pats, valuetype = "regex", verbose = FALSE)
    # transpose the matrix, coerce to dense logical matrix, remove dimnames
    ret <- t(as.matrix(ret > 0))
    names(dimnames(ret)) <- NULL
    ret
}

texts <- c('sea','abs','cab')
patterns <- c('ea','ab','ca')

letterDocumentMatrix(texts, patterns)
##      sea   abs   cab
## ea  TRUE FALSE FALSE
## ab FALSE  TRUE  TRUE
## ca FALSE FALSE  TRUE

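If you want this to run quickly on large datasets, I suggest removing the third- and second-to-last lines of the function body (the transpose and dense-matrix conversion, and the dimnames cleanup), so that the function returns the dfm directly.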
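None of the approaches above generates the letter sequences themselves. To cover the "up to n letters" part of the question, a base R sketch could look like the following. letter_ngrams is a hypothetical helper, not a function from any package, and it matches literally with fixed = TRUE.

```r
# Hypothetical helper: all distinct letter sequences of length 1..n in `texts`
letter_ngrams <- function(texts, n) {
  grams <- lapply(texts, function(s) {
    len <- nchar(s)
    unlist(lapply(seq_len(min(n, len)), function(k) {
      starts <- seq_len(len - k + 1)        # every start position for length k
      substring(s, starts, starts + k - 1)  # the k-letter windows
    }))
  })
  unique(unlist(grams))
}

texts <- c("sea", "abs", "cab")
grams <- letter_ngrams(texts, 2)

# One row per letter sequence, one column per text
ldm <- do.call(rbind, lapply(grams, grepl, x = texts, fixed = TRUE))
dimnames(ldm) <- list(grams, texts)
ldm
```

The number of rows grows quickly with n and with the number of texts, so a sparse representation may be preferable for large inputs.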
