I would like to build an n-gram 'letter document matrix', which uses letter sequences of up to n letters instead of the usual words. Here's a simplified example of what I'd like to achieve:
> letterDocumentMatrix(c('ea','ab','ca'), c('sea','abs','cab'))
[,sea] [,abs] [,cab]
[ea,] TRUE FALSE FALSE
[ab,] FALSE TRUE TRUE
[ca,] FALSE FALSE TRUE
Is there a name for this type of operation? And are there any prebuilt functions that handle this?
Finally, I tried outer with grepl but to no avail:
> outer(c('ea','ab','ca'), c('sea','abs','cab'), grepl)
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE FALSE FALSE
Warning message:
In FUN(X, Y, ...) :
argument 'pattern' has length > 1 and only the first element will be used
It seems that outer passes the whole of the first argument to grepl instead of one entry at a time, so grepl searches only for the first term, which is 'ea' in this case.
grepl() is not vectorized over its pattern argument, which is why you are not getting the correct result from outer(). Here is a possible solution using vapply().
vec <- c("sea", "abs", "cab") ## vector to search
pat <- c("ea", "ab", "ca") ## patterns we are searching for
## FUN.VALUE needs one logical per element of vec (not pat), so the lengths may differ
"rownames<-"(vapply(pat, grepl, logical(length(vec)), vec, fixed = TRUE), vec)
# ea ab ca
# sea TRUE FALSE FALSE
# abs FALSE TRUE FALSE
# cab FALSE TRUE TRUE
This obviously results in a transposed version of what you want. To get the matrix exactly as you desire, we can use lapply(), rbind() the result, then set the names.
xx <- do.call(rbind, lapply(pat, grepl, x = vec, fixed = TRUE))
dimnames(xx) <- list(pat, vec)
# sea abs cab
# ea TRUE FALSE FALSE
# ab FALSE TRUE TRUE
# ca FALSE FALSE TRUE
I would say to use t() on the vapply() result to transpose it, but it can be slow on large matrices.
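For reuse, the lapply() approach above can be wrapped in a small function. This is only a minimal sketch: the name `letterDocMatrixBase` is hypothetical, the argument order (patterns first, strings second) follows the example call in the question, and fixed = TRUE assumes the patterns are literal letter sequences rather than regular expressions. Empty inputs are not handled.

```r
## Sketch (hypothetical helper): rows are patterns, columns are the strings searched.
letterDocMatrixBase <- function(pat, vec) {
  ## one grepl() call per pattern; each call returns a logical vector over vec
  rows <- lapply(pat, grepl, x = vec, fixed = TRUE)
  ## stack the per-pattern results as rows, then label both dimensions
  m <- do.call(rbind, rows)
  dimnames(m) <- list(pat, vec)
  m
}

m <- letterDocMatrixBase(c("ea", "ab", "ca"), c("sea", "abs", "cab"))
m["ab", "cab"]   # TRUE, because "cab" contains "ab"
```

Because FUN.VALUE is not involved here, the number of patterns and the number of strings do not need to match.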
We could Vectorize the FUN in outer:
outer(c('ea','ab','ca'), c('sea','abs','cab'), Vectorize(grepl))
# [,1] [,2] [,3]
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE TRUE
#[3,] FALSE FALSE TRUE
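To see why the wrapper is needed: outer() expands both inputs to full length and calls FUN a single time, so FUN has to work elementwise on two equal-length vectors. Vectorize() provides that behaviour by building on mapply(). The sketch below illustrates both points; `show_args` and `pairwise` are hypothetical helpers written for this illustration only, and fixed = TRUE is an assumption that the patterns are literal strings.

```r
pat <- c("ea", "ab", "ca")
vec <- c("sea", "abs", "cab")

## Hypothetical helper: report how many elements outer() hands to FUN in one call.
show_args <- function(x, y) {
  message("FUN called once with ", length(x), " patterns and ", length(y), " strings")
  rep(NA, length(x))
}
invisible(outer(pat, vec, show_args))

## Roughly what Vectorize(grepl) does: test each (pattern, string) pair separately.
pairwise <- function(p, s) unname(mapply(grepl, p, s, MoreArgs = list(fixed = TRUE)))

## Naming the inputs makes outer() label the rows and columns of the result.
res <- outer(setNames(pat, pat), setNames(vec, vec), pairwise)
res
```

Since mapply() calls grepl() once per pair, this is convenient rather than fast; for many patterns, one grepl() call per pattern (as in the lapply() answer) does less work per call.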
There is a prebuilt function to handle this in the quanteda package for text analysis. It involves treating your letter sequences as a "dictionary" of regular expressions and building a document-feature matrix in which those regular expressions are identified in each "document". By tidying up a call to the dfm() function with a dictionary applied, you will get your exact return object. Here I have transposed it as in your question.
letterDocumentMatrix <- function(txts, pats) {
  # create a dictionary in which the key is the same as the entry
  pats <- quanteda::dictionary(sapply(pats, list))
  # name each "document" which is the text string to be searched
  names(txts) <- txts
  # interpret dictionary entries as regular expressions
  ret <- quanteda::dfm(txts, dictionary = pats, valuetype = "regex", verbose = FALSE)
  # transpose the matrix, coerce to dense logical matrix, remove dimnames
  ret <- t(as.matrix(ret > 0))
  names(dimnames(ret)) <- NULL
  ret
}
texts <- c('sea','abs','cab')
patterns <- c('ea','ab','ca')
letterDocumentMatrix(texts, patterns)
## sea abs cab
## ea TRUE FALSE FALSE
## ab FALSE TRUE TRUE
## ca FALSE FALSE TRUE
If you want this to work quickly and on large datasets, I suggest removing the third and second to last lines from the function.