I have a data frame D containing a document title and its text, as in the following example:
document content
Doc 1 "This is an example of a document"
Doc 2 "And another one"
I need to use the tokenize function from the quanteda package to tokenize every document and then return the tokens listed by their original document title, as in this example:
document content
Doc 1 "This"
Doc 1 "This is"
Doc 1 "This is an"
Doc 1 "This is an example"
This is my current process to obtain a data frame with tokens from a list of documents:
library(textreadr)
library(quanteda)

# Read every file in "myDir" into a data frame
D <- textreadr::read_dir("myDir")
# Collapsing all contents into one string is where the document names are lost
D <- paste(D$content, collapse = " ")
# Strip unwanted punctuation ("+" is escaped because it is a regex quantifier)
strlist <- paste0(c(":", "\\)", "'", ";", "!", "\\+", "&", "<", ">", "\\(", "\\[", "\\]", "-", "#", ","), collapse = "|")
D <- gsub(strlist, "", D)
# Build all 1- to 10-word n-grams, joined by a space
t <- tokenize(D, what = "word",
              remove_numbers = FALSE, remove_punct = FALSE,
              remove_symbols = FALSE, remove_separators = TRUE,
              remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
              ngrams = 1:10, concatenator = " ", hash = TRUE,
              verbose = quanteda_options("verbose"))
t <- unlist(t, use.names = FALSE)
t1 <- data.frame(t)
However, I can't find an easy way to keep the document names after the tokenization process and list the tokens accordingly. Could anyone help with this?
R's list objects can take string indices like so:
my_list = list()
document_title = 'asdf.txt'
my_data = tokenize( etc... )
my_list[[document_title]] = my_data
Use your existing code, but assign your final data frame to a list like:
my_list[[document_title]] = data.frame(t)
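To make the named-list idea concrete, here is a minimal base R sketch. The sample data frame and the whitespace splitter simple_tokens are stand-ins I made up for illustration, not part of the question's code; the quanteda call would go where simple_tokens is used.

```r
# Hypothetical stand-in for the question's data frame D
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# Stand-in tokenizer (assumption): split on whitespace.
# Swap in the quanteda call here.
simple_tokens <- function(text) strsplit(text, "\\s+")[[1]]

# Store each document's tokens under its title
my_list <- list()
for (i in seq_len(nrow(D))) {
  my_list[[D$document[i]]] <- simple_tokens(D$content[i])
}

# Flatten the named list into one row per token, repeating the title
result <- data.frame(
  document = rep(names(my_list), lengths(my_list)),
  content  = unlist(my_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because the list names carry the titles, rep(names(my_list), lengths(my_list)) is enough to line every token up with its document again.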
Got to the bottom of it with a function. Here is the code for anyone interested:
library(quanteda)
library(tidyr)

# Tokenize one row of D. apply() passes each row as a character vector,
# so x[2] is the content column.
myFunction <- function(x) {
  b <- paste(x[2], collapse = " ")
  value <- tokenize(b, what = "word",
                    remove_numbers = FALSE, remove_punct = FALSE,
                    remove_symbols = FALSE, remove_separators = TRUE,
                    remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
                    ngrams = 1:10, concatenator = " ", hash = TRUE,
                    verbose = quanteda_options("verbose"))
  unlist(value, use.names = FALSE)
}

# One list element of tokens per document. If every document yields the
# same number of tokens, apply() returns a matrix instead of a list.
D$out <- apply(D, 1, myFunction)
# Expand the list column to one row per token, keeping the document title
D <- unnest(D, out)
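For anyone who wants to see the same per-document pattern without quanteda or tidyr, here is a rough base R sketch. word_ngrams is a hypothetical helper written for this example and only splits on whitespace, so it is not a replacement for quanteda's tokenizer. It uses lapply() over the content column, which always returns a list.

```r
# Hypothetical helper: all word n-grams of sizes 1..max_n for one string
word_ngrams <- function(text, max_n = 10) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  out <- character(0)
  for (n in seq_len(min(max_n, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(s) paste(words[s:(s + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical stand-in for the data frame D from the question
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# One character vector of n-grams per document
grams <- lapply(D$content, word_ngrams)

# One row per n-gram, with the title repeated for each of its n-grams
result <- data.frame(
  document = rep(D$document, lengths(grams)),
  content  = unlist(grams),
  stringsAsFactors = FALSE
)
```

The rep(D$document, lengths(grams)) step plays the role that unnest() plays above: it repeats each title once per n-gram produced from that document.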