
How to tokenize documents in R and list tokens by original document title?

I have data frame D containing a document title and the text as in the following example:

document   content
Doc 1      "This is an example of a document"
Doc 2      "And another one"

I need to use the tokenize function from the quanteda package to tokenize every document and then return the tokens listed under their original document titles, as in this example:

document   content
Doc 1      "This"
Doc 1      "This is"
Doc 1      "This is an"
Doc 1      "This is an example"

This is my current process to obtain a data frame with tokens from a list of documents:

require(textreadr)
D<-textreadr::read_dir("myDir")
D<-paste(D$content,collapse=" ")
# punctuation to strip before tokenizing (escaped for the regex)
strlist <- paste0(c(":","\\)","'",";","!","+","&","<",">","\\(","\\[","\\]","-","#",","), collapse = "|")
D <- gsub(strlist, "", D)
library(quanteda)
# `what` accepts a single value; when given the full choices vector,
# only the first element ("word") is actually used
t <- tokenize(D, what = "word",
              remove_numbers = FALSE, remove_punct = FALSE,
              remove_symbols = FALSE, remove_separators = TRUE,
              remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
              ngrams = 1:10, concatenator = " ", hash = TRUE,
              verbose = quanteda_options("verbose"))
t <- unlist(t, use.names = FALSE)
t1 <- data.frame(t)

However, I can't find an easy way to keep the document names after the tokenization process and list the tokens accordingly. Could anyone help with this?

R's list objects accept character (string) indices, so you can store each result under its document title like so:

my_list = list()

document_title = 'asdf.txt'
my_data = tokenize( etc... )
my_list[[document_title]] = my_data

Use your existing code, but assign each final data frame into the list under its document title:

my_list[[document_title]] = data.frame(t)
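To make that concrete, here is a minimal base-R sketch of the loop described above. simple_tokenize is a hypothetical stand-in for the quanteda call, and the titles and texts are taken from the question's example:

```r
# Minimal base-R sketch: tokenize each document separately, store the
# result in a list keyed by its title, then bind into one data frame.
# simple_tokenize() is a hypothetical stand-in for the quanteda call.
simple_tokenize <- function(txt) strsplit(txt, "\\s+")[[1]]

docs <- list("Doc 1" = "This is an example of a document",
             "Doc 2" = "And another one")

my_list <- list()
for (title in names(docs)) {
  my_list[[title]] <- data.frame(document = title,
                                 token    = simple_tokenize(docs[[title]]),
                                 stringsAsFactors = FALSE)
}

# One data frame with the document name next to every token
result <- do.call(rbind, my_list)
```

Because the title goes into its own column before binding, the document name survives the tokenization step.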

Got to the bottom of it with a function. Here is the code for anyone interested:

myFunction <- function(x) {
  # x is one row of D; x[2] is the content column
  b <- paste(x[2], collapse = " ")

  require(quanteda)
  # `what` accepts a single value; "word" is what the choices vector resolved to
  value <- tokenize(b, what = "word",
                    remove_numbers = FALSE, remove_punct = FALSE,
                    remove_symbols = FALSE, remove_separators = TRUE,
                    remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
                    ngrams = 1:10, concatenator = " ", hash = TRUE,
                    verbose = quanteda_options("verbose"))

  unlist(value, use.names = FALSE)
}

D$out <- apply(D, 1, myFunction)

library(tidyr)
# name the list-column explicitly so newer tidyr versions (>= 1.0) know what to unnest
D <- unnest(D, cols = out)
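For reference, newer quanteda releases deprecate tokenize() in favour of tokens(), which carries document names through automatically. The following is a sketch under that assumption, using the question's example data:

```r
library(quanteda)

# Example data from the question
D <- data.frame(document = c("Doc 1", "Doc 2"),
                content  = c("This is an example of a document",
                             "And another one"),
                stringsAsFactors = FALSE)

# Build a corpus so each text keeps its document name
corp <- corpus(D, docid_field = "document", text_field = "content")

# tokens() replaces the deprecated tokenize(); then generate 1- to 10-grams
toks <- tokens(corp, what = "word")
toks <- tokens_ngrams(toks, n = 1:10, concatenator = " ")

# Flatten to a data frame keyed by document name
tl  <- as.list(toks)
out <- data.frame(document = rep(names(tl), lengths(tl)),
                  token    = unlist(tl, use.names = FALSE),
                  stringsAsFactors = FALSE)
```

This avoids the apply/unnest round trip entirely, since the tokens object never loses track of which document each token came from.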
