
How to tokenize documents in R and list tokens by original document title?

I have data frame D containing a document title and the text as in the following example:

document   content
Doc 1      "This is an example of a document"
Doc 2      "And another one"

I need to use the tokenize function from the quanteda package to tokenize every document and then return the tokens listed under their original document titles, as in this example:

document   content
Doc 1      "This"
Doc 1      "This is"
Doc 1      "This is an"
Doc 1      "This is an example"

This is my current process to obtain a data frame with tokens from a list of documents:

require(textreadr)
D<-textreadr::read_dir("myDir")
D<-paste(D$content,collapse=" ")
# punctuation to strip before tokenizing (escaped for the regex)
strlist <- paste0(c(":","\\)","'",";","!","+","&","<",">","\\(","\\[","\\]","-","#",","), collapse = "|")
D <- gsub(strlist, "", D)
library(quanteda)
# `what` accepts a single value; when given the full choices vector,
# only the first element ("word") is actually used
t <- tokenize(D, what = "word",
              remove_numbers = FALSE, remove_punct = FALSE,
              remove_symbols = FALSE, remove_separators = TRUE,
              remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
              ngrams = 1:10, concatenator = " ", hash = TRUE,
              verbose = quanteda_options("verbose"))
t <- unlist(t, use.names = FALSE)
t1 <- data.frame(t)

However, I can't find an easy way to keep the document names after the tokenization process and list the tokens accordingly. Could anyone help with this?

R's list objects accept character (string) indices, so you can store each result under its document title like so:

my_list = list()

document_title = 'asdf.txt'
my_data = tokenize( etc... )
my_list[[document_title]] = my_data

Use your existing code, but assign each final data frame into the list under its document title:

my_list[[document_title]] = data.frame(t)
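To make that concrete, here is a minimal base-R sketch of the loop described above. simple_tokenize is a hypothetical stand-in for the quanteda call, and the titles and texts are taken from the question's example:

```r
# Minimal base-R sketch: tokenize each document separately, store the
# result in a list keyed by its title, then bind into one data frame.
# simple_tokenize() is a hypothetical stand-in for the quanteda call.
simple_tokenize <- function(txt) strsplit(txt, "\\s+")[[1]]

docs <- list("Doc 1" = "This is an example of a document",
             "Doc 2" = "And another one")

my_list <- list()
for (title in names(docs)) {
  my_list[[title]] <- data.frame(document = title,
                                 token    = simple_tokenize(docs[[title]]),
                                 stringsAsFactors = FALSE)
}

# One data frame with the document name next to every token
result <- do.call(rbind, my_list)
```

Because the title goes into its own column before binding, the document name survives the tokenization step.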

Got to the bottom of it with a function. Here is the code for anyone interested:

myFunction <- function(x) {
  # x is one row of D; x[2] is the content column
  b <- paste(x[2], collapse = " ")

  require(quanteda)
  # `what` accepts a single value; "word" is what the choices vector resolved to
  value <- tokenize(b, what = "word",
                    remove_numbers = FALSE, remove_punct = FALSE,
                    remove_symbols = FALSE, remove_separators = TRUE,
                    remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
                    ngrams = 1:10, concatenator = " ", hash = TRUE,
                    verbose = quanteda_options("verbose"))

  unlist(value, use.names = FALSE)
}

D$out <- apply(D, 1, myFunction)

library(tidyr)
# name the list-column explicitly so newer tidyr versions (>= 1.0) know what to unnest
D <- unnest(D, cols = out)
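For reference, newer quanteda releases deprecate tokenize() in favour of tokens(), which carries document names through automatically. The following is a sketch under that assumption, using the question's example data:

```r
library(quanteda)

# Example data from the question
D <- data.frame(document = c("Doc 1", "Doc 2"),
                content  = c("This is an example of a document",
                             "And another one"),
                stringsAsFactors = FALSE)

# Build a corpus so each text keeps its document name
corp <- corpus(D, docid_field = "document", text_field = "content")

# tokens() replaces the deprecated tokenize(); then generate 1- to 10-grams
toks <- tokens(corp, what = "word")
toks <- tokens_ngrams(toks, n = 1:10, concatenator = " ")

# Flatten to a data frame keyed by document name
tl  <- as.list(toks)
out <- data.frame(document = rep(names(tl), lengths(tl)),
                  token    = unlist(tl, use.names = FALSE),
                  stringsAsFactors = FALSE)
```

This avoids the apply/unnest round trip entirely, since the tokens object never loses track of which document each token came from.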
