How to tokenize documents in R and list tokens by original document title?

I have a data frame D containing a document title and the text, as in the following example:

document   content
Doc 1      "This is an example of a document"
Doc 2      "And another one"

I need to use the tokenize function from the quanteda package to tokenize every document, and then return the tokens listed against their original document title, as in this example:

document   content
    Doc 1      "This"
    Doc 1      "This is"
    Doc 1      "This is an"
    Doc 1      "This is an example" 

This is my current process to obtain a data frame with tokens from a list of documents:

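library(textreadr)
library(quanteda)

# Read every file in the directory: one row per document (document, content)
D <- textreadr::read_dir("myDir")
# Collapse all documents into a single string (the document titles are lost here)
D <- paste(D$content, collapse = " ")
# Strip unwanted punctuation and symbols; "+" is escaped so it is matched literally
strlist <- paste0(c(":","\\)",":","'",";","!","\\+","&","<",">","\\(","\\[","\\]","-","#",","), collapse = "|")
D <- gsub(strlist, "", D)
# Tokenize into word n-grams of length 1 to 10, joined by a space
t <- tokenize(D, what = "word",
              remove_numbers = FALSE, remove_punct = FALSE,
              remove_symbols = FALSE, remove_separators = TRUE,
              remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
              ngrams = 1:10, concatenator = " ", hash = TRUE,
              verbose = quanteda_options("verbose"))
# Flatten the result into a one-column data frame of tokens
t <- unlist(t, use.names = FALSE)
t1 <- data.frame(t)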

However, I can't find an easy way to keep the document names after the tokenization process and list the tokens accordingly. Could anyone help with this?

R's list objects can be indexed by strings, like so:

my_list = list()

document_title = 'asdf.txt'
my_data = tokenize( etc... )
my_list[[document_title]] = my_data

Use your existing code, but run it once per document and assign each final data frame to the list under that document's title:

my_list[[document_title]] = data.frame(t)
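
To make the named-list idea concrete, here is a minimal base R sketch. The sample data frame and the `simple_tokenize` helper are made up for illustration (a plain whitespace split standing in for the quanteda call), and it stores character vectors in the list rather than data frames, so treat it as one possible way to go from the list to a long table, not the only one.

```r
# Hypothetical input mirroring the question's data frame D
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# Stand-in tokenizer (whitespace split); swap in the real tokenizer call here
simple_tokenize <- function(text) strsplit(text, "\\s+")[[1]]

# One list element per document, keyed by the document title
my_list <- list()
for (i in seq_len(nrow(D))) {
  my_list[[D$document[i]]] <- simple_tokenize(D$content[i])
}

# Flatten to long format: repeat each title once per token it produced
tokens_long <- data.frame(
  document = rep(names(my_list), lengths(my_list)),
  content  = unlist(my_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

The `rep(names(my_list), lengths(my_list))` step is what carries the titles through: each title is repeated as many times as its document has tokens, so the two columns line up row by row.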

I got to the bottom of it with a function. Here is the code for anyone interested:

library(quanteda)
library(tidyr)

# Tokenize one row of D. apply() passes each row as a character vector,
# so x[2] is the content column.
myFunction <- function(x) {
  b <- paste(x[2], collapse = " ")

  # Word n-grams of length 1 to 10, joined by a space
  value <- tokenize(b, what = "word",
                    remove_numbers = FALSE, remove_punct = FALSE,
                    remove_symbols = FALSE, remove_separators = TRUE,
                    remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
                    ngrams = 1:10, concatenator = " ", hash = TRUE,
                    verbose = quanteda_options("verbose"))

  unlist(value, use.names = FALSE)
}

# Store each document's tokens in a list-column next to its title
D$out <- apply(D, 1, myFunction)

# Expand the list-column to one row per token, keeping the document title
D <- unnest(D, out)
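
For anyone on a recent quanteda release, where `tokens()` and `tokens_ngrams()` take the place of the older `tokenize()` call, a rough equivalent might look like the sketch below. The sample data frame is made up to match the question, and the column names `document` and `content` are assumed from it. Note that this lists every 1- to 10-gram per document, which is what `ngrams = 1:10` asks for, not only the leading prefixes shown in the question's expected output.

```r
library(quanteda)

# Hypothetical input mirroring the question's data frame D
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# Build a corpus whose document names come from the title column
corp <- corpus(D, docid_field = "document", text_field = "content")

# Tokenize into words, then form n-grams of length 1 to 10 joined by a space
toks <- tokens(corp, what = "word")
toks <- tokens_ngrams(toks, n = 1:10, concatenator = " ")

# as.list() gives a list of character vectors named by document
tok_list <- as.list(toks)

# Long format: one row per n-gram, labelled with its source document
tokens_long <- data.frame(
  document = rep(names(tok_list), lengths(tok_list)),
  content  = unlist(tok_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because the corpus carries the document names, there is no need to loop over rows or collapse the texts first; the titles come back as the names of the token list.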
