How to tokenize documents in R and list tokens by original document title?
My data frame D contains document titles and text, as in the following example:
document content
Doc 1 "This is an example of a document"
Doc 2 "And another one"
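I need to tokenize each document with the tokenize function from the quanteda package, and then return the tokens listed against their original document title, as in the following example: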
document content
Doc 1 "This"
Doc 1 "This is"
Doc 1 "This is an"
Doc 1 "This is an example"
This is my current process for getting a data frame of tokens from a list of documents:
require(textreadr)
D<-textreadr::read_dir("myDir")
D<-paste(D$content,collapse=" ")
strlist<-paste0(c(":","\\)",":","'",";","!","+","&","<",">","\\(","\\[","\\]","-","#",","),collapse = "|")
D<-gsub(strlist, "", D)
library(quanteda)
t<-tokenize(D, what = c("word","sentence", "character","fastestword", "fasterword"),
remove_numbers = FALSE, remove_punct = FALSE,
remove_symbols = FALSE, remove_separators = TRUE,
remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
ngrams = 1:10, concatenator = " ", hash = TRUE,
verbose = quanteda_options("verbose"))
t<-unlist(t, use.names=FALSE)
t1<-data.frame(t)
However, I cannot find an easy way to keep the document names after tokenization and list the tokens accordingly. Can anyone help?
R list objects can be indexed by strings, like this:
my_list = list()
document_title = 'asdf.txt'
my_data = tokenize( etc... )
my_list[[document_title]] = my_data
Use your existing code, but assign the final data frame to a list entry like this:
my_list[[document_title]] = data.frame(t)
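As a rough illustration of the named-list idea, here is a minimal sketch in base R. The data frame D below is a stand-in built from the example in the question, and simple_tokens is a hypothetical placeholder that only splits on whitespace; it is not the quanteda tokenizer and would be swapped for the real tokenization call.

```r
# Stand-in for the data frame D from the question (assumed column names).
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# Hypothetical placeholder tokenizer: split on whitespace with base R.
simple_tokens <- function(text) strsplit(text, "\\s+")[[1]]

# Build a list whose names are the document titles.
my_list <- list()
for (i in seq_len(nrow(D))) {
  my_list[[D$document[i]]] <- simple_tokens(D$content[i])
}

# Flatten the named list into a two-column data frame,
# repeating each title once per token it produced.
result <- data.frame(
  document = rep(names(my_list), lengths(my_list)),
  content  = unlist(my_list, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because the titles are the list names, they survive tokenization and can be repeated alongside each token when the list is flattened.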
I got to the bottom of it by using a function. Here is the code for anyone interested:
myFunction <- function(x){
b <- x[2]
b<-paste(b,collapse=" ")
require(quanteda)
value <- tokenize(b, what = c("word","sentence", "character","fastestword", "fasterword"),
remove_numbers = FALSE, remove_punct = FALSE,
remove_symbols = FALSE, remove_separators = TRUE,
remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
ngrams = 1:10, concatenator = " ", hash = TRUE,
verbose = quanteda_options("verbose"))
value<-unlist(value, use.names=FALSE)
return(value)
}
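# apply() passes each row as a character vector, so x[2] is the content column
D$out <- apply(D, 1, myFunction)
library(tidyr)
# Expand the list column so each token gets its own row next to its title
D <- unnest(D, out)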
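For readers who want to see what the per-row function plus unnest step produces without installing anything, the same shape can be sketched in base R. This is only an approximation under stated assumptions: D is a stand-in for the question's data frame, and word_ngrams is a hypothetical helper that builds word n-grams from whitespace-split text, which will not match quanteda's tokenization rules exactly.

```r
# Stand-in for the data frame D from the question (assumed column names).
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all word n-grams of length 1..n_max, joined by spaces.
word_ngrams <- function(text, n_max = 10) {
  words <- strsplit(text, "\\s+")[[1]]
  out <- character(0)
  for (n in seq_len(min(n_max, length(words)))) {
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(s) paste(words[s:(s + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# One character vector of n-grams per document, in row order.
ngrams_by_doc <- lapply(D$content, word_ngrams)

# Repeat each title once per n-gram, mirroring what unnest() does.
result <- data.frame(
  document = rep(D$document, lengths(ngrams_by_doc)),
  content  = unlist(ngrams_by_doc),
  stringsAsFactors = FALSE
)
```

Using lapply over the content column always returns a list, which avoids relying on apply returning a list only when the documents yield different numbers of tokens.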