
How to take the first 25 words of each corpus (in R)?

I'm guessing that the technique for this is similar to taking the first N characters from any data frame, whether or not it is a corpus.

My attempt:

create.greetings <- function(corpus, create_df = FALSE) {
  for(i in length(Charlotte.corpus.raw)) {
    Doc1<-Charlotte.corpus.raw[i]
    Word1<-Doc1[1:25]
    Greetings[i]<-Word1
  }
  return(VCorpus)
}

Here, Greetings begins as a corpus with n=6. I couldn't figure out how to create an empty corpus, or a corpus that is large enough to hold the results. I have a corpus of 200 documents here (Charlotte.corpus.raw). Unlike vectors (and, by extension, data frames), there doesn't seem to be an easy way to create an empty corpus.

Part of the problem is that R doesn't seem to recognize a "document" class. It only recognizes a corpus. That is, to R, a single document is a corpus of n=1.

Reproducible sample: you will need the 'tm', 'dplyr', and 'NLP' packages, as well as more common R packages.

library(tm)    # DirSource, VCorpus, tm_map, content_transformer
library(dplyr) # provides the %>% pipe

read.corpus <- function(directory, pattern = "", to.lower = TRUE) {
  # Read files and create a `VCorpus` object
  corpus <- DirSource(directory = directory, pattern = pattern) %>%
    VCorpus()
  # Lowercase text
  if (to.lower) corpus <- tm_map(corpus, content_transformer(tolower))
  return(corpus)
}

Run the function on any directory that contains a few .txt documents, and you will have a corpus to work with. Then replace Charlotte.corpus.raw above with whatever you named your corpus.

Each row of greetings will contain the first 25 words of each document:

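greetings <- c()
for (i in seq_along(corpus)) {
  # corpus[[i]] is a single document; join its lines into one string
  text <- paste(content(corpus[[i]]), collapse = " ")
  # Split on whitespace to get individual words
  words <- unlist(strsplit(trimws(text), "\\s+"))
  # Keep the first 25 words (padded with NA if the document is shorter)
  greetings <- rbind(greetings, words[1:25])
}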
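
As a variation on the loop above, here is a minimal sketch that collects the results in a plain list rather than a matrix or a pre-built corpus, so no empty corpus is needed. The helper `first_n_words` and the stand-in `docs` list are hypothetical names made up for illustration, not part of tm, and splitting on whitespace is only an assumption about what counts as a "word".

```r
# Hypothetical helper (not part of tm): first `n` words of one document.
# `text` is a character vector of lines, such as what content(doc)
# returns for a plain-text tm document.
first_n_words <- function(text, n = 25) {
  joined <- trimws(paste(text, collapse = " "))
  if (!nzchar(joined)) return(character(0))
  words <- strsplit(joined, "\\s+")[[1]]
  head(words, n)
}

# Stand-in for a corpus: one character vector of lines per document.
docs <- list(
  c("Dear Charlotte,", "thank you for your letter."),
  "Hello"
)

# One list entry per document; shorter documents simply yield fewer words.
greetings <- lapply(docs, first_n_words, n = 3)
```

With a real tm corpus of plain-text documents, something like `lapply(seq_along(corpus), function(i) first_n_words(content(corpus[[i]])))` should give the same kind of list. Note the double brackets: `corpus[[i]]` extracts a single document, whereas `corpus[i]` returns a corpus of length 1.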
