
How do I tokenize a list of texts in R

I have a list of texts imported from 10 documents, such as:

library(quanteda)
library(readtext)
path <- "the working directory"
doc1 <- readtext(paste0(path, "/*_XXX.docx"))

View(doc1) looks like:

[[1]] character(1) 'some words'
[[2]] character(2) 'some words'
...

Now, I need to tokenize this list of texts, so I used:

tok_cov1 <- doc1 %>% 
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_tolower(keep_acronyms = TRUE) %>% 
  tokens_wordstem() %>% 
  tokens_remove(pattern = stopwords("en"))

The code ran without any error, but it did not tokenize anything; doc1 still looks the same as the untokenized original.

I am aware that specifying 'doc1' as 'doc1[[n]]' returns tokens from the corresponding text, e.g.,

tok_cov1 <- doc1[[1]] %>% 
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_tolower(keep_acronyms = TRUE) %>% 
  tokens_wordstem() %>% 
  tokens_remove(pattern = stopwords("en"))

However, I need it to work on every text instead of processing the texts one by one. Any help is greatly appreciated. Thank you.

The quanteda corpus() function works directly on objects created by readtext(). So in your example, simply use:

library(quanteda)
corpus(doc1) %>%
  tokens()

adding your preferred options for the tokenisation, of course.
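Putting it together with the options from your question, the whole pipeline might look like the sketch below. Here a small named character vector stands in for your readtext() result, since corpus() accepts both; with your data you would write corpus(doc1) instead:

```r
library(quanteda)

# toy stand-in for the readtext() result: corpus() also accepts a
# named character vector, so the same pipeline applies to doc1
txts <- c(doc1 = "Some words, numbers 123 and SYMBOLS!",
          doc2 = "More words to stem and filter.")

# build the corpus once, then tokenize every document in a single call
tok_all <- corpus(txts) %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  tokens_tolower(keep_acronyms = TRUE) %>%
  tokens_wordstem() %>%
  tokens_remove(pattern = stopwords("en"))

# tok_all is a tokens object with one entry per document
print(tok_all)
```

The key point is that tokens() operates on the whole corpus at once, so there is no need to loop over doc1[[n]] element by element.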
