
How do I tokenize a list of texts in R

I have a list of texts imported from 10 documents, such as:

library(quanteda)
library(readtext)
path <- "the working directory"
doc1 <- readtext(paste0(path, "/*_XXX.docx"))

View(doc1) looks like:

[[1]] character(1) 'some words'
[[2]] character(2) 'some words'
...

Now, I need to tokenize this list of texts, so I used:

tok_cov1 <- doc1 %>% 
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_tolower(keep_acronyms = TRUE) %>% 
  tokens_wordstem() %>% 
  tokens_remove(pattern = stopwords("en"))

The code ran without any error, but it did not tokenize anything; doc1 still looks the same as the untokenized original.

I am aware that specifying 'doc1' as 'doc1[[n]]' returns tokens from the corresponding text, e.g.,

tok_cov1 <- doc1[[1]] %>% 
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>% 
  tokens_tolower(keep_acronyms = TRUE) %>% 
  tokens_wordstem() %>% 
  tokens_remove(pattern = stopwords("en"))

However, I need it to work on every text instead of processing the texts one by one. Any help is greatly appreciated. Thank you.

The quanteda corpus() function works directly on objects created by readtext(). So in your example, simply use:

library(quanteda)
corpus(doc1) %>%
  tokens()

adding your preferred options for the tokenisation, of course.
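Putting it together with the options from your question, the whole pipeline might look like the sketch below. Here a small named character vector stands in for your readtext() result, since corpus() accepts both; with your data you would write corpus(doc1) instead:

```r
library(quanteda)

# toy stand-in for the readtext() result: corpus() also accepts a
# named character vector, so the same pipeline applies to doc1
txts <- c(doc1 = "Some words, numbers 123 and SYMBOLS!",
          doc2 = "More words to stem and filter.")

# build the corpus once, then tokenize every document in a single call
tok_all <- corpus(txts) %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  tokens_tolower(keep_acronyms = TRUE) %>%
  tokens_wordstem() %>%
  tokens_remove(pattern = stopwords("en"))

# tok_all is a tokens object with one entry per document
print(tok_all)
```

The key point is that tokens() operates on the whole corpus at once, so there is no need to loop over doc1[[n]] element by element.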
