
Break a corpus into chunks of N words each in R

I need to break a corpus into chunks of N words each. Say this is my corpus:

corpus <- "I need to break this corpus into chunks of ~3 words each"

One way around this problem is turning the corpus into a dataframe, tokenizing it

library(tidytext)
library(dplyr)  # for %>%

corpus_df <- data.frame(text = corpus)
tokens <- corpus_df %>% unnest_tokens(word, text)

and then splitting the data frame row-wise using the code below (taken from here).

chunk <- 3
n <- nrow(tokens)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]  # group index per row
d <- split(tokens, r)                              # list of 3-row data frames

This works, but there must be a more direct way. Any takers?
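As a point of comparison, the same row-wise grouping can be done more directly in base R, with no tokenizing packages at all: split the string on whitespace and group words with `ceiling()`. A minimal sketch (the chunk size 3 and the example corpus are taken from the question):

```r
corpus <- "I need to break this corpus into chunks of ~3 words each"

# split on runs of whitespace, then assign each word a group number
words  <- strsplit(corpus, "\\s+")[[1]]
chunks <- split(words, ceiling(seq_along(words) / 3))
```

Unlike `tokenizers::chunk_text()` below, this keeps the original case and punctuation (`~3` survives intact).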

To split a string into chunks of N words you can use tokenizers::chunk_text():

corpus <- "I need to break this corpus into chunks of ~3 words each"

library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr)  # for %>% and group_split()

corpus %>%
  chunk_text(3)

[[1]]
[1] "i need to"

[[2]]
[1] "break this corpus"

[[3]]
[1] "into chunks of"

[[4]]
[1] "3 words each"

Note that chunk_text() lowercases the text and strips punctuation (the ~ in ~3 is gone in the output above). To return a data frame you can do:

corpus %>%
  chunk_text(3) %>%
  enframe(name = "group", value = "text") %>%
  unnest_tokens(word, text)

# A tibble: 12 x 2
   group word  
   <int> <chr> 
 1     1 i     
 2     1 need  
 3     1 to    
 4     2 break 
 5     2 this  
 6     2 corpus
 7     3 into  
 8     3 chunks
 9     3 of    
10     4 3     
11     4 words 
12     4 each  

If you want these as a list of data frames of 3 separate words:

 corpus %>%
   chunk_text(3) %>%
   enframe(name = "group", value = "text") %>%
   unnest_tokens(word, text) %>%
   group_split(group)
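Going the other way, if you later need each chunk back as a single string rather than as separate word rows, base R's paste() with collapse does it. A minimal sketch (the chunks list here is hard-coded for illustration, matching the first two chunks from the output above):

```r
# collapse each chunk (a character vector of words) back into one string
chunks <- list(c("i", "need", "to"), c("break", "this", "corpus"))
texts  <- vapply(chunks, paste, character(1), collapse = " ")
```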
