Create a Corpus from a List of File Paths in R

Question

I have 1030 individual .txt files in a directory which represent all the participants in a research study.

I have successfully created a corpus for use with the tm package in R out of all the files in the directory.

Now I'm trying to create corpora from numerous subsets of these files. For example, one corpus of all the female authors and one of the male authors.

I was hoping to be able to pass the Corpus function subsets of a list of file paths, but this has not worked out.

Any help is appreciated. Here is an example to build from:

pathname <- c("C:/Desktop/Samples")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = T) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/Desktop/Samples/author1.txt","C:/Desktop/Samples/author2.txt","C:/Desktop/Samples/author3.txt","C:/Desktop/Samples/author4.txt","C:/Desktop/Samples/author5.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Here are the things I've tried to create a corpus from the subsetted list. None of these work.

women_corpus <- Corpus(women.files)
women_corpus <- Corpus(DirSource(women.files))
women_corpus <- Corpus(DirSource(unlist(women.files)))

The subsets I need to create are rather elaborate, so I can't easily make new folders containing only the text files of interest for each corpus.

Answer 1

I think this does what you want.

library(tm)

pathname <- c("C:/data/test")

study.files <- list.files(path = pathname, pattern = NULL, all.files = T, full.names = T, recursive = T, ignore.case = T, include.dirs = F) 

### This gives me a character vector that is equivalent to:

study.files <- c("C:/data/test/test1/test1.txt",
                 "C:/data/test/test2/test2.txt",
                 "C:/data/test/test3/test3.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Read each file and build the corpus from its contents

women_corpus <- NULL
# read each file as a one-column table (assumes tab-separated text with no header)
nedir <- lapply(women.files, function (filename) read.table(filename, sep="\t", stringsAsFactors = F))
# keep the first column (V1) of each table, giving one character vector per file
hepsi <- lapply( nedir, function(x) x$V1)
women_corpus <- Corpus(VectorSource(hepsi))
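
A variation on the same idea, as a minimal sketch: read.table can trip over quotes or '#' characters in free text, so this version reads each file with readLines and collapses it into one string per document. The temp directory, the sample files, and the names sample.dir, read_whole_file and women.texts are hypothetical stand-ins, not part of the original answer, and the tm call is guarded in case the package is not installed.

```r
# Hypothetical stand-ins for the real study files, written to a temp directory
sample.dir <- file.path(tempdir(), "samples")
dir.create(sample.dir, showWarnings = FALSE)
study.files <- file.path(sample.dir, paste0("author", 1:5, ".txt"))
for (i in seq_along(study.files)) {
  writeLines(c(paste("First line by author", i), "Second line"), study.files[i])
}

# Subset the paths with a numeric vector, as in the question
women <- c(1, 3)
women.files <- study.files[women]

# Collapse each file into a single string so that one file becomes one document
read_whole_file <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}
women.texts <- vapply(women.files, read_whole_file, character(1))

# Build the corpus only if tm is available
if (requireNamespace("tm", quietly = TRUE)) {
  women_corpus <- tm::Corpus(tm::VectorSource(women.texts))
  print(length(women_corpus))
}
```

Because women.texts is a named character vector (the names are the file paths), it should be easy to trace each document back to its source file.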

Answer 2

I had a similar problem where I was clustering documents based on their cosine similarity, and I wanted to analyse the individual clusters separately but didn't want to have to organise the documents into separate folders.

Looking at the documentation for DirSource, there is a pattern argument that takes a regular expression (“Only file names which match the regular expression will be returned”), so I used the clustering information to group the documents and construct a regex pattern for each cluster.
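
As a rough base R sketch of what building such a pattern can look like: the file names below are hypothetical, and the ^( ... )$ anchoring is an extra assumption added here so that names such as author2.txt.bak are not picked up by accident.

```r
# Hypothetical file names belonging to one group
filenames <- c("author2.txt", "author4.txt", "author5.txt")

# Escape the literal dots so that "." does not match any character
escaped <- gsub(".", "[.]", filenames, fixed = TRUE)

# Join the alternatives and anchor them to the whole file name
select_pattern <- paste0("^(", paste(escaped, collapse = "|"), ")$")

# Check the pattern against a hypothetical directory listing
all_files <- c(paste0("author", 1:5, ".txt"), "author2.txt.bak")
matched <- all_files[grepl(select_pattern, all_files)]
print(matched)
```

The resulting select_pattern is a single string, which is the form the pattern argument expects.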

Using the example from the question, you could take a similar approach:

library(tidyverse)
library(tm)

study.files <- c(
  "C:/Desktop/Samples/author1.txt"
  ,"C:/Desktop/Samples/author2.txt"
  ,"C:/Desktop/Samples/author3.txt"
  ,"C:/Desktop/Samples/author4.txt"
  ,"C:/Desktop/Samples/author5.txt"
)

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

# putting this into a data.frame
doc_df <- data.frame(document = study.files) %>% 
  # categorise each of the documents using the numeric vectors 
  # defined above, as per original example
  mutate(
    index = row_number()
    , gender = if_else(index %in% women, 'woman', 'man')
    # separate the file name from the full path
    , filename = basename(as.character(document))
    ) %>% 
  group_by(gender) %>%
  # build the regex select pattern
  mutate(select_pattern = str_replace_all(paste0(filename, collapse = '|'), '[.]', "[.]")) %>%
  summarise(select_pattern = first(select_pattern))
  
men_df <- doc_df %>% filter(gender == 'man')
woman_df <- doc_df %>% filter(gender == 'woman')

# you can then use this to load a subset of documents from a single directory using regex
men_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = men_df$select_pattern[1]))
woman_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = woman_df$select_pattern[1]))
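
Before building the corpus, it may be worth doing a dry run of the pattern with list.files. This is only a sketch: it assumes DirSource filters file names the same way list.files does, and the temp directory and empty sample files are hypothetical.

```r
# Hypothetical directory with five empty sample files
sample.dir <- file.path(tempdir(), "pattern_check")
dir.create(sample.dir, showWarnings = FALSE)
invisible(file.create(file.path(sample.dir, paste0("author", 1:5, ".txt"))))

# The pattern for the men, in the form the pipeline above produces
men_pattern <- "author2[.]txt|author4[.]txt|author5[.]txt"

# List the files the pattern selects, without reading any of them
matched <- list.files(sample.dir, pattern = men_pattern)
print(matched)
```

If matched lists anything unexpected, the pattern can be tightened before any documents are loaded.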

2 answers

solution1
1 ACCEPTED 2016-03-17 12:54:36

solution2
0 2020-06-30 11:45:24
