
Create a Corpus from a List of File Paths in R

I have 1030 individual .txt files in a directory, which represent all the participants in a research study.

I have successfully created a corpus for use with the tm package in R out of all the files in the directory.

Now I'm trying to create corpora from numerous subsets of these files. For example, one corpus of all the female authors and one of the male authors.

I was hoping to pass subsets of a list of file paths to the Corpus function, but this has not worked.

Any help is appreciated. Here is an example to build from:

pathname <- c("C:/Desktop/Samples")

study.files <- list.files(path = pathname, pattern = NULL, all.files = TRUE, full.names = TRUE, recursive = TRUE, ignore.case = TRUE, include.dirs = TRUE)

### This gives me a character vector that is equivalent to:

study.files <- c("C:/Desktop/Samples/author1.txt","C:/Desktop/Samples/author2.txt","C:/Desktop/Samples/author3.txt","C:/Desktop/Samples/author4.txt","C:/Desktop/Samples/author5.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

### Here are the things I've tried to create a corpus from the subsetted list. None of these work.

women_corpus <- Corpus(women.files)
women_corpus <- Corpus(DirSource(women.files))
women_corpus <- Corpus(DirSource(unlist(women.files)))

The subsets I need to create are rather elaborate, so I can't easily make new folders containing only the text files of interest for each corpus.

I think this works the way you want.

pathname <- c("C:/data/test")

study.files <- list.files(path = pathname, pattern = NULL, all.files = TRUE, full.names = TRUE, recursive = TRUE, ignore.case = TRUE, include.dirs = FALSE)

### This gives me a character vector that is equivalent to:

study.files <- c("C:/data/test/test1/test1.txt",
                 "C:/data/test/test2/test2.txt",
                 "C:/data/test/test3/test3.txt")

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2)

### This creates new character vectors containing the file paths
women.files <- study.files[women]
men.files <- study.files[men]

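### Read the text of each file and build the corpus from the text rather than from the paths

# read.table() returns one data frame per file; column V1 holds the lines of text
# (quote = "" and comment.char = "" keep quotes and '#' in the text from being misparsed)
nedir <- lapply(women.files, function(filename) read.table(filename, sep = "\t", quote = "", comment.char = "", stringsAsFactors = FALSE))
# extract the text column, giving one character vector per document
hepsi <- lapply(nedir, function(x) x$V1)
women_corpus <- Corpus(VectorSource(hepsi))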
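
A variation on the answer above, offered as a minimal sketch rather than a definitive implementation: it reads each file with readLines() instead of read.table(), so tabs and quotes in the text need no special handling. The temporary directory, the sample file contents, and the names women.texts and women_corpus_uri are assumptions made for illustration. The last line also tries tm's URISource(), which takes a character vector of URIs; passing plain file paths to it is assumed to work with the installed tm version.

```r
# Sketch only: temporary files stand in for the study's .txt files.
sample.dir <- file.path(tempdir(), "corpus_sketch")
dir.create(sample.dir, showWarnings = FALSE)
study.files <- file.path(sample.dir, paste0("author", 1:5, ".txt"))
for (i in seq_along(study.files)) {
  writeLines(c(paste("Text written by author", i), "A second line."), study.files[i])
}

# Subset the paths exactly as in the question
women <- c(1, 3)
women.files <- study.files[women]

# Read each file and collapse its lines into one string per document
women.texts <- vapply(
  women.files,
  function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
  character(1)
)
names(women.texts) <- basename(women.files)

# Build the corpora only if the tm package is installed
if (requireNamespace("tm", quietly = TRUE)) {
  # one document per element of the character vector
  women_corpus <- tm::Corpus(tm::VectorSource(women.texts))
  # alternative: let tm read the files itself from the vector of paths
  women_corpus_uri <- tm::Corpus(tm::URISource(women.files))
}
```

Either way, the subset is chosen by indexing the vector of paths, so no files have to be moved into separate folders.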

I had a similar problem: I was clustering documents based on their cosine similarity and wanted to analyse the individual clusters separately, without having to organise the documents into separate folders.

The documentation for DirSource describes an option to pass in a regular expression pattern ("Only file names which match the regular expression will be returned"), so I used the clustering information to group the documents and construct a regex pattern for each cluster.

Using the example above, you could take a similar approach:

library(tidyverse)
library(tm)

study.files <- c(
  "C:/Desktop/Samples/author1.txt"
  ,"C:/Desktop/Samples/author2.txt"
  ,"C:/Desktop/Samples/author3.txt"
  ,"C:/Desktop/Samples/author4.txt"
  ,"C:/Desktop/Samples/author5.txt"
)

### I define my subsets with numeric vectors

women <- c(1,3)
men <- c(2,4,5)

# putting this into a data.frame
doc_df <- data.frame(document = study.files) %>% 
  # categorise each of the documents using the numeric vectors 
  # defined above, as per original example
  mutate(
    index = row_number()
    , gender = if_else(index %in% women, 'woman', 'man')
    # separate the file name from the full path
    , filename = basename(as.character(document))
    ) %>% 
  group_by(gender) %>%
  # build the regex select pattern: escape the dots, join the names with '|',
  # and anchor the result so that only whole file names match
  summarise(select_pattern = paste0('^(', paste0(str_replace_all(filename, '[.]', '[.]'), collapse = '|'), ')$'))
  
men_df <- doc_df %>% filter(gender == 'man')
woman_df <- doc_df %>% filter(gender == 'woman')

# you can then use this to load a subset of documents from a single directory using regex
men_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = men_df$select_pattern[1]))
woman_corpus <- Corpus(DirSource("C:/Desktop/Samples/", pattern = woman_df$select_pattern[1]))
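
Before handing a pattern to DirSource, it may help to check which files it actually selects. The following is a base-R sketch under a few assumptions: build_pattern is a hypothetical helper (not part of tm), the temporary directory and file names are made up, and list.files() is used as a stand-in on the assumption that DirSource filters file names the same way.

```r
# Hypothetical helper (not part of tm): turn file paths into an anchored regex
build_pattern <- function(paths) {
  # escape dots so that "." matches a literal dot only
  escaped <- gsub(".", "[.]", basename(paths), fixed = TRUE)
  # join the alternatives and anchor them to the whole file name
  paste0("^(", paste(escaped, collapse = "|"), ")$")
}

# Sketch only: a temporary directory with a few look-alike file names
sample.dir <- file.path(tempdir(), "pattern_sketch")
dir.create(sample.dir, showWarnings = FALSE)
all.names <- c(paste0("author", 1:5, ".txt"), "coauthor1.txt", "author3.txt.bak")
for (f in file.path(sample.dir, all.names)) writeLines("placeholder text", f)

women.pattern <- build_pattern(file.path(sample.dir, c("author1.txt", "author3.txt")))

# list.files() applies the regex to file names, so it previews the selection
matched <- list.files(sample.dir, pattern = women.pattern)
print(women.pattern)
print(matched)
```

Without the ^ and $ anchors, the same pattern would also pick up coauthor1.txt and author3.txt.bak, which is why the helper anchors the alternatives.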
