What is the right way to make a corpus with readtext and quanteda?

I need some help. I'm trying to make some corpus samples using the quanteda package, but it doesn't work as expected.

library(quanteda)
library(readtext)

news <- corpus(readtext('./final/en_US/en_US.news.txt', dvsep = ' '))
#Yeah, it's from Coursera

And then I try to take a sample from the whole corpus:

set.seed(362)
newsSample <- corpus_sample(news, size = 5000)

RStudio tells me that it "Cannot take a sample larger than the population", but I'm sure the population is much larger than size: the file has about 77k lines. One more thing: after readtext I got a data frame with 1 obs. of 2 variables. The second variable is the whole text of the file.

What am I doing wrong?

You only have 1 document in the corpus when you use readtext to read in a single file. There might be 77k lines in the file, but they all belong to 1 document, not 77k documents. If you check the result of readtext, you will see only 1 value in the doc_id column, and all the text sits in a single cell of the text column. See the differences in the example below.

library(readtext)
library(quanteda)
DATA_DIR <- system.file("extdata/", package = "readtext")

rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"),
                docvarsfrom = "filenames", 
                docvarnames = c("unit", "context", "year", "language", "party"),
                encoding = "LATIN1")
rt2
readtext object consisting of 1 document and 5 docvars.
# Description: df[,7] [1 x 7]
  doc_id                  text                unit  context  year language party
  <chr>                   <chr>               <chr> <chr>   <int> <chr>    <chr>
1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..." EU    euro     2004 de       PSE  

my_corp <- corpus(rt2)
my_corp
Corpus consisting of 1 document and 5 docvars.
EU_euro_2004_de_PSE.txt :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."

and

rl1 <- readLines(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"))
           
my_corp_rl1 <- corpus(rl1)
my_corp_rl1
Corpus consisting of 100 documents.
text1 :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."

text2 :
""

text3 :
"GEMEINSAM WERDEN WIR STÄRKER Fünf Verpflichtungen für die nä..."

text4 :
"Manifest der Sozialdemokratischen Partei Europas für die Wah..."

text5 :
"PARTY OF EUROPEAN SOCIALISTS · Tel +32 2 284 29 76 · Fax +32..."

text6 :
""

[ reached max_ndoc ... 94 more documents ]

Using readLines and then corpus creates a corpus with 100 documents, but each "document" is just one line of the file (some of them empty), and that is not a correct definition of a corpus.
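If the goal is a larger population to sample from, a single-document corpus can also be split into smaller units after it is created. The following is only a minimal sketch, assuming corpus_reshape behaves as documented in your installed quanteda version; the text and the names one_doc, by_sentence and sampled are made up for illustration.

```r
library(quanteda)

# Hypothetical stand-in for a file read by readtext(): one document, three sentences.
one_doc <- corpus(c(doc1 = "First sentence here. Second sentence here. Third sentence here."))
ndoc(one_doc)   # a single document, so there is nothing to sample from

# Split the single document into one document per sentence.
by_sentence <- corpus_reshape(one_doc, to = "sentences")
ndoc(by_sentence)

# Sampling now works because the population is larger than the sample size.
set.seed(362)
sampled <- corpus_sample(by_sentence, size = 2)
ndoc(sampled)
```

Whether sentences, paragraphs or lines are the right unit depends on what a "document" should mean for your analysis.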

corpus_sample samples the documents in the corpus. So if you have 100 documents in there, corpus_sample(my_corpus, 50) would sample 50 different documents.
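To make the document-level behaviour concrete, here is a small sketch on a made-up corpus (toy_corp and the other names are illustrative, not from the question). It also reproduces the "larger than the population" situation and shows that sampling with replacement is one way around it, if that fits your use case.

```r
library(quanteda)

# Ten tiny made-up documents, one per element of the character vector.
toy_corp <- corpus(paste("document number", 1:10))

# Sample 5 distinct documents out of 10.
set.seed(362)
half <- corpus_sample(toy_corp, size = 5)
ndoc(half)

# Asking for more documents than exist fails when sampling without replacement...
too_many <- try(corpus_sample(toy_corp, size = 20), silent = TRUE)
inherits(too_many, "try-error")

# ...but is allowed when sampling with replacement.
with_repl <- corpus_sample(toy_corp, size = 20, replace = TRUE)
ndoc(with_repl)
```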

You need to decide what kind of sampling you need: documents or features. For features, use dfm_sample with margin = "features". See the help in quanteda for more info. Also decide whether the sampling should happen after text cleaning, removing stopwords, and so on.
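As a rough illustration of sampling features after cleaning, the sketch below builds a document-feature matrix from made-up text, removes English stopwords, and then picks feature columns at random by index. It deliberately does not rely on the margin argument, whose availability may depend on the quanteda version installed (an assumption worth checking with ?dfm_sample); all object names are hypothetical.

```r
library(quanteda)

# Made-up three-document corpus.
toy_corp <- corpus(c(d1 = "the cat sat on the mat",
                     d2 = "the dog chased the cat",
                     d3 = "a bird sang on the roof"))

# Clean first: tokenize, then drop English stopwords.
toks <- tokens_remove(tokens(toy_corp), stopwords("en"))
dfmat <- dfm(toks)

# Sample 3 feature columns by index; every document is kept.
set.seed(362)
keep <- sample(nfeat(dfmat), size = 3)
dfmat_feat <- dfmat[, keep]

ndoc(dfmat_feat)
nfeat(dfmat_feat)
```

Sampling after cleaning means stopwords can never be among the sampled features; sampling before cleaning would give a different result.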
