Computing n-grams on large corpus using R and Quanteda

I am trying to build n-grams from a large corpus of text (object size about 1 GB in R) using the great Quanteda package. I don't have a cloud resource available, so I am using my own laptop (Windows and/or Mac, 12 GB RAM) to do the computation.

If I sample down the data into pieces, the code works and I get a (partial) dfm of n-grams of various sizes, but when I try to run the code on the whole corpus, I unfortunately hit memory limits at this corpus size and get the following error (example code for unigrams, single words):

> dfm(corpus, verbose = TRUE, stem = TRUE,
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
Creating a dfm from a corpus ...
... lowercasing
... tokenizing
... indexing documents: 4,269,678 documents
... indexing features: 
Error: cannot allocate vector of size 1024.0 Mb

In addition: Warning messages:
1: In unique.default(allFeatures) :
  Reached total allocation of 11984Mb: see help(memory.size)

It is even worse if I try to build n-grams with n > 1:

> dfm(corpus, ngrams = 2, concatenator=" ", verbose = TRUE,
     ignoredFeatures = stopwords("english"),
     removePunct = TRUE, removeNumbers = TRUE)

Creating a dfm from a corpus ...
... lowercasing
... tokenizing
Error: C stack usage  19925140 is too close to the limit

I found this related post, but it looks like that was an issue with dense matrix coercion that was later solved, and it doesn't help in my case.

Are there better ways to handle this with a limited amount of memory, without having to break the corpus data into pieces?

[EDIT] As requested, sessionInfo() output:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.6 dplyr_0.4.3      quanteda_0.9.4  

loaded via a namespace (and not attached):
 [1] magrittr_1.5    R6_2.1.2        assertthat_0.1  Matrix_1.2-3    rsconnect_0.4.2 DBI_0.3.1      
 [7] parallel_3.2.3  tools_3.2.3     Rcpp_0.12.3     stringi_1.0-1   grid_3.2.3      chron_2.3-47   
[13] lattice_0.20-33 ca_0.64

Yes there is, and it is exactly by breaking it into pieces, but hear me out. Instead of importing the whole corpus, import a piece of it (if it is in multiple files, import file by file; if it is in one giant txt file, fine, use readLines). Compute your n-grams, store them in another file, read the next file/line, and store the n-grams again. This is more flexible and will not run into RAM issues (it will of course take quite a bit more space than the original corpus, depending on the value of n). Later, you can access the n-grams from the files as usual.
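A minimal sketch of that loop, assuming the whole text sits in one big file (the file name "corpus.txt" and the chunk size are placeholders of mine) and reusing the dfm() arguments from the question (quanteda 0.9.4-era API):

library(quanteda)

con <- file("corpus.txt", open = "r")
chunk_size <- 100000                      # lines per pass; tune to available RAM
i <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break           # stop at end of file
  i <- i + 1
  dfm_chunk <- dfm(corpus(lines), ngrams = 2, concatenator = " ",
                   ignoredFeatures = stopwords("english"),
                   removePunct = TRUE, removeNumbers = TRUE)
  saveRDS(dfm_chunk, sprintf("ngrams_chunk_%03d.rds", i))  # store this chunk's n-grams
  rm(dfm_chunk); gc()                     # free memory before the next pass
}
close(con)

Each .rds file can then be read back individually (or combined) whenever the n-grams are needed.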

Update as per comment.

As for loading, sparse matrices/arrays sound like a good idea, and come to think of it, they might be a good idea for storage too (particularly if you happen to be dealing with bigrams only). If your data is that big, you'll probably have to look into indexing anyway (that should help with storage: instead of storing the words in the bigrams, index all words and store the index tuples). But it also depends on what your "full n-gram model" is supposed to be for. If it is to look up the conditional probability of (a relatively small number of) words in a text, then you could just do a search (grep) over the stored n-gram files. I'm not sure the indexing overhead would be justified for such a simple task. If you actually need all 12 GB worth of n-grams in a model, and the model has to calculate something that cannot be done piece by piece, then you still need a cluster/cloud.
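For illustration, a hypothetical base-R sketch of that indexing idea (the bigrams vector and all object names here are made up, not taken from the original code):

# assume `bigrams` is a character vector of space-separated bigrams,
# e.g. c("of the", "in the", "of the")
vocab <- unique(unlist(strsplit(bigrams, " ", fixed = TRUE)))
ids   <- setNames(seq_along(vocab), vocab)              # word -> integer id
parts <- do.call(rbind, strsplit(bigrams, " ", fixed = TRUE))
bigram_ids <- cbind(ids[parts[, 1]], ids[parts[, 2]])   # two integer columns instead of two strings

Storing the integer pairs (plus the small vocab lookup table) usually takes far less space than storing the raw word pairs.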

But one more piece of general advice, one that I frequently give to students as well: start small. Instead of 12 GB, train and test on small subsets of the data. That saves you a ton of time while you are figuring out the exact implementation and ironing out bugs, particularly if you happen to be unsure about how these things work.

Probably too late now, but I had a very similar problem recently (n-grams, R, Quanteda and a large text source). I searched for two days and could not find a satisfactory solution, posted on this forum and others, and didn't get an answer. I knew I had to chunk the data and combine the results at the end, but couldn't work out how to do the chunking. In the end I found a somewhat inelegant solution that worked, and answered my own question in the following post here.

I sliced up the corpus using the 'tm' package's VCorpus, then fed the chunks to quanteda using the corpus() function.
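A rough sketch of that workflow, with the file name, chunk size and object names as my own placeholders (it also assumes, as described above, that corpus() accepts the tm VCorpus objects directly):

library(tm)
library(quanteda)

raw_lines <- readLines("big_corpus.txt")   # hypothetical source file
chunks    <- split(raw_lines, ceiling(seq_along(raw_lines) / 100000))

dfm_chunks <- lapply(chunks, function(x) {
  vc <- VCorpus(VectorSource(x))           # one slice as a tm corpus
  dfm(corpus(vc), ngrams = 2, concatenator = " ",
      ignoredFeatures = stopwords("english"),
      removePunct = TRUE, removeNumbers = TRUE)
})
# save or combine the per-chunk dfms afterwards, depending on available memory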

I thought I would post it there since I provide the code solution. Hopefully it will prevent others from spending two days searching.
