
Computing cosine similarities on a large corpus in R using quanteda

I am trying to work with a very large corpus of about 85,000 tweets, which I want to compare to dialog from television commercials. However, due to the size of my corpus, I am unable to compute the cosine similarity measure without getting an "Error: cannot allocate vector of size n" message (26 GB in my case).

I am already running 64-bit R on a server with plenty of memory. I have also tried the AWS instance with the most memory (244 GB), but to no avail (same error).

Is there a way to use a package like fread to get around this memory limitation, or do I just have to invent a way to break up my data? Thanks very much for the help; I've appended the code below:

x <- NULL
y <- NULL
num <- NULL
z <- NULL
ad <- NULL
for (i in 1:nrow(ad.corp$documents)){
  num <- i
  ad <- paste("ad.num",num,sep="_")
  x <- subset(ad.corp, ad.corp$documents$num== yoad)
  z <- x + corp.all
  z$documents$texts <- as.character(z$documents$texts)
  PolAdsDfm <- dfm(z, ignoredFeatures = stopwords("english"), groups = "num",stem=TRUE, verbose=TRUE, removeTwitter=TRUE)
  PolAdsDfm <- tfidf(PolAdsDfm)
  y <- similarity(PolAdsDfm, ad, margin="documents",n=20, method = "cosine", normalize = T)
  y <- sort(y, decreasing=T)
  if (y[1] > .7){assign(paste(ad,x$documents$texts,sep="--"), y)}
  else {print(paste(ad,"didn't make the cut", sep="****"))}  
}

The error was most likely caused by previous versions of quanteda (before 0.9.1-8, on GitHub as of 2016-01-01), which coerced dfm objects into dense matrices in order to call proxy::simil(). The newer version now works on sparse dfm objects without coercion for method = "correlation" and method = "cosine". (More sparse methods are coming soon.)
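
To check whether your installed version already takes the sparse path, something along these lines should work; the GitHub repository name in the comment is an assumption and not part of the original answer:

# 0.9.1-8 or later computes method = "cosine" and method = "correlation"
# directly on the sparse dfm, without coercing it to a dense matrix
packageVersion("quanteda")

# if older, update from CRAN, or install the development version from GitHub
# (repository name assumed):
# install.packages("quanteda")
# devtools::install_github("kbenoit/quanteda")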

I can't really follow what you are doing in the code, but it looks like you are computing pairwise similarities between documents aggregated as groups. I would suggest the following workflow:

  1. Create your dfm with the groups option for all groups of texts you want to compare.

  2. Weight this dfm with tfidf(), as you have done.

  3. Use y <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine") and then coerce this to a full, symmetric matrix using as.matrix(y). All of your pairwise document similarities are then in that matrix, and you can select those above your threshold of 0.7 directly from that object (see the sketch after this list).

    Note that there is no need to normalise term frequencies with method = "cosine". In newer versions of quanteda, the normalize argument has been removed anyway, since I think it's better workflow practice to weight the dfm prior to any computation of similarities, rather than building weightings into textstat_simil().
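
Putting the three steps together, a minimal sketch of the workflow might look like the following. It assumes ad.corp + corp.all is the combined corpus from the question and that the docvar "num" identifies the groups to compare; the threshold filter at the end is one way to pick out pairs above 0.7 and is not part of the original answer. The argument names follow the question's version of quanteda (ignoredFeatures, removeTwitter); in later releases these became remove and remove_twitter, and tfidf() became dfm_tfidf(), so adjust to your installed version.

library(quanteda)

# 1. one dfm over the combined corpus, grouped by the "num" docvar
all.corp  <- ad.corp + corp.all
PolAdsDfm <- dfm(all.corp, groups = "num",
                 ignoredFeatures = stopwords("english"),
                 stem = TRUE, removeTwitter = TRUE)

# 2. tf-idf weighting, applied once before any similarity computation
PolAdsDfm <- tfidf(PolAdsDfm)

# 3. all pairwise cosine similarities, computed on the sparse dfm
y      <- textstat_simil(PolAdsDfm, margin = "documents", method = "cosine")
simmat <- as.matrix(y)              # full, symmetric similarity matrix

# keep only pairs above the 0.7 threshold (ignore the diagonal of 1s)
diag(simmat) <- NA
hits <- which(simmat > 0.7, arr.ind = TRUE)

The key difference from the loop in the question is that the dfm is built and weighted once, and every pairwise similarity comes from a single call on the sparse object, so no dense matrix is ever allocated.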

Final note: I strongly suggest not accessing the internals of a corpus object using the method you have here, since those internals may change and then break your code. Use texts(z) instead of z$documents$texts, for instance, and docvars(ad.corp, "num") instead of ad.corp$documents$num.
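
Applied to the objects in the question, the accessor versions would look like this (a sketch only; the variable names on the left are illustrative):

txts  <- texts(z)                    # instead of z$documents$texts
adnum <- docvars(ad.corp, "num")     # instead of ad.corp$documents$num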
