
LSA Similarity interface

Original post: 2014-12-21 05:47:35. Tags: lsa, latent-semantic-indexing, latent-semantic-analysis

I am a PhD student in translation studies and I am currently working on my dissertation. I am using the LSA similarity interface as a method of analysis in my dissertation. My background is in linguistics, not computer science. I tried to find an easy LSA document categorisation tool but could not find any. I tried to play with Gensim, but it did not work. I think my problem is with linking my corpus (txt files) with the Gensim tool to do the analysis (I don't know how to do this step). I would greatly appreciate it if anyone could help me with the analysis or direct me to any tool or easy tutorial for doing it with Gensim.

I want to do the following: I want to apply document-document queries to retrieve the 5 most relevant documents from the corpus for each query document.

I have 15 query documents.
I have one corpus of 150 texts. The texts are short stories.

I am desperate and I was hesitant to post this question here. I am sure that applying LSA in translation studies would add to the field, and this makes me more persistent in finding a way to do my analysis.

1 answer

The only really easy, user-friendly tool for LSA that is out there right now is http://lsa.colorado.edu/ . Unfortunately, it is a web-based tool only, and it does not allow you to train LSA on your own corpora. But depending on your needs, that may not matter.

If I'm understanding you correctly, you need document-document similarity scores between each of 15 query documents and each of 150 short stories (a total of 15*150=2250 similarity scores). If these query documents and short stories are in English, then you can use the version of LSA that is trained on the TASA corpus, which is used in many studies of LSA, as follows:

1. Go to http://lsa.colorado.edu/
2. Select One-To-Many Comparison
3. Copy-paste one of the short stories into the "Main text" box, and the 15 queries, separated by blank lines, into the "Texts to compare" box
4. Repeat for each of your short stories. A huge pain? Yes. But if you are desperate...
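If you go the manual route, a few lines of Python can at least prepare the text for the "Texts to compare" box. This is only a sketch under assumptions: the function names are made up for illustration, the query files are assumed to be UTF-8 .txt files in one folder, and each query is flattened to a single paragraph so that the only blank lines left are the separators between queries.

```python
from pathlib import Path


def load_folder(folder):
    """Read every .txt file in a folder, sorted by file name."""
    paths = sorted(Path(folder).glob("*.txt"))
    return [p.read_text(encoding="utf-8") for p in paths]


def one_paragraph(text):
    """Collapse all whitespace so the text has no blank lines of its own."""
    return " ".join(text.split())


def compare_box(query_texts):
    """Join the query texts with exactly one blank line between them."""
    return "\n\n".join(one_paragraph(t) for t in query_texts)
```

With the queries in a folder called, say, queries (a hypothetical name), print(compare_box(load_folder("queries"))) gives one block of text to paste, and the same block can be reused for every short story.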

If you program a little bit in Python or R, other tools for LSA include http://clic.cimec.unitn.it/composes/toolkit/introduction.html and http://cran.r-project.org/web/packages/lsa/lsa.pdf , which would save you the manual labor of the above suggestion. Also, I know you already tried Gensim, but there is a nice tutorial for it at http://radimrehurek.com/gensim/tutorial.html that you might try following if you haven't already.