
Using HBase to fetch data to calculate Text Similarities using Mahout

In my project we are trying to calculate the text similarity of a set of documents, and I am facing 2 issues.

  1. I do not want to recalculate the term frequency of documents I have already processed. For example, I have 10 documents and have calculated the term frequency (TF) and inverse document frequency (IDF) for all 10. Then 2 more documents arrive. Now I do not want to recalculate the TF for the existing 10 documents; I only want to calculate the TF for the 2 new ones, then use the TFs of all 12 documents and calculate the IDF over the 12 documents as a whole. How can I calculate the IDF of all the documents without recalculating the TFs of the existing ones?

  2. The number of documents will keep increasing, which means the in-memory approach (InMemoryBayesDatastore) may become cumbersome. What I want is to save the TF of every document in an HBase table; when new documents arrive, I calculate their TF, save it in the HBase table, and then fetch the TFs of all the documents from that table to calculate the IDF. How can I use HBase to provide data to Mahout's text-similarity computation instead of fetching it from a sequence file?
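For issue 1, one common approach (a plain-Java sketch, not Mahout's API) is to cache the per-document term counts and maintain a running document-frequency map; when new documents arrive, only their TFs are computed, and the IDF is re-derived from the updated counts:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: incremental IDF. Per-document TFs are computed once and cached;
// only the document-frequency (df) counts and the corpus size N change
// when new documents arrive, so IDF = ln(N / df) can be re-derived cheaply
// without touching the existing documents.
public class IncrementalIdf {
    // docId -> (term -> frequency), computed once per document
    private final Map<String, Map<String, Integer>> tfByDoc = new HashMap<>();
    // term -> number of documents containing the term
    private final Map<String, Integer> df = new HashMap<>();

    // Called once per document; never re-run for existing documents.
    public void addDocument(String docId, String[] tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        tfByDoc.put(docId, tf);
        for (String term : tf.keySet()) df.merge(term, 1, Integer::sum);
    }

    // IDF over the whole corpus, using only the cached df counts.
    public double idf(String term) {
        int n = tfByDoc.size();
        return Math.log((double) n / df.getOrDefault(term, 1));
    }
}
```

Adding 2 new documents to an index of 10 then only calls `addDocument` twice; the 10 cached TF maps are untouched and `idf` reflects all 12 documents.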

I assume that in your MR job you are reading from HDFS and writing to HBase. What I suggest, if I understand your problem correctly, is to calculate the TF for each document and store the term as the rowkey; the qualifier can be the documentID, and the value can be the frequency (just a suggestion for your schema). You will need one MR job per document, and you only have to run that job once per document.
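The suggested layout (term rowkey, documentID qualifier, frequency value) can be modeled in memory to see how per-term rows accumulate; in the real reducer each entry would become one HBase `Put` with the same key structure. This is an illustrative stand-in, not the actual job code:

```java
import java.util.TreeMap;

// In-memory model of the suggested HBase schema:
// rowkey = term, column qualifier = documentID, cell value = frequency.
// In the real reducer each (term, docId, freq) triple would be written as
//   new Put(termBytes).addColumn(family, docIdBytes, Bytes.toBytes(freq))
public class TermRowTable {
    // term -> (docId -> TF), mirroring row -> qualifier -> value
    private final TreeMap<String, TreeMap<String, Integer>> rows = new TreeMap<>();

    public void putTf(String term, String docId, int freq) {
        rows.computeIfAbsent(term, k -> new TreeMap<>()).put(docId, freq);
    }

    // One "row scan": every document containing this term, with its TF.
    // The size of this map is the term's document frequency, so IDF
    // can be computed directly from a single row read.
    public TreeMap<String, Integer> row(String term) {
        return rows.getOrDefault(term, new TreeMap<>());
    }
}
```

With this layout, one row scan per term gives both the document frequency (number of qualifiers) and every per-document TF needed for the final comparison job.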

Do this for each document you are analyzing, as it arrives.

Then run a final MR job to compare all the documents on a per-term (i.e. per-row) basis. This works for exact terms, but it gets complicated with 'similar terms'. For those you would want to run some sort of algorithm that takes into account, for example, the Levenshtein distance between terms, which can be complicated.
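The per-row comparison step can be sketched as follows: each term row contributes tf(d1) * tf(d2) to the dot product of every pair of documents sharing that term, and dividing by the vector norms yields cosine similarity. This is an illustrative sequential sketch of what the final MR job computes, not Mahout's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Per-term (per-row) document comparison: each row contributes
// tf(target) * tf(other) to the dot product of every document pair
// sharing the term; dividing by the vector norms gives cosine similarity.
public class RowSimilarity {
    // rows: term -> (docId -> TF), i.e. the schema suggested above.
    // Returns cosine similarity of `target` against every co-occurring doc.
    public static Map<String, Double> cosineAgainst(
            String target, Map<String, Map<String, Integer>> rows) {
        Map<String, Double> dot = new HashMap<>();
        Map<String, Double> norm2 = new HashMap<>();
        for (Map<String, Integer> docs : rows.values()) {
            // Accumulate squared norms for every document seen in this row.
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                norm2.merge(e.getKey(),
                        (double) e.getValue() * e.getValue(), Double::sum);
            }
            Integer tfTarget = docs.get(target);
            if (tfTarget == null) continue;  // row has no contribution
            for (Map.Entry<String, Integer> e : docs.entrySet()) {
                if (!e.getKey().equals(target)) {
                    dot.merge(e.getKey(),
                            (double) tfTarget * e.getValue(), Double::sum);
                }
            }
        }
        Map<String, Double> sim = new HashMap<>();
        for (Map.Entry<String, Double> e : dot.entrySet()) {
            sim.put(e.getKey(), e.getValue()
                    / Math.sqrt(norm2.get(target) * norm2.get(e.getKey())));
        }
        return sim;
    }
}
```

In the MR version, the map phase would emit per-pair partial products from each row and the reduce phase would sum them, which is why keying the table by term makes this job a straightforward row scan.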
