How does (Carrot) clustering work in Solr?

Question

I am running Lucene/Solr 4 to test different features, including clustering. Currently, 1 million documents are indexed. Every document has the following fields:

ID (unique Key) Example1: 10245
               Example2: 24974
TOPIC (Keywords of the document) Example1: "disaster/japan/nuclear power station"
                                 Example2: "world/japan/nuclear power"
HEADLINE (1 line of text): Example1: "explosion at nuclear power plant in japan"
                           Example2: "news about japans nuclear power plant"
TEXT (the full text): "In the Japanese nuclear power plant in Fukushima..."

All the fields are indexed and stored, except TEXT, which is only indexed, not stored. I use the following specific configuration:

  <str name="carrot.title">TOPIC</str>
  <str name="carrot.snippet">HEADLINE</str>

If you look at the examples, you can see that the TOPIC values differ, but both contain "japan". Is it possible to configure Solr/Carrot so that Example1 and Example2 end up in one cluster because of the matching "japan"?

Further, there could be a third TOPIC such as "news/nuclear power", with no "japan" in it, while the HEADLINE and TEXT use the words "japans power plant". Which Solr/Carrot configuration is relevant to get all three news items into one cluster?

Thank you!

Answer 1

Carrot2 is designed to cluster natural, unstructured text, and such algorithms will very rarely produce results that a human would find perfect. Unfortunately, such algorithms are also hard to "debug": the clusters they produce depend on many factors, such as the frequencies with which words occur in your documents. In your specific example, the word "Japan" may not have been chosen to form a cluster because it is too frequent; it appears in all of the documents you quoted.

Here are a few tips you may want to try to tweak the clusters:

Try separating keywords with a period followed by a space rather than a slash, e.g. "disaster. japan. nuclear power station". If you do that, Carrot2 will treat word sequences, such as "nuclear power station", as phrases rather than individual words.
Try a different Carrot2 clustering algorithm, e.g. STC.
If there is a chance to get your full story text field stored (or maybe part of it, such as the first paragraph), use the HEADLINE for carrot.title and the full text / excerpt for carrot.snippet.
Play with the specific settings of Carrot2 algorithms. The best tool for this would be Carrot2 Clustering Workbench. Here's how to connect it to Solr: http://wiki.apache.org/solr/ClusteringComponent#Tuning_Carrot2_clustering
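
As a rough sketch of the second and third tips, the solrconfig.xml fragment below declares an STC engine and points carrot.title and carrot.snippet at HEADLINE and an excerpt field. It is modeled on the clustering section of the example solrconfig.xml shipped with Solr 4, so check it against your own version. The engine name "stc", the handler name "/clustering", and the stored field EXCERPT (e.g. the first paragraph of TEXT) are assumptions made for illustration, not part of the original setup.

```xml
<!-- Sketch only: EXCERPT is a hypothetical stored field; the engine name
     "stc" and the handler path "/clustering" are illustrative choices. -->
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">stc</str>
    <!-- Use Carrot2's STC algorithm for this engine -->
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">stc</str>
    <bool name="clustering.results">true</bool>
    <!-- Headline as the document title, excerpt as the clustered text -->
    <str name="carrot.title">HEADLINE</str>
    <str name="carrot.snippet">EXCERPT</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```

Carrot2 clusters the stored field values, so whichever fields carrot.title and carrot.snippet name need to be stored, which is why the unstored TEXT field cannot be used directly.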

Answer 1 (accepted; score 4; 2011-07-14 10:38:28)
