How does (Carrot) clustering work in Solr?
I am running Lucene/Solr 4 to test different features, including clustering. Currently, 1 million documents are indexed. Every document has the following fields:
ID (unique Key) Example1: 10245
Example2: 24974
TOPIC (Keywords of the document) Example1: "disaster/japan/nuclear power station"
Example2: "world/japan/nuclear power"
HEADLINE (1 line of text): Example1: "explosion at nuclear power plant in japan"
Example2: "news about japans nuclear power plant"
TEXT (the full text): "In the Japanese nuclear power plant in Fukushima..."
All fields are indexed and stored except TEXT, which is only indexed, not stored. I use the following configuration:
<str name="carrot.title">TOPIC</str>
<str name="carrot.snippet">HEADLINE</str>
If you look at the examples, you can see that the TOPIC values differ, but "japan" appears in both. Is it possible to configure Solr/Carrot so that Example1 and Example2 end up in one cluster because of the matching "japan"?
Furthermore, there could be a third TOPIC such as "news/nuclear power", with no "japan" in it, while its HEADLINE and TEXT contain the words "japans power plant". Which Solr/Carrot configuration is relevant for getting all three news items into one cluster?
Thank you!
Carrot2 is designed to cluster natural, unstructured text, and such algorithms will very rarely produce results that a human would find perfect. Unfortunately, such algorithms are also hard to "debug": the clusters they produce depend on many factors, such as the frequencies with which words occur in your documents. In your specific example, the word "Japan" may not have been chosen to form a cluster because it is too frequent: it appears in all of the documents you quoted.
Here are a few tips you may want to try to tweak the clusters:
Try separating keywords with a period followed by a space rather than a slash, e.g. "disaster. japan. nuclear power station". If you do that, Carrot2 will treat word sequences, such as "nuclear power station", as phrases rather than individual words.
Try a different Carrot2 clustering algorithm, e.g. STC.
If there is a chance to get your full story text field stored (or maybe part of it, such as the first paragraph), use HEADLINE for carrot.title and the full text or excerpt for carrot.snippet.
Play with the specific settings of the Carrot2 algorithms. The best tool for this is the Carrot2 Clustering Workbench. Here's how to connect it to Solr: http://wiki.apache.org/solr/ClusteringComponent#Tuning_Carrot2_clustering
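As a rough sketch of the STC and field-mapping tips above, the relevant parts of solrconfig.xml might look like the fragment below. It assumes the stock Solr 4 clustering setup; the engine name "stc" and the handler name "/clustering" are illustrative choices, and using TEXT as the snippet only works once that field (or an excerpt of it) is stored.

```xml
<!-- Sketch only: assumes the stock Solr 4 ClusteringComponent setup. -->
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <!-- "stc" is an arbitrary engine name chosen for this example -->
    <str name="name">stc</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">stc</str>
    <bool name="clustering.results">true</bool>
    <!-- Headline as title, per the tip above -->
    <str name="carrot.title">HEADLINE</str>
    <!-- Assumes TEXT (or an excerpt field) has been changed to stored="true" -->
    <str name="carrot.snippet">TEXT</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```

Whether this actually pulls the three example documents into one cluster still depends on the word frequencies across the whole index, so treat it as a starting point for experiments rather than a guaranteed fix.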