
Solr common keyword/phrases

I am using Solr through PHP to search all aspects of my site. I am trying to implement a feature and can't find any information on how to accomplish it.

I have a group of documents (reviews), each about a specific product.

I want to find unique 1-2 word keywords (no stop words) that appear in multiple reviews for a single product, with a count of how many reviews each appears in.

Once I have that, I want to show the top X keywords, the number of reviews each one appears in, and a single top review for each one with the keyword highlighted.

EDIT:

Once I have a list of unique keywords (excluding stop words and common words) that appear in multiple reviews, I want to rank them by the number of reviews they appear in. For example, if people are writing reviews about cameras, the keywords might look like this:

expensive (appears in 7 reviews)
shutter speed (appears in 5 reviews)
poor image (appears in 3 reviews)

Once I have those keywords ranked by number of reviews, I want to select 1 review per keyword and show those reviews with the keyword highlighted. For example:

"... unfortunately this camera is far too EXPENSIVE for what you get ..." (in 7 reviews)
"... the SHUTTER SPEED is far too slow for ..." (in 5 reviews)
"... the POOR IMAGE quality is this camera's biggest downfall ..." (in 3 reviews)

As for when to run this, I'm still not sure. It could run in real time (when a product is viewed, with the result cached for X time), it could run for products that get marked for update whenever a new review is posted, or it could run daily in a cron job. It will not be run against all keywords at once; it will be run against all keywords in all reviews for a single product, then repeated for each product.

Hope that makes more sense.

Any help on how to accomplish this in Solr would be greatly appreciated.

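It sounds like what you are looking for is the ShingleFilter. You can use it to produce unigrams/bigrams (possibly with a copyField) and then pull statistics on those tokens to generate your interface.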
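One possible way to pull those statistics is Solr field faceting, which reports how many matching documents contain each indexed term. The sketch below only builds the request URL. The core URL, the `product_id` field, and the `review_shingles` field (imagined here as the copyField target analyzed with the ShingleFilter) are assumptions made for illustration, not names from the question.

```python
from urllib.parse import urlencode

def build_keyword_facet_url(select_url, product_id,
                            shingle_field="review_shingles", top_n=10):
    """Build a Solr select URL that facets on a shingled field for one product.

    select_url, product_id and shingle_field are hypothetical names.
    """
    params = [
        ("q", "*:*"),
        ("fq", f"product_id:{product_id}"),  # restrict to one product's reviews
        ("rows", 0),                          # return facet counts only, no documents
        ("facet", "true"),
        ("facet.field", shingle_field),       # field holding unigram/bigram tokens
        ("facet.mincount", 2),                # keep terms found in at least 2 reviews
        ("facet.limit", top_n),               # top X keywords
        ("wt", "json"),
    ]
    return select_url + "?" + urlencode(params)

url = build_keyword_facet_url("http://localhost:8983/solr/reviews/select", 42)
print(url)
```

With a facet limit set, Solr sorts facet values by count by default, so the response lists the most widely used terms first. A second query per keyword with highlighting enabled could then fetch one review to display.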

This task is not particularly well suited to Solr. The only thing you gain from using Solr is the stemming/stop word support, which would be much faster if implemented in a local algorithm. I would create a new "review_keyword" table in the database that maps reviews to single keywords and keyword pairs. When inserting a new review, also add one row to this table for each keyword in the review (this is where stemming and stop words kick in). When you want the top keywords for a product, along with a review for each one, you can run a join select across this table. Depending on your usage, this would be better run on updates rather than on queries.
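A minimal sketch of that table and the join, using an in-memory SQLite database. The column names and both helper functions are assumptions, and the keywords are passed in already extracted, so stemming and stop word removal are presumed to happen before `add_review` is called.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE review (id INTEGER PRIMARY KEY, product_id INTEGER, body TEXT);
CREATE TABLE review_keyword (
    review_id INTEGER REFERENCES review(id),
    keyword   TEXT,
    PRIMARY KEY (review_id, keyword)   -- one row per keyword per review
);
""")

def add_review(conn, review_id, product_id, body, keywords):
    """Insert a review plus one mapping row per (already extracted) keyword."""
    conn.execute("INSERT INTO review VALUES (?, ?, ?)",
                 (review_id, product_id, body))
    conn.executemany("INSERT OR IGNORE INTO review_keyword VALUES (?, ?)",
                     [(review_id, kw) for kw in keywords])

def top_keywords(conn, product_id, limit=10):
    """Return (keyword, review_count, sample_review_id) for keywords in 2+ reviews."""
    return conn.execute("""
        SELECT rk.keyword, COUNT(*) AS n, MIN(r.id) AS sample_review
        FROM review_keyword rk
        JOIN review r ON r.id = rk.review_id
        WHERE r.product_id = ?
        GROUP BY rk.keyword
        HAVING COUNT(*) > 1
        ORDER BY n DESC, rk.keyword
        LIMIT ?""", (product_id, limit)).fetchall()

add_review(conn, 1, 7, "far too expensive", ["expensive"])
add_review(conn, 2, 7, "expensive, slow shutter speed",
           ["expensive", "shutter speed"])
add_review(conn, 3, 7, "shutter speed is slow and it is expensive",
           ["expensive", "shutter speed"])
print(top_keywords(conn, 7))
# [('expensive', 3, 1), ('shutter speed', 2, 2)]
```

Here the sample review is simply the lowest review id per keyword. Picking a "top" review would need some ranking column, such as a helpfulness score, which the question does not specify.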

This looks like a job for a text parser rather than Solr. You will need a script, probably in Python (since it has good text parsing libraries), that looks at all the words in the reviews and gives you the top occurring words, with their counts, either within each review or across all reviews. Then you can take a few words on either side of these top occurring words, create an abstract for your document (the product in this case), and index it in Solr to be returned as part of the search result.
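A rough sketch of such a script, using only the standard library. The stop word list is a tiny illustrative placeholder, and the function names are made up for this example; a real script would likely use a proper stop word list and stemming.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much larger.
STOP_WORDS = {"the", "is", "for", "this", "too", "far", "a", "and",
              "of", "it", "what", "you"}

def candidate_terms(text):
    """Return the set of non-stop-word unigrams and bigrams in one review."""
    words = re.findall(r"[a-z']+", text.lower())
    unigrams = {w for w in words if w not in STOP_WORDS}
    bigrams = {f"{a} {b}" for a, b in zip(words, words[1:])
               if a not in STOP_WORDS and b not in STOP_WORDS}
    return unigrams | bigrams

def top_terms(reviews, top_n=3):
    """Rank terms by the number of reviews they appear in (2 or more)."""
    doc_freq = Counter()
    for text in reviews:
        doc_freq.update(candidate_terms(text))  # a set, so each review counts once
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(term, n) for term, n in ranked if n > 1][:top_n]

def snippet(text, term, context=30):
    """Return the text around the first whole-word match, term upper-cased."""
    m = re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
    if m is None:
        return None
    start = max(0, m.start() - context)
    end = min(len(text), m.end() + context)
    return ("... " + text[start:m.start()] + m.group(0).upper()
            + text[m.end():end] + " ...")

reviews = [
    "This camera is far too expensive for what you get",
    "The shutter speed is slow",
    "Expensive, and the shutter speed is poor",
]
for term, count in top_terms(reviews, top_n=10):
    print(term, count, snippet(reviews[0], term))
```

Note that a bigram such as "shutter speed" also produces the unigrams "shutter" and "speed" with the same count, so some de-duplication step may be wanted before display.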
