

What is the best way to filter search queries - PHP MySQL

I'm building a site where users can search the posts. Each post is stored in a database. When a user makes a search, e.g. iPad Mini FOR SALE, the query will look like:

SELECT * FROM testtable WHERE title REGEXP 'iPad|Mini|FOR|SALE'

The query then returns these 3 items:

  • Selling iPad Mini
  • Selling iPad
  • Looking for authentic Gold Watches

The search worked for the first two items, but the third item really doesn't belong in the group. I want to filter it out and show only items relevant to the search. I'm thinking of taking out common words like for, is, are, etc., but maybe you have other suggestions?

Side note: do you recommend REGEXP? I just saw it and used it, and haven't dived into it yet. (No need to answer this, just the search filter question, but any good info would be great.)

You should also take a look at FULLTEXT search. To make FULLTEXT search work you need a FULLTEXT index on the column. Traditionally that meant the MyISAM table engine; InnoDB supports FULLTEXT indexes too (since MySQL 5.6), but I don't know much about it.
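As a rough illustration of the FULLTEXT suggestion, the sketch below builds a natural-language `MATCH ... AGAINST` query for the `testtable.title` column from the question. It is written in Python only to keep the example runnable; the helper name `build_fulltext_query` is hypothetical, the `%s` placeholder style is the one common Python MySQL drivers use (PDO in PHP would use `?`), and it assumes a FULLTEXT index already exists on the column.

```python
def build_fulltext_query(search_text, table="testtable", column="title"):
    """Return (sql, params) for a MySQL natural-language FULLTEXT search.

    Assumes a FULLTEXT index already exists on the column, e.g.
        ALTER TABLE testtable ADD FULLTEXT (title);
    """
    # Table and column names cannot be bound as parameters, so they must
    # come from trusted code, never from user input.
    sql = (
        f"SELECT *, MATCH({column}) AGAINST (%s) AS score "
        f"FROM {table} "
        f"WHERE MATCH({column}) AGAINST (%s) "
        "ORDER BY score DESC"
    )
    # The search text is bound twice: once for the score column,
    # once for the WHERE filter.
    return sql, (search_text, search_text)


sql, params = build_fulltext_query("iPad Mini FOR SALE")
print(sql)
print(params)
```

MySQL's FULLTEXT search ignores its own built-in stop words and orders rows by its own relevance score, so it may cover part of what the question asks for without custom filtering. It is worth checking how it behaves on your data before relying on it.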

Yes, remove common words

These are called stop words. They are words that are generally irrelevant to a search.
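A minimal sketch of stop-word removal, in Python for illustration. The `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for this example; a real list would be much longer and tuned to your site.

```python
# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def remove_stop_words(query):
    """Split a search string into terms and drop the stop words."""
    return [term for term in query.split() if term.lower() not in STOP_WORDS]


terms = remove_stop_words("iPad Mini FOR SALE")
print(terms)            # ['iPad', 'Mini', 'SALE']
print("|".join(terms))  # iPad|Mini|SALE
```

With FOR removed, the pattern no longer matches "Looking for authentic Gold Watches", which only matched the original pattern through the word "for". If you keep using REGEXP, remember that user input can contain regex metacharacters, so each term needs escaping before it goes into the pattern.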

Consider relevance

A post titled 'ipad mini for sale' is very relevant for a user searching [ipad mini for sale]. A post titled 'ipad for sale' is less relevant. A post titled 'cheese factory for sale' is less relevant still.

Consider deriving an algorithm for calculating what you deem relevant, given the posts on your site and the terms users search for.

The algorithm may be as simple as looking at the terms searched for and the occurrence of the terms in the post title. Are all terms searched for present in the title? Probably very relevant. Are 10% of the terms searched for present in the title? Probably very irrelevant.
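One possible reading of this simple algorithm, sketched in Python: score a title by the fraction of distinct search terms that appear in it as whole words. The function name `relevance` and the whitespace-based word splitting are assumptions of this sketch, not something the approach requires.

```python
def relevance(query, title):
    """Fraction of distinct search terms that appear as whole words in the title."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    title_words = set(title.lower().split())
    return len(terms & title_words) / len(terms)


query = "ipad mini for sale"
print(relevance(query, "ipad mini for sale"))       # 1.0
print(relevance(query, "ipad for sale"))            # 0.75
print(relevance(query, "cheese factory for sale"))  # 0.5
```

Note that 'cheese factory for sale' still scores 0.5 here purely because of "for" and "sale", which is one reason to drop stop words before scoring.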

Consider how you want to calculate a relevance score. Set a threshold below which results are deemed too irrelevant to be shown. From experience, I'd suggest setting the threshold quite high and aiming for highly relevant results only, perhaps listing less relevant results only if no highly relevant results can be found.
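The threshold-with-fallback idea could look roughly like this in Python. It takes titles that have already been scored between 0 and 1 by whatever method you choose; the `select_results` name and the 0.75 default are illustrative assumptions, not recommended values.

```python
def select_results(scored, threshold=0.75):
    """Pick results to show from a list of (title, score) pairs.

    Returns only results at or above the threshold; if there are none,
    falls back to any result with a non-zero score.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    strong = [pair for pair in ranked if pair[1] >= threshold]
    if strong:
        return strong
    # Nothing cleared the threshold: show weaker matches instead,
    # but still leave out titles that matched nothing at all.
    return [pair for pair in ranked if pair[1] > 0]


scored = [
    ("Selling iPad", 0.5),
    ("Selling iPad Mini", 1.0),
    ("Looking for authentic Gold Watches", 0.0),
]
print(select_results(scored))  # [('Selling iPad Mini', 1.0)]
```

If only "Selling iPad" (0.5) and the watches post (0.0) were in the table, the fallback would return just "Selling iPad" rather than an empty page.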

Use stemming

As an aside, use stemming in your search. A stemming algorithm reduces a word to a common stem. You then search for the stem only, not the full search term. Read up on stemming. Find an implementation of the Porter stemming algorithm for the language you are using; it's a long-standing algorithm and, from experience, it's fast and good enough for most applications.
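To show where stemming fits in, here is a deliberately crude sketch in Python. It is not the Porter algorithm, only naive suffix stripping with a hypothetical `crude_stem` helper and an assumed minimum stem length of three letters; in practice you would swap in a real Porter implementation.

```python
def crude_stem(word):
    """Strip one common English suffix. A naive stand-in, NOT Porter stemming."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def stem_terms(text):
    """Return the set of stems for every whitespace-separated word in text."""
    return {crude_stem(word) for word in text.split()}


print(sorted(stem_terms("Selling iPads")))  # ['ipad', 'sell']
print(sorted(stem_terms("sell ipad")))      # ['ipad', 'sell']
```

Because both the search terms and the titles are reduced to the same stems, a search for "sell ipad" can match a post titled "Selling iPads". A crude stripper like this gets many words wrong, which is exactly what a proper stemmer is for.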

REGEXP?

If you remove stop words and use a stem-based approach, this will be a less relevant concern. In any case, it's a matter of implementation and is likely too subjective a matter to get you a meaningful answer. Try it, examine performance. Try another approach, examine performance. Use what works best for you.
