Keyword relevance PHP MySQL Search Engine

I don't know why I can't find this anywhere. I would think this would be a pretty common request. I am writing a search engine in PHP to search a MySQL database of For Sale listings for keywords entered by the user.

There are several columns in the table, but only 2 need to be searched. They are named file_Title and file_Desc. Think of it like a classified ad: an item title and a description.

So, for example, a user would search for 'John Deere Lawn Tractor'. What I would like to happen is for classifieds that contain all 4 of those words to show up at the top of the list, then results that contain only 3, and so on.

I've read a very good webpage at http://www.roscripts.com/PHP_search_engine-119.html

From that author's example I have the following code:

<?php
    $search = 'John Deere Lawn Tractors';
    $keywords = split(' ', $search);

    $sql = "SELECT DISTINCT COUNT(*) As relevance, id, file_Title, file_Desc FROM Listings WHERE (";

    foreach ($keywords as $keyword) {
        echo 'Keyword is ' . $keyword . '<br />';
        $sql .= "(file_Title LIKE '%$keyword%' OR file_Desc LIKE '%$keyword%') OR ";
    }
    $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR

    $sql .= ") GROUP BY id ORDER BY relevance DESC";
    echo 'SQL is ' . $sql;  

    $query = mysql_query($sql) or die(mysql_error());
    $Count = mysql_num_rows($query);
    if($Count != 0) {
        echo '<br />' . $Count . ' RESULTS FOUND';
        while ($row_sql = mysql_fetch_assoc($query)) {//echo out the results
            echo '<h3>'.$row_sql['file_Title'].'</h3><br /><p>'.$row_sql['file_Desc'].'</p>';
        }
    } else  {
        echo "No results to display";
    }

?>

The SQL string it outputs is this:

 SELECT DISTINCT COUNT(*) As relevance, id, file_Title, file_Desc FROM Listings 
  WHERE ((file_Title LIKE '%John%'
    OR file_Desc LIKE '%John%')
    OR (file_Title LIKE '%Deere%' 
    OR file_Desc LIKE '%Deere%') 
    OR (file_Title LIKE '%Lawn%' 
    OR file_Desc LIKE '%Lawn%') 
    OR (file_Title LIKE '%Tractors%' 
    OR file_Desc LIKE '%Tractors%') ) 
 GROUP BY id 
 ORDER BY relevance DESC

With this code I get 275 results from my DB. My problem is that it doesn't really order by the number of keywords found in the row. It seems to order the results by id instead. If I remove 'GROUP BY id', it returns only 1 result instead of all of them, which is really messing with me!

I've also tried switching to FULLTEXT in the DB but can't seem to get that going either, so I'd prefer to stick with the LIKE %Keyword% syntax.

Any help is appreciated! Thanks!

I would suggest a totally different approach. Your approach is cumbersome, inefficient, and heavy on the DB, and it will likely become very slow as more and more records are added to your database.

What I would suggest is the following:

  1. Create a separate table for keywords.
  2. Create a list of non-keywords you don't want to index (common English prepositions and the like) so that they are not included. You can probably find a ready-made list of them online.
  3. When a new entry is added, split the string into separate keywords, omit the ones listed in step 2, and insert the rest into the table created in step 1 (if they are not already in it).
  4. In a separate table, with a foreign key pointing to the keywords table, associate the classified_ad with each keyword.

Steps 3 and 4 must happen again if your classified_ad is edited (i.e. any associations inserted in step 4 are deleted from the association table, and the keywords are analysed again and reassociated with the classified ad).
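The keyword-splitting part of steps 2 and 3 could be sketched roughly as follows. This is only an illustration: the function name `extract_keywords` and the tiny stop-word list in the usage line are made up for the example, and the tokenizer assumes plain ASCII text.

```php
<?php
// Hypothetical helper (not from the original post): turn listing text into
// a list of unique, lower-cased keywords with the stop words removed.
// Assumes ASCII text; anything other than a-z and 0-9 is a separator.
function extract_keywords(string $text, array $stopWords): array
{
    // Lower-case, then split on runs of non-alphanumeric characters.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Drop the words we do not want to index (step 2).
    $kept = array_filter($words, function ($word) use ($stopWords) {
        return !in_array($word, $stopWords, true);
    });

    // De-duplicate and re-index so the result is a plain list.
    return array_values(array_unique($kept));
}

// Example call; each returned keyword would then be inserted into the
// keywords table if missing and linked to the ad in the association table.
$keywords = extract_keywords('John Deere Lawn Tractor for the lawn', ['for', 'the']);
```

When an ad is edited, the same function can be run again on the new text after the ad's old associations have been deleted.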

Once you have this structure, all you have to do is search the association table and order by the number of matched keywords. You can even add an extra column to it that stores the number of occurrences of that keyword in the ad, so that you can order by that too.

That will be much faster.
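A search against such a structure might look something like the sketch below. The table and column names (`keywords`, `ad_keywords`, `ad_id`, `keyword_id`, `word`) are assumptions made for the example, as is the function name; the original post does not specify a schema. The function only builds the SQL and its parameter list, so it can be used with any prepared-statement API.

```php
<?php
// Assumed (hypothetical) schema, not given in the original post:
//   keywords(id, word)             one row per distinct keyword
//   ad_keywords(ad_id, keyword_id) association table, one row per ad/keyword pair
// If each (ad_id, keyword_id) pair is unique, COUNT(*) per ad equals the
// number of search keywords that ad matched.
function build_keyword_search(array $keywords): array
{
    if (count($keywords) === 0) {
        throw new InvalidArgumentException('At least one keyword is required');
    }

    // One "?" placeholder per keyword, so values are never pasted into the SQL.
    $placeholders = implode(', ', array_fill(0, count($keywords), '?'));

    $sql = "SELECT ak.ad_id, COUNT(*) AS relevance
            FROM ad_keywords ak
            JOIN keywords k ON k.id = ak.keyword_id
            WHERE k.word IN ($placeholders)
            GROUP BY ak.ad_id
            ORDER BY relevance DESC";

    return [$sql, array_values($keywords)];
}

// Possible usage with PDO (connection setup omitted):
// list($sql, $params) = build_keyword_search(['john', 'deere', 'lawn', 'tractor']);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

Ads matching all four keywords would come back with a relevance of 4, then those matching three, and so on.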

I once used a script called Sphider which does something similar. I'm not sure if it is still maintained, but it works in a very similar way on the web pages it parses.

I know you said you had problems with FULLTEXT, but I would highly encourage you to go back and try that again. FULLTEXT indexes and searches are designed to do what you are doing, and when MATCH is used in the WHERE clause (in the default natural-language mode, with no explicit ORDER BY), MySQL automatically sorts the rows from highest to lowest relevance.

For more information on FULLTEXT, check out http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

Also, take special note of the comment by Patrick O'Lone on the same page, part of which is quoted below:

It should be noted in the documentation that IN BOOLEAN MODE will almost always return a relevance of 1.0. In order to get a relevance that is meaningful, you'll need to:

 SELECT MATCH('Content') AGAINST ('keyword1 keyword2') as Relevance
 FROM table
 WHERE MATCH ('Content') AGAINST('+keyword1 +keyword2' IN BOOLEAN MODE)
 HAVING Relevance > 0.2
 ORDER BY Relevance DESC

Notice that you are doing a regular relevance query to obtain relevance factors combined with a WHERE clause that uses BOOLEAN MODE. The BOOLEAN MODE gives you the subset that fulfills the requirements of the BOOLEAN search, the relevance query fulfills the relevance factor, and the HAVING clause (in this case) ensures that the document is relevant to the search (i.e. documents that score less than 0.2 are considered irrelevant). This also allows you to order by relevance.
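Applied to the Listings table from the question, the same pattern might be sketched as below. It assumes a FULLTEXT index covering both file_Title and file_Desc already exists; the index name in the comment and the function name `build_fulltext_search` are invented for the example. Note that the column names passed to MATCH are written unquoted here. The `$requireAll` flag is also an addition of this sketch: with `+` prefixes every word must be present, while without them any word may match, which is closer to the "all 4 first, then 3" ordering asked for.

```php
<?php
// Assumes a FULLTEXT index over both searched columns, for example
// (hypothetical index name):
//   ALTER TABLE Listings ADD FULLTEXT INDEX ft_title_desc (file_Title, file_Desc);
function build_fulltext_search(array $keywords, bool $requireAll = false): array
{
    if (count($keywords) === 0) {
        throw new InvalidArgumentException('At least one keyword is required');
    }

    // Natural-language search string, used only to compute the score.
    $natural = implode(' ', $keywords);

    // Boolean-mode search string, used to pick the rows. A leading "+"
    // makes a word mandatory; a bare word is optional.
    $boolean = $requireAll ? '+' . implode(' +', $keywords) : $natural;

    $sql = "SELECT id, file_Title, file_Desc,
                   MATCH(file_Title, file_Desc) AGAINST (?) AS relevance
            FROM Listings
            WHERE MATCH(file_Title, file_Desc) AGAINST (? IN BOOLEAN MODE)
            ORDER BY relevance DESC";

    return [$sql, [$natural, $boolean]];
}

// Possible usage with PDO (connection setup omitted):
// list($sql, $params) = build_fulltext_search(['John', 'Deere', 'Lawn', 'Tractor']);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

User input containing boolean-mode operators such as `+`, `-` or `*` would change the meaning of the search, so a real implementation would probably want to strip those from each keyword first.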
