keyword relevance PHP MySQL Search Engine

I don't know why I can't find this anywhere; I would think this would be a pretty common request. I am writing a search engine in PHP that searches a MySQL database of For Sale listings for keywords entered by the user.

There are several columns in the table but only 2 that will need to be searched. They are named file_Title & file_Desc. Think of it like a classified ad. An item title and a description.

So, for example, a user might search for 'John Deere Lawn Tractor'. I would like classifieds that contain all 4 of those words to show up at the top of the list, then results that contain only 3, and so on.

I've read a very good webpage at http://www.roscripts.com/PHP_search_engine-119.html

From that author's example I have the following code:

<?php
    $search = 'John Deere Lawn Tractors';
    $keywords = split(' ', $search);

    $sql = "SELECT DISTINCT COUNT(*) As relevance, id, file_Title, file_Desc FROM Listings WHERE (";

    foreach ($keywords as $keyword) {
        echo 'Keyword is ' . $keyword . '<br />';
        $sql .= "(file_Title LIKE '%$keyword%' OR file_Desc LIKE '%$keyword%') OR ";
    }
    $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR

    $sql .= ") GROUP BY id ORDER BY relevance DESC";
    echo 'SQL is ' . $sql;  

    $query = mysql_query($sql) or die(mysql_error());
    $Count = mysql_num_rows($query);
    if($Count != 0) {
        echo '<br />' . $Count . ' RESULTS FOUND';
        while ($row_sql = mysql_fetch_assoc($query)) {//echo out the results
            echo '<h3>'.$row_sql['file_Title'].'</h3><br /><p>'.$row_sql['file_Desc'].'</p>';
        }
    } else  {
        echo "No results to display";
    }

?>

The SQL string it outputs is this:

 SELECT DISTINCT COUNT(*) As relevance, id, file_Title, file_Desc FROM Listings 
  WHERE ((file_Title LIKE '%John%'
    OR file_Desc LIKE '%John%')
    OR (file_Title LIKE '%Deere%' 
    OR file_Desc LIKE '%Deere%') 
    OR (file_Title LIKE '%Lawn%' 
    OR file_Desc LIKE '%Lawn%') 
    OR (file_Title LIKE '%Tractors%' 
    OR file_Desc LIKE '%Tractors%') ) 
 GROUP BY id 
 ORDER BY relevance DESC

With this code I get 275 results from my DB. My problem is it really doesn't order by the number of keywords found in the row. It seems to order the results by id instead. If I remove 'GROUP BY id' then it only returns 1 result instead of all of them, which is really messing with me!

I've also tried shifting to FULLTEXT in the db but can't seem to get that going either so I'd prefer to stick with LIKE %Keyword% syntax.

Any help is appreciated! Thanks!

I would suggest a totally different approach. Your approach is cumbersome, inefficient, and heavy on the DB, and it will likely become very slow as more records are added to your database.

What I would suggest is the following:

  1. Create a separate table for keywords.
  2. Create a list of non keywords you don't want to index (like the common English prepositions etc.) so that they are not included. You can probably find a list of them online, readily available.
  3. When a new entry is added, split the string into separate keywords, omitting the ones listed in step 2, and insert them into the table created in step 1 (if they are not already in it).
  4. In a separate table, with a foreign key pointing to the keywords table, associate the classified_ad with the keyword.

Steps 3 and 4 must be repeated whenever a classified_ad is edited (i.e. the rows inserted for that ad in step 4 are deleted from the association table, and the keywords are analysed again and reassociated with the classified ad).

Once you have this structure, all you have to do is search the association table and order by the number of matched keywords. You can even add an extra column to it and put the number of occurrences of that keyword in the article, so that you order by that too.

That will be much faster.
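
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a tested implementation: the stop-word list is a tiny placeholder, and the table and column names (keywords, listing_keywords, word, keyword_id, listing_id) are hypothetical. It also assumes PHP 7 or later and that each (listing, keyword) pair is stored once in the association table, so COUNT(*) equals the number of matched keywords.

```php
<?php
// Hypothetical stop-word list (step 2); a real list would be much longer.
const STOP_WORDS = ['a', 'an', 'and', 'for', 'in', 'of', 'on', 'the', 'to', 'with'];

/**
 * Step 3: split text into unique lowercase keywords, dropping stop words.
 */
function extractKeywords(string $text): array
{
    // Split on anything that is not an ASCII letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $words = array_diff($words, STOP_WORDS);
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($words));
}

/**
 * Build a parameterised query that ranks listings by the number of
 * search keywords they are associated with.
 * Returns [sql, params] for use with a prepared statement.
 */
function buildKeywordSearch(array $keywords): array
{
    if (count($keywords) === 0) {
        throw new InvalidArgumentException('At least one keyword is required.');
    }
    // One "?" placeholder per keyword, e.g. "?, ?, ?".
    $placeholders = implode(', ', array_fill(0, count($keywords), '?'));
    $sql = "SELECT lk.listing_id, COUNT(*) AS relevance
            FROM listing_keywords lk
            JOIN keywords k ON k.id = lk.keyword_id
            WHERE k.word IN ($placeholders)
            GROUP BY lk.listing_id
            ORDER BY relevance DESC";
    return [$sql, array_values($keywords)];
}
```

With a PDO connection, the returned pair could be passed to $pdo->prepare($sql) and $stmt->execute($params). Listings matching all of the search keywords then come first, followed by those matching fewer.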

I once used a script called Sphider, which does something similar. I am not sure whether it is still maintained, but it works in a very similar way on the web pages it parses.

I know you said you had problems with FULLTEXT, but I would highly encourage you to go back and try it again. FULLTEXT indexes and searches are designed to do exactly what you are doing, and when MATCH() is used in the WHERE clause of a natural-language search, MySQL automatically sorts the rows from highest to lowest relevance.

For more information on FULLTEXT, check out http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
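
As a rough sketch of what this might look like for the Listings table from the question: it assumes a FULLTEXT index covering both searched columns (the index name ft_listings, the constant and the function name are made up for illustration), a PDO connection, and PHP 7 or later. Note that in the MySQL 5.0 series FULLTEXT indexes are only available on MyISAM tables.

```php
<?php
// One-time setup (index name ft_listings is an assumption):
//   ALTER TABLE Listings ADD FULLTEXT ft_listings (file_Title, file_Desc);

// MATCH() must list the same columns that the FULLTEXT index covers.
// Two differently named placeholders are used because a named placeholder
// cannot be reused when PDO's emulated prepares are turned off.
const LISTING_SEARCH_SQL =
    "SELECT id, file_Title, file_Desc,
            MATCH (file_Title, file_Desc) AGAINST (:q1) AS relevance
     FROM Listings
     WHERE MATCH (file_Title, file_Desc) AGAINST (:q2)
     ORDER BY relevance DESC";

/**
 * Natural-language full-text search over title and description,
 * returning rows with the highest relevance first.
 */
function searchListings(PDO $pdo, string $search): array
{
    $stmt = $pdo->prepare(LISTING_SEARCH_SQL);
    $stmt->execute([':q1' => $search, ':q2' => $search]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Calling searchListings($pdo, 'John Deere Lawn Tractors') would then return matching rows as associative arrays. The explicit ORDER BY is redundant when MySQL already sorts by relevance, but it makes the intent visible.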

Also, take special note of the comment by Patrick O'Lone on the same page, part of which is quoted below:

It should be noted in the documentation that IN BOOLEAN MODE will almost always return a relevance of 1.0. In order to get a relevance that is meaningful, you'll need to:

SELECT MATCH(Content) AGAINST ('keyword1 keyword2') AS Relevance FROM table WHERE MATCH(Content) AGAINST ('+keyword1 +keyword2' IN BOOLEAN MODE) HAVING Relevance > 0.2 ORDER BY Relevance DESC

Notice that you are doing a regular relevance query to obtain relevance factors combined with a WHERE clause that uses BOOLEAN MODE. The BOOLEAN MODE gives you the subset that fulfills the requirements of the BOOLEAN search, the relevance query fulfills the relevance factor, and the HAVING clause (in this case) ensures that the document is relevant to the search (i.e. documents that score less than 0.2 are considered irrelevant). This also allows you to order by relevance.
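
Applied to the question's table, the combination described in that comment might be put together like this. It is a sketch only: the helper name toBooleanQuery is hypothetical, the FULLTEXT index on (file_Title, file_Desc) is assumed to exist, and the arrow function requires PHP 7.4 or later. Because every word gets a leading +, only listings containing all of the words pass the filter; dropping the + would let partial matches through as well.

```php
<?php
/**
 * Turn a free-text search into a boolean-mode expression that requires
 * every word, e.g. "John Deere" becomes "+John +Deere".
 */
function toBooleanQuery(string $search): string
{
    // Split on runs of characters other than ASCII letters, digits and
    // underscores, so user-typed boolean operators (+, -, *, ") are dropped.
    $words = preg_split('/\W+/', $search, -1, PREG_SPLIT_NO_EMPTY);
    return implode(' ', array_map(fn(string $w): string => '+' . $w, $words));
}

// Filter with the boolean expression, rank with the natural-language score.
$sql = "SELECT id, file_Title, file_Desc,
               MATCH (file_Title, file_Desc) AGAINST (:plain) AS relevance
        FROM Listings
        WHERE MATCH (file_Title, file_Desc) AGAINST (:boolean IN BOOLEAN MODE)
        ORDER BY relevance DESC";

$search = 'John Deere Lawn Tractors';
$params = [
    ':plain'   => $search,
    ':boolean' => toBooleanQuery($search), // '+John +Deere +Lawn +Tractors'
];
```

The $sql and $params pair is meant to be handed to a prepared statement, for example $pdo->prepare($sql) followed by $stmt->execute($params).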

