Tuning Sphinx search for product search

We have a very simple product catalog stored in a MySQL table, and we need to build a quality product search that is as fast and as relevant as possible. The products database will be quite large (about 500,000 products), which is why searches using LIKE, which cannot use indexes, are very slow. We already tried MySQL full-text search, which was fast but did not produce satisfying results, especially for searches containing numbers (such as "LR-41", which is a battery type).

Our products catalog includes many fields but the only ones we need to search on are:

product_id = bigint
title = varchar(255)
description = text

After many suggestions we finally tried using Sphinx search and made a config like:

source mysearch {
  type=mysql
  sql_host=...
  sql_user=...
  sql_pass=...
  sql_port=...
  sql_query_pre = SET NAMES utf8
  sql_query = SELECT product_id, title, description FROM products
  sql_query_info = SELECT * FROM products WHERE product_id=$id 
}

index fulltext { 
    source  = mysearch
    path = /var/lib/sphinxsearch/data/mysearch
    docinfo = extern
    mlock = 0
    morphology = stem_en, metaphone
    min_word_len = 1
    blend_chars = +, &, U+23, -
    blend_mode = trim_both
    html_strip = 1 
}

indexer {
    mem_limit = 256M 
}

searchd {
    listen = 9312 
    # everything else set to default
}

For the website backend we use PHP, with the following code:

<?php
$sphinx = new SphinxClient();
$sphinx->SetServer('localhost', 9312);
$sphinx->SetMatchMode(SPH_MATCH_EXTENDED);
$sphinx->setFieldWeights(array(
    'product_id' => 10,
    'title' => 7,
    'description' => 3
));
$sphinx->setLimits(0, 200, 1000, 5000);
$sphinx->SetRankingMode(SPH_RANK_PROXIMITY_BM25);
$sphinx->AddQuery($_GET['query'], "fulltext");
$results = $sphinx->RunQueries();
print_r($results);
?>

This is just a demo script to test the search, but it returns totally wrong results whatever I use as the query - it matches products that don't even include a word (or a substring) I am searching for.

Here are the rules I want to achieve:

  • if query matches the "product_id" the product should be ranked the highest (some frequent users know product_id and want to search by it)
  • if query is "Meter XY-123" it should match all the products that contain both or any of these words (naturally products that contain both strings should be ranked higher)
  • if query is found in title it should be ranked higher than if it is found in description
  • if someone searches for "XY-123" it should produce the same results as if he searches for "XY123" or "XY 123"
  • it should also search for substrings - e.g. if a product's title is "Foobar 123" it should be returned even if the user searches for "foo bar 123", "bar 123", "foobar 12", "foo" etc.
  • results should also be returned ordered by some kind of relevance, e.g. if I have two products "foobar 123" and "foobar 456" and the user searches for "foobar 4", then both products should be returned (match any word), but the second product should be ranked higher (because it also contains the number 4) than the first one (which doesn't contain the number 4).
  • products should also be ranked based on which field the value is found in. In this case the product_id field has a bigger weight than title, which in turn has a higher weight than description.

So the question is - how to correctly configure and use Sphinx + PHP to produce search results meeting the criteria above?

thank you!

This is just a demo script to test the search, but it returns totally wrong results whatever I use as the query

I suggest removing metaphone from morphology. It specifically enables 'fuzzy' indexing, a sort of 'sounds alike' matching, but it DOESN'T play well combined with stemming (i.e. stem_en) and leads to very confusing results.

In fact, you could perhaps remove stemming too if you are setting up prefix indexing (see below); there are difficult-to-detect edge cases if you try to use both.


if query matches the "product_id" the product should be ranked the highest (some frequent users know product_id and want to search by it)

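Sphinx doesn't include the product ID in the full-text index: the first column of sql_query is only used as the document ID. You would need to duplicate it so that it is also indexed as a field: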

sql_query = SELECT product_id as id, product_id, name,...  

if query is "Meter XY-123" it should match all the products that contain both or any of these words (naturally products that contain both strings should be ranked higher)

That means you want an 'ANY' search, whereas Sphinx defaults to 'ALL' searches. Either change the match mode to SPH_MATCH_ANY, or rewrite the query to make it 'ANY' (by injecting '|' between the words or by using the quorum operator).
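The query rewrite can be sketched in plain PHP. This is only a minimal sketch: the function names are made up for illustration and are not part of the Sphinx API, and the escaping list is an assumption about which characters the extended syntax treats as operators.

```php
<?php
// Hypothetical helpers (not part of the Sphinx API): rewrite user input
// into an "any word" query for use with SPH_MATCH_EXTENDED.

// Backslash-escape characters that the extended query syntax may read as
// operators (assumed list), so that user input cannot inject its own.
function escapeSphinxWord(string $word): string
{
    return preg_replace('/([\\\\()|\-!@~"&\/^$=])/', '\\\\$1', $word);
}

// Split the input on whitespace, drop empty pieces, escape each word.
function splitWords(string $input): array
{
    $words = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('escapeSphinxWord', $words);
}

// OR form: "Meter XY-123" becomes: Meter | XY\-123
function buildOrQuery(string $input): string
{
    return implode(' | ', splitWords($input));
}

// Quorum form: at least $min of the quoted words must match.
// "Meter XY-123" becomes: "Meter XY\-123"/1
function buildQuorumQuery(string $input, int $min = 1): string
{
    return '"' . implode(' ', splitWords($input)) . '"/' . $min;
}
```

The result would then be passed to AddQuery() in place of the raw $_GET['query'], keeping SPH_MATCH_EXTENDED as the match mode.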


if someone searches for "XY-123" it should produce the same results as if he searches for "XY123" or "XY 123"

That's very tricky. You can try adding - to blend_chars, which will sort of work, but entering, say, "XY 123" will still not match a product with "XY123". I don't think there is an easy solution to this.

There are all sorts of statistical methods for trying to rewrite the query, but by their nature they will be imprecise.
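A much cruder, non-statistical rewrite is to expand anything that looks like a product code into all three spellings. The sketch below is only an illustration: it assumes a code is a run of letters followed by a run of digits, the function name is hypothetical, and it will misfire on ordinary input (for example "Meter 5" is treated as a code too).

```php
<?php
// Hypothetical helper: expand a letters+digits code such as "XY-123",
// "XY123" or "XY 123" into an OR group containing all three spellings,
// written in Sphinx extended query syntax.
function expandCodeVariants(string $input): string
{
    return preg_replace_callback(
        // letters, an optional hyphen or whitespace character, then digits
        '/\b([A-Za-z]+)[\s-]?(\d+)\b/',
        function (array $m): string {
            // %1$s = letters, %2$s = digits; the hyphen is escaped so it
            // is not read as the NOT operator.
            return sprintf(
                '(%1$s\-%2$s | %1$s%2$s | "%1$s %2$s")',
                $m[1],
                $m[2]
            );
        },
        $input
    );
}

// All three spellings produce the same query string:
echo expandCodeVariants('XY-123'), "\n";
echo expandCodeVariants('XY123'), "\n";
echo expandCodeVariants('XY 123'), "\n";
```

Because every spelling is rewritten to the same query, the three searches return the same results, at the cost of some false positives.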


it should also search for substrings - e.g. if a product's title is "Foobar 123" it should be returned even if the user searches for "foo bar 123", "bar 123", "foobar 12", "foo" etc.

You would need to use min_prefix_len to enable prefix searches, AND set enable_star = 0.

But enable_star=0 is deprecated, so perhaps you could use expand_keywords=1 instead, which will automatically add the star for you.
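As a rough sketch, the index block could then look something like this. The value 3 is only an assumed starting point, not a recommendation. Note that a prefix index only covers the start of a word, so "foo" can match "Foobar", but "bar" matching "Foobar" would, as far as I know, need infix indexing (min_infix_len) instead.

```
index fulltext {
    source          = mysearch
    path            = /var/lib/sphinxsearch/data/mysearch

    # assumed value: shortest word prefix that gets indexed
    min_prefix_len  = 3

    # automatically adds the star to each query keyword
    expand_keywords = 1

    # remaining settings as in the original index block
}
```

The index has to be rebuilt with indexer after changing these settings.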


results should also be returned ordered by some kind of relevance

In general that will happen anyway. But you can try changing the ranking mode if you want; there are many options (this requires the extended match mode).


products should also be ranked based on which field the value is found in.

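setFieldWeights to the rescue! (You've already got that!) Note that the product_id weight only takes effect once product_id is indexed as a field, as described above.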
