
How do you design a database to allow fast multicolumn searching?

I am creating a real estate search from RETS data using MySQL, but this is a general question. When you have a variety of columns that you would like the user to be able to filter their search result by, how do you optimize this?

For example, http://www.charlestonrealestateguide.com/listings.php has 16 or so optional filters. Granted, he only has up to 11,000 entries (I have the same data), but I don't imagine the search is performed with just a giant WHERE AND AND AND ... clause. Or is this typically accomplished with one giant multicolumn index?
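
To make it concrete, the kind of query I have in mind looks roughly like this (table and column names are made up):

    -- Roughly the kind of filter query I mean:
    SELECT listing_id, address, price
    FROM   listings
    WHERE  bedrooms  >= 3
      AND  bathrooms >= 2
      AND  price BETWEEN 200000 AND 400000
      AND  property_type IN ('Ranch', 'Split Level');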

Newegg, Amazon, and countless others also have cool & fast filtering systems for large amounts of data. How do they do it? And is there a database optimization reason for the tendency to provide ranges instead of empty inputs, or is that merely for user convenience?

I believe this post by Explain Extended addresses your question. It's long and detailed, with many worked examples. I'll paste his summary here to whet your appetite:

In some cases, a range predicate (like "less than", "greater than" or "between") can be rewritten as an IN predicate against the list of values that could satisfy the range condition.

Depending on the column datatype, check constraints and statistics, that list could be comprised of all possible values defined by the column's domain; all possible values defined by the column's minimal and maximal values; or all actual distinct values contained in the table. In the latter case, a loose index scan could be used to retrieve the list of such values.

Since an equality condition is applied to each value in the list, more access and join methods could be used to build the query plan, including range conditions on secondary index columns, hash lookups, etc.

Whenever the optimizer builds a plan for a query that contains a range predicate, it should consider rewriting the range condition as an IN predicate and use the latter method if it proves more efficient.
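
A minimal sketch of that rewrite, assuming a listings table with a low-cardinality bedrooms column (the names are mine, not from the article):

    -- Range predicate on a low-cardinality column:
    SELECT * FROM listings
    WHERE  bedrooms <= 2
      AND  price <= 250000;

    -- Rewritten as an IN predicate over the values that can satisfy the range,
    -- so an index on (bedrooms, price) can be used as equality lookups on
    -- bedrooms plus a range scan on price:
    SELECT * FROM listings
    WHERE  bedrooms IN (1, 2)
      AND  price <= 250000;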

MySQL Edit

It seems that some RDBMSs do have some capability in this regard.

MySQL does have some index "merges", according to the documentation:

[Before MySQL 5.0], MySQL was able to use at most only one index for each referenced table

But as of 5.0 it supports some limited index merging (Index Merge).
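
For example (index and table names are illustrative), EXPLAIN can report an index_merge plan when two single-column indexes are intersected:

    -- Two single-column indexes:
    CREATE INDEX idx_bedrooms  ON listings (bedrooms);
    CREATE INDEX idx_bathrooms ON listings (bathrooms);

    -- With Index Merge, the optimizer may intersect both indexes instead of
    -- picking only one; EXPLAIN would then show type = index_merge and
    -- Extra = Using intersect(idx_bedrooms,idx_bathrooms).
    EXPLAIN
    SELECT * FROM listings
    WHERE  bedrooms = 3
      AND  bathrooms = 2;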

You really need to understand how indexes work and when they are useful. At what percentage of rows returned does a full table scan make more sense than an index? Would you believe that in some scenarios a full table scan is cheaper than an index scan that returns only 2% of the rows? If your bedroom histogram looks like this: 1 = 25%, 2 = 50%, 3 = 20%, >3 = 5%, then the only time an index on that column is selective enough to help is when searching for more than 3 bedrooms, and even then the optimizer may not use it because of bind variables and the clustering factor.

Think of it like this. Assume my bedroom percentages are correct. Say you have 8 KB pages (I don't know what MySQL uses) and each row is 80 bytes long. Ignoring overhead, you get 100 rows (listings) per page of disk. Since houses are added in random order (random insofar as bedroom count goes), each page will hold roughly 50 two-bedroom houses, 25 one-bedroom houses, 20 three-bedroom houses and maybe a four- or five-bedroom house. EVERY page will have at least one one-bedroom house, so you'll read EVERY page for BEDROOMS = 1, and the same goes for 2 and for 3. An index could help for five-bedroom houses, but if MySQL bind variables work like Oracle's, it won't switch plans for a particular value of BEDROOMS.
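
If you want to see how skewed the column actually is in your data, a quick sketch (table and column names assumed):

    -- Distribution of the bedrooms column, as a percentage of all listings:
    SELECT bedrooms,
           COUNT(*)                              AS listings,
           ROUND(COUNT(*) * 100.0 / t.total, 1)  AS pct
    FROM   listings
    CROSS JOIN (SELECT COUNT(*) AS total FROM listings) t
    GROUP BY bedrooms, t.total
    ORDER BY bedrooms;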

As you can see, there's a lot to understand... Far more than Jon Skeet has indicated.

Original Post

Most RDBMSs can't combine multiple indexes on a single table. If you have a table with columns A, B and C, with single-column indexes on A, B and C, and you search WHERE A = a AND B = b AND C = c, the database will pick the most selective index and use only that one.

If you create a single multicolumn index on (A, B, C), that index won't be used unless you include A = a in the WHERE clause. If your WHERE is B = b AND C = c, then that index is ignored (in most RDBMSs).
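
A quick sketch of that leading-column rule (throwaway names):

    CREATE INDEX idx_abc ON t (a, b, c);

    -- Can use idx_abc: the leading column a is constrained.
    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;
    SELECT * FROM t WHERE a = 1 AND c = 3;   -- seek on a, then filter c within the index

    -- Usually cannot use idx_abc: no predicate on the leading column a.
    SELECT * FROM t WHERE b = 2 AND c = 3;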

This is why Oracle invented the bitmap index. Bitmap indexes on A, B and C can be combined with bitwise AND and bitwise OR operations until a final set of rowids is determined; only then are the selected columns retrieved.

A bitmap index on the REGION column is shown in the last four columns.

    Row     Region   North   East   West   South
    1       North        1      0      0       0
    2       East         0      1      0       0
    3       West         0      0      1       0
    4       West         0      0      1       0
    5       South        0      0      0       1
    6       North        1      0      0       0

So if you say you want a house WHERE Region IN ('North', 'East'), you'd bitwise OR the North bitmap and the East bitmap and wind up with rows 1, 2 and 6.

If you had another column with bedroom count such as

    Row     Bedrooms   1BR   2BR
    1       1          1     0
    2       2          0     1
    3       1          1     0
    4       1          1     0
    5       2          0     1
    6       2          0     1

If you also require Bedrooms = 2, that index would return rows 2, 5 and 6, and bitwise ANDing that with the Region result leaves rows 2 and 6.
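
In Oracle SQL that combination looks roughly like this (a sketch; bitmap indexes don't exist in MySQL, and the table and index names are illustrative):

    -- Oracle syntax:
    CREATE BITMAP INDEX listings_region_bix   ON listings (region);
    CREATE BITMAP INDEX listings_bedrooms_bix ON listings (bedrooms);

    -- The optimizer can OR the North/East bitmaps, AND the result with the
    -- 2-bedroom bitmap, and only then fetch the surviving rowids:
    SELECT *
    FROM   listings
    WHERE  region IN ('North', 'East')
      AND  bedrooms = 2;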

But since you failed to mention the RDBMS I may have completely wasted my time. Oh well.

Wouldn't it be a WHERE x='y' AND a='b' etc query instead?

I'd have thought that several separate indexes should be fine there - no need for anything special.

I'm assuming that your search criteria are discrete, not free-form; that is, you are filtering on something you can quantify, like number of bedrooms or size of plot, not on whether or not it's in a "sunny location." In that case, I'd suggest building the query dynamically so that it only includes the columns the user actually filtered on. Single-column indexes are probably adequate, especially given that you don't seem to have a lot of data. If you find, though, that people are always specifying a couple of columns (number of bedrooms and number of bathrooms, for example), then adding a compound index for that combination of columns might be useful. I'd certainly let the statistics and performance drive those decisions, though.
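
As a sketch of what "build the query dynamically" means here (names are illustrative), the application only emits the predicates the user actually set:

    -- User chose "3+ bedrooms" and "2+ bathrooms"; every other filter was left
    -- blank, so no predicate for it appears at all (values would be bound
    -- parameters in real code):
    SELECT listing_id, address, price
    FROM   listings
    WHERE  bedrooms  >= 3
      AND  bathrooms >= 2;

    -- If that combination turns out to be popular, a compound index may help:
    CREATE INDEX idx_bed_bath ON listings (bedrooms, bathrooms);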

If you're only querying a single table, it will choose the best index to use, if one is applicable. From this perspective, you want to choose columns that are good discriminators and likely to be used in the filter. Limiting the number of indexes can be a good thing, if you know that certain columns will either quickly limit the number of results returned or, conversely, that a particular column isn't a good discriminator. If, for instance, 90% of your houses listed have a plot size of less than an acre and most people search for plots of less than an acre (or don't care), then an index scan based on this index is typically no better than a table scan and there's no need for the index. Indexes do cost something to compute, though for a small database such as yours with infrequent inserts that's likely not an issue.

@Jon is right, I think you probably want to combine the filter properties using an AND rather than an OR. That is, people are generally looking for a house with 3 bedrooms AND 2 bathrooms, not 3 bedrooms OR 2 bathrooms. If you have a filter that allows multiple choices, then you may want to use IN, say PropertyType IN ('Ranch','SplitLevel',...) instead of an explicit OR (it works out the same, but is more readable). Note that you'd likely be using the foreign key to the PropertyTypes table rather than the text here; I used the values just for illustration.
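
With the foreign key, the same filter might look like this (a sketch; table and column names assumed):

    SELECT l.*
    FROM   listings l
    WHERE  l.property_type_id IN (SELECT pt.property_type_id
                                  FROM   property_types pt
                                  WHERE  pt.name IN ('Ranch', 'SplitLevel'));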

What you need is a full-text search engine; Amazon and others use the same approach. Have a look at http://lucene.apache.org/, and if your platform is based on Java, higher-level abstractions are available at www.elasticsearch.com and in Hibernate Search.
