
Best index to use for PostgreSQL full text search with weighted tsvector

I'm new to databases and don't have a firm grasp on how indexing works.

I'm looking into indexing a column in my table that contains a weighted tsvector (the title is given the greatest weight, followed by the subheading and then the paragraph contents). According to the Postgres documentation, GIN is the best index type for full text search, followed by GiST. However, there is a note in chapter 12.9:

GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.

Does this mean that GIN is inefficient in my use case and I should go with GiST, or is it still the best one to use? I'm using the latest Postgres version (12).

No, you should stick with GIN indexes.
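For concreteness, here is a minimal sketch of the kind of setup the question describes. The table name `articles`, its columns, and the index name are assumptions made up for illustration; the generated-column syntax needs PostgreSQL 12 or later.

```sql
-- Hypothetical table: names are illustrative, not taken from the question.
CREATE TABLE articles (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title      text NOT NULL,
    subheading text,
    body       text,
    -- Weighted tsvector: title = A (highest), subheading = B, body = C.
    search     tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(subheading, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(body, '')), 'C')
    ) STORED
);

-- GIN index on the weighted tsvector column.
CREATE INDEX articles_search_gin ON articles USING gin (search);

-- A plain search with no weight labels in the query.
SELECT id, title
FROM articles
WHERE search @@ to_tsquery('english', 'postgres & index');
```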

The index scan acts as a filter and hopefully eliminates most of the rows, so that only a few have to be rechecked.

You probably have to fetch the table rows anyway, so unless there are many false positives found during the index scan, that won't be a lot of extra work.

The best thing would be to run some benchmarks on your own data set. That would give you an authoritative answer as to which index is better in your case.
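One possible way to run such a comparison is sketched below. It assumes a hypothetical table `articles` with a tsvector column `search`; the index names and the query are likewise made up. Building one index at a time keeps the planner from choosing between them.

```sql
-- Candidate 1: GIN. Build it, check its size, time a representative query.
CREATE INDEX articles_search_gin ON articles USING gin (search);
SELECT pg_size_pretty(pg_relation_size('articles_search_gin')) AS gin_size;
EXPLAIN ANALYZE
SELECT id FROM articles
WHERE search @@ to_tsquery('english', 'postgres & index');
DROP INDEX articles_search_gin;

-- Candidate 2: GiST. Same steps, same query.
CREATE INDEX articles_search_gist ON articles USING gist (search);
SELECT pg_size_pretty(pg_relation_size('articles_search_gist')) AS gist_size;
EXPLAIN ANALYZE
SELECT id FROM articles
WHERE search @@ to_tsquery('english', 'postgres & index');
DROP INDEX articles_search_gist;
```

Run each query several times and with realistic search terms, since the first execution is often dominated by cold caches.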

To find out how many false positives were eliminated during the bitmap heap scan, you can examine the output of EXPLAIN (ANALYZE, BUFFERS) for the query.
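As a sketch, again assuming a hypothetical table `articles` with an indexed tsvector column `search`, a query that restricts one term to weight A could be inspected like this:

```sql
-- 'postgres:A' matches the lexeme only where it carries weight label A,
-- so the heap rows found via the index must be rechecked.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title
FROM articles
WHERE search @@ to_tsquery('english', 'postgres:A & index');

-- In the Bitmap Heap Scan node of the plan, look for the line
-- "Rows Removed by Index Recheck": it counts rows the index returned
-- that the recheck then discarded. In text format the line is
-- omitted when the count is zero.
```

If that count is small relative to the rows actually returned, the recheck costs little.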

The implementation of GiST indexes for tsvector is lossy, so they also need to consult the table. That part of the documentation is odd: it seems to contrast GIN with GiST, but neither stores the weights, so there is nothing to contrast. (GiST doesn't even store the lexemes, much less the weights; it stores only a fixed-length signature in which each lexeme is hashed to a single bit.)

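Also, weights are mainly used when ranking, not when searching. A search consults them only if the tsquery itself restricts a lexeme to particular weight labels (for example 'title:A'), which is the case the documentation's recheck note refers to.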
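To illustrate where the weights come into play, here is a minimal ranking sketch. The table `articles` and its tsvector column `search` are assumptions; the array passed to ts_rank lists the multipliers for labels D, C, B, A in that order (the values shown are the documented defaults).

```sql
-- The WHERE clause can use the index; ts_rank then scores the
-- matching rows using the weight labels stored in the table's tsvector.
SELECT id, title,
       ts_rank('{0.1, 0.2, 0.4, 1.0}', search, query) AS rank
FROM articles,
     to_tsquery('english', 'postgres & index') AS query
WHERE search @@ query
ORDER BY rank DESC
LIMIT 10;
```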

About the only time GiST would be preferred for tsvector is if you want a multicolumn index where you will be ANDing together selective criteria on the different columns.
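A sketch of that multicolumn case follows. It assumes a hypothetical table `articles` with an integer column `author_id` and a tsvector column `search`; the btree_gist extension supplies the GiST operator classes for ordinary scalar types such as integers.

```sql
-- btree_gist lets plain scalar columns take part in a GiST index.
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- One GiST index covering both the scalar column and the tsvector.
CREATE INDEX articles_author_search_gist
    ON articles USING gist (author_id, search);

-- Both conditions can be evaluated within the same index scan.
SELECT id, title
FROM articles
WHERE author_id = 42
  AND search @@ to_tsquery('english', 'postgres & index');
```

Whether this beats a GIN index on `search` combined with a separate B-tree index on `author_id` depends on the data, so it is worth benchmarking both.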

