Postgres Full Text Search - Issues with Query Speed

I'm working on incorporating full-text search into my app. In the production version, a user will enter a search phrase which will be searched against 10M+ rows in a table. I'm currently testing it out with a subset of that data (~800k rows) and having some speed issues. When I run this query:

SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles 
WHERE title_abstract_tsvector @@ to_tsquery('cancer') 
ORDER BY rank LIMIT 50

where 'cancer' is the search term, it takes 25-30 seconds. However, when I change the ORDER BY from rank to id, like below:

SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank 
FROM test_search_articles 
WHERE title_abstract_tsvector @@ to_tsquery('cancer') 
ORDER BY id LIMIT 50

the query takes <1 sec. I'm confused why changing the ORDER BY accounts for such a huge change in query speed, especially given that rank is returned in both. Could anyone help me understand this and what to do to make the original query faster? Not sure if it's relevant, but I'm currently using a GIN index on my tsvector column (title_abstract_tsvector).

EDIT: Running either query without the LIMIT takes 25-30 seconds, which answers my question about why ORDER BY id matters. As for how to speed up the first query, I'm still looking for a solution.

EDIT 2: Create Index statements

CREATE UNIQUE INDEX test_search_articles_pkey ON public.test_search_articles USING btree (id)

CREATE INDEX article_idx ON public.test_search_articles USING gin (title_abstract_tsvector)

Execution Plan

"Gather  (cost=1679.97..177072.34 rows=71706 width=103) (actual time=43.963..28084.129 rows=72111 loops=1)"
"  Workers Planned: 2"
"  Workers Launched: 2"
"  Buffers: shared hit=194957 read=97049"
"  I/O Timings: read=80499.893"
"  ->  Parallel Bitmap Heap Scan on test_search_articles  (cost=679.97..168901.74 rows=29878 width=103) (actual time=15.580..28008.573 rows=24037 loops=3)"
"        Recheck Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
"        Heap Blocks: exact=16483"
"        Buffers: shared hit=194957 read=97049"
"        I/O Timings: read=80499.893"
"        ->  Bitmap Index Scan on article_idx  (cost=0.00..662.04 rows=71706 width=0) (actual time=27.719..27.720 rows=72111 loops=1)"
"              Index Cond: (title_abstract_tsvector @@ to_tsquery('cancer'::text))"
"              Buffers: shared hit=1 read=20"
"              I/O Timings: read=11.768"
"Planning Time: 12.145 ms"
"Execution Time: 28104.318 ms"

EDIT 3:

select pg_relation_size('test_search_articles'): 2176933888

select pg_table_size('test_search_articles'): 4283850752

Average pg_column_size of title_abstract_tsvector over the entire table: 1343.5673777677141794

Average pg_column_size of title_abstract_tsvector over rows matching the 'cancer' query: 1576.1418923603888450

EDIT 4: Vacuum output message:

INFO: vacuuming "public.test_search_articles"

INFO: "test_search_articles": found 0 removable, 1003125 nonremovable row versions in 265739 pages

DETAIL: 0 dead row versions cannot be removed yet.

CPU: user: 31.15 s, system: 10.38 s, elapsed: 126.14 s.

INFO: analyzing "public.test_search_articles"

INFO: "test_search_articles": scanned 30000 of 161999 pages, containing 185588 live rows and 0 dead rows; 30000 rows in sample, 1002169 estimated total rows

VACUUM

Essentially all your time is spent on IO. So the main thing you can do is get faster IO, or more RAM so you can cache more of the data.

The fact that your buffers read is 6 times greater than your exact heap blocks suggests that your title_abstract_tsvector is so large that it has been TOASTed and now needs to be reassembled from multiple pages in order to be used in the computation of the rank function. Is that plausible? How large is that column on average?
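One way to check the TOAST hypothesis is to compare the size of the main heap with the size of the table's TOAST relation, alongside the average stored size of the column. The following is a minimal sketch that uses only the table and column names from the question. Note that pg_column_size reports the stored (possibly compressed) size in bytes, so it may understate how much data has to be read and decompressed.

```sql
-- Average stored size (bytes) of the tsvector column across the table.
SELECT avg(pg_column_size(title_abstract_tsvector)) AS avg_tsvector_bytes
FROM test_search_articles;

-- Size of the main heap versus the table's TOAST relation.
-- toast_bytes is NULL if the table has no TOAST relation.
SELECT pg_relation_size(c.oid)           AS heap_bytes,
       pg_relation_size(c.reltoastrelid) AS toast_bytes
FROM pg_class c
WHERE c.oid = 'public.test_search_articles'::regclass;
```

If toast_bytes is a large share of the total, many values are stored out of line, which would be consistent with the extra buffer reads in the plan.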

Are you already saturating your disk capacity? If not, you could try to get a larger degree of parallelization by, for example, increasing max_parallel_workers_per_gather.
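As a rough sketch, the setting can be tried for the current session only and the plan compared before and after. The value 4 is purely illustrative, not a recommendation. The number of workers actually launched is also capped by max_parallel_workers and max_worker_processes, and the planner may still choose fewer.

```sql
-- Session-level change only; 4 is an illustrative value.
SET max_parallel_workers_per_gather = 4;

-- Re-run the query from the question and compare "Workers Launched",
-- "I/O Timings" and "Execution Time" with the earlier plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT title, ts_rank_cd(title_abstract_tsvector, to_tsquery('english','cancer'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('cancer')
ORDER BY rank LIMIT 50;

-- Return to the server default afterwards.
RESET max_parallel_workers_per_gather;
```

If the disk is already saturated, more workers will mostly just contend for the same IO, so the timings may not improve.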

But the main thing you can do is just not run that query very much. Either don't let users run such a non-specific query, or precompute and store the results of single-term queries so they can be returned without recomputing them.
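Precomputing could look something like the sketch below. The table search_cache and its columns are hypothetical names introduced here for illustration. The bigint type for id is an assumption, and so is the choice to keep the 50 highest-ranked rows. How and when to refresh the cache is left open.

```sql
-- Hypothetical cache table: one row per (term, article) for the top results.
CREATE TABLE search_cache (
    term       text   NOT NULL,
    article_id bigint NOT NULL,  -- assumes test_search_articles.id fits in bigint
    rank       real   NOT NULL,  -- ts_rank_cd returns real (float4)
    PRIMARY KEY (term, article_id)
);

-- Pay the expensive ranking cost once, offline, for a common single term.
INSERT INTO search_cache (term, article_id, rank)
SELECT 'cancer',
       id,
       ts_rank_cd(title_abstract_tsvector, to_tsquery('english', 'cancer'), 4) AS rank
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer')
ORDER BY rank DESC
LIMIT 50;

-- At search time, serve the stored results without recomputing ranks.
SELECT a.title, c.rank
FROM search_cache c
JOIN test_search_articles a ON a.id = c.article_id
WHERE c.term = 'cancer'
ORDER BY c.rank DESC;
```

The lookup touches at most 50 rows through the primary keys, so it avoids reading the large tsvector values at query time.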

Quite often the slow search query is useless, which means optimizing it is a waste of time.

If there are 72111 records matching "cancer" (the plan shows 24037 per process across 3 processes), then searching on this single term will never return relevant results to the user. Therefore it is pointless to sort by relevance. What could be more useful is a set of heuristics. For example, if the user enters only one search term, display the most recent articles about this term (fast) and maybe offer a list of keywords often related to this term. Then, switch to "ORDER BY rank" only when the user enters enough search keywords to produce a meaningful rank. This way you can implement a search that is not just faster, but also more useful.
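A sketch of that heuristic, with the branching done in application code, might use two queries like these. It assumes that a larger id means a more recent article, since the question does not show a date column. The extra terms 'immunotherapy' and 'melanoma' are made-up examples.

```sql
-- Single search term: skip ranking and show the newest matches.
-- Assumption: a larger id means a more recent article.
SELECT title
FROM test_search_articles
WHERE title_abstract_tsvector @@ to_tsquery('english', 'cancer')
ORDER BY id DESC
LIMIT 50;

-- Several search terms: the match set is usually much smaller,
-- so computing a rank for every match is more affordable.
SELECT title,
       ts_rank_cd(title_abstract_tsvector, q, 4) AS rank
FROM test_search_articles,
     to_tsquery('english', 'cancer & immunotherapy & melanoma') AS q
WHERE title_abstract_tsvector @@ q
ORDER BY rank DESC
LIMIT 50;
```

The first query has the same shape as the fast ORDER BY id query from the question, so it can stop after 50 matches. The second has to rank every match, but only for rows that contain all of the terms.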

Maybe you will say, "but if I type a single word into Google I get relevant results." And yes, of course, but that's because Google knows everything you do, it always has context, and if you don't enter extra search terms, it will add them for you.
