
PostgreSQL performance difference between LIKE and regex

Could someone explain why there is such a big performance difference between these SQL statements?

SELECT count(*) as cnt FROM table WHERE name ~ '\*{3}'; -- Total runtime 12.000 - 18.000 ms
SELECT count(*) as cnt FROM table WHERE name ~ '\*\*\*'; -- Total runtime 12.000 - 18.000 ms
SELECT count(*) as cnt FROM table WHERE name LIKE '%***%'; -- Total runtime 5.000 - 7.000 ms

As you can see, the simple regular expressions take more than twice as long as the LIKE operator. (I thought LIKE was converted into a regular expression internally, so there shouldn't be any difference.)

The table has almost 13,000 rows, and the column "name" is of type "text". No indexes are defined on the "name" column.

EDIT:

EXPLAIN ANALYZE for each of them:

EXPLAIN ANALYZE SELECT count(*) as cnt FROM datos WHERE nombre ~ '\*{3}';

Aggregate  (cost=894.32..894.33 rows=1 width=0) (actual time=18.279..18.280 rows=1 loops=1)
  ->  Seq Scan on datos (cost=0.00..894.31 rows=1 width=0) (actual time=0.620..18.266 rows=25 loops=1)
        Filter: (nombre ~ '\*{3}'::text)
Total runtime: 18.327 ms

EXPLAIN ANALYZE SELECT count(*) as cnt FROM datos WHERE nombre ~ '\*\*\*';
Aggregate  (cost=894.32..894.33 rows=1 width=0) (actual time=17.404..17.405 rows=1 loops=1)
  ->  Seq Scan on datos  (cost=0.00..894.31 rows=1 width=0) (actual time=0.608..17.396 rows=25 loops=1)
        Filter: (nombre ~ '\*\*\*'::text)
Total runtime: 17.451 ms

EXPLAIN ANALYZE SELECT count(*) as cnt  FROM datos WHERE nombre LIKE '%***%';
Aggregate  (cost=894.32..894.33 rows=1 width=0) (actual time=4.258..4.258 rows=1 loops=1)
  ->  Seq Scan on datos  (cost=0.00..894.31 rows=1 width=0) (actual time=0.138..4.249 rows=25 loops=1)
        Filter: (nombre ~~ '%***%'::text)
Total runtime: 4.295 ms

The text LIKE text operator ( ~~ ) is implemented by specific C code in like_match.c. It's ad-hoc code that is completely independent from regular expressions. Judging by the comments, it is specially optimized to implement only % and _ as wildcards and to short-circuit to an exit whenever possible, whereas a regular expression engine is more complex by several orders of magnitude.
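To make the difference in pattern syntax concrete, here is a small sketch using literal strings rather than the table from the question. It assumes standard_conforming_strings is on (the default in current PostgreSQL releases), so the backslash in the regex literal is passed through unchanged.

```sql
-- LIKE has only two wildcards: % (any sequence of characters) and _ (any single character).
-- The asterisk is an ordinary character for LIKE, so it needs no escaping.
SELECT 'a***b' LIKE '%***%' AS like_match;    -- true

-- In a regular expression * is a quantifier, so a literal asterisk must be escaped.
SELECT 'a***b' ~ '\*{3}' AS regex_match;      -- true

-- Only two consecutive asterisks: the LIKE pattern does not match.
SELECT 'a**b' LIKE '%***%' AS like_match;     -- false
```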

Note that in your test case, just as the regexp is suboptimal compared to LIKE, LIKE is probably suboptimal compared to strpos(name, '***') > 0.

strpos is implemented with the Boyer–Moore–Horspool algorithm, which is optimized for large substrings with few partial matches in the searched text.
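To try this on the table from the question, the same count can be written with strpos and timed the same way as the other variants. This is only a sketch: it reuses the datos table and nombre column shown in the EXPLAIN ANALYZE output, and makes no claim about the timing you will see.

```sql
-- strpos returns the 1-based position of the substring, or 0 when it is absent,
-- so "> 0" means "nombre contains three consecutive asterisks".
EXPLAIN ANALYZE
SELECT count(*) AS cnt
FROM datos
WHERE strpos(nombre, '***') > 0;
```

With a search string as short as three characters, the gain over LIKE may be small, so it is worth measuring rather than assuming.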

Internally these functions are reasonably optimized, but when there are several ways to reach the same goal, choosing the one likely to be best is still the caller's job. PostgreSQL will not analyze the pattern for us and turn a regexp into a LIKE, or a LIKE into a strpos, based on that analysis.
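Since the rewrite is left to the caller, it can help to confirm that the candidate predicates really select the same rows before swapping one for another. A minimal sketch, assuming the datos table and nombre column from the question; the column aliases are made up for illustration.

```sql
-- Count the matches of each predicate in a single pass over the table.
-- CASE yields NULL for non-matching rows, and count() ignores NULLs.
SELECT
    count(CASE WHEN nombre ~ '\*{3}'           THEN 1 END) AS regex_cnt,
    count(CASE WHEN nombre LIKE '%***%'        THEN 1 END) AS like_cnt,
    count(CASE WHEN strpos(nombre, '***') > 0  THEN 1 END) AS strpos_cnt
FROM datos;
```

If the three counts agree, the cheaper predicate can be substituted with more confidence. Equal counts are a sanity check, not a proof that the patterns are equivalent.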

I am not sure whether I should post this as an answer... I made a rough comparison by doing something similar in PHP: filtering a huge array using a regex and a plain strpos (as a substitute for LIKE). The code:

// regex filter
$filteredRegex = array_filter($a,function($item){
    return preg_match('/000/',$item);
});
// substring search filter
$filteredStrpos = array_filter($a,function($item){
    return strpos($item,'000')!==FALSE;
});

Benchmarking this code shows that the regex filter takes about twice as long as the strpos filter, so I suppose the CPU cost of a regex is roughly double that of a simple substring search.

Looks like @zerkms was right :)
