
Why are SQL aggregate functions so much slower than Python and Java (or Poor Man's OLAP)

I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):

SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;

Is this normal behaviour when using a SQL database?

The schema (the table holds responses to a survey):

CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);

\copy tuples from '350,000 responses.csv' delimiter as ','

I wrote some tests in Java and Python for context and they crush SQL (except for pure Python):

java   1.5 threads ~ 7 ms    
java   1.5         ~ 10 ms    
python 2.5 numpy   ~ 18 ms  
python 2.5         ~ 370 ms

Even sqlite3 is competitive with Postgres, despite it assuming all columns are strings (for contrast: merely switching from integer to numeric columns in Postgres results in a 10x slowdown).

Tunings I've tried without success include (blindly following some web advice):

increased the shared memory available to Postgres to 256MB    
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL (a sketch of what this looked like follows below)
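
For reference, a minimal sketch of what that stored-procedure attempt looked like, assuming an SQL-language function with OUT parameters (the function name and exact signature here are illustrative, not the ones I actually used):

-- Hypothetical sketch of the CREATE FUNCTION ... LANGUAGE SQL tuning attempt;
-- the name tuple_stats and the OUT-parameter list are made up for illustration.
CREATE OR REPLACE FUNCTION tuple_stats(
    OUT n bigint, OUT avg_a numeric, OUT avg_b numeric,
    OUT avg_c numeric, OUT avg_d numeric) AS $$
    SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
$$ LANGUAGE SQL STABLE;

SELECT * FROM tuple_stats();

As noted above, wrapping the query this way made no measurable difference to the timing.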

So my question is, is my experience here normal, and this is what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.

Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop, but I'm not super excited about maintaining yet another server application, and I'm not sure they would even help.


No, the Python and Java code do all the work in-house, so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The Java threads timing uses 4 threads (one per array average); overkill, but it's definitely the fastest.

The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)

I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read-only data.

The Postgres query doesn't change timing on subsequent runs.

I've rerun the Python tests to include spooling the data off the disk. The timing slows down considerably, to nearly 4 secs. But I'm guessing that Python's file handling code is pretty much in C (though maybe not the csv lib?), so this indicates to me that Postgres isn't streaming from the disk either (or that you are correct and I should bow down before whoever wrote their storage layer!)

I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps:

  1. parse the SQL
  2. work up a query plan, i.e. decide on which indices to use (if any), optimize, etc.
  3. if an index is used, search it for the pointers to the actual data, then go to the appropriate location in the data, or
  4. if no index is used, scan the whole table to determine which rows are needed
  5. load the data from disk into a temporary location (hopefully, but not necessarily, memory)
  6. perform the count() and avg() calculations

So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.

To obtain more information about where Postgres spends its time, I would suggest the following tests:

  • Compare the execution time of your query to a SELECT without the aggregating functions (i.e. cut step 6, the aggregation); a sketch of how to do this follows the list
  • If you find that the aggregation leads to a significant slowdown, try whether Python does it faster, obtaining the raw data through the plain SELECT from the comparison.
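
A quick way to run that comparison server-side, assuming EXPLAIN ANALYZE timings are acceptable (it adds a little instrumentation overhead of its own, but keeps client-side serialization out of the numbers):

EXPLAIN ANALYZE SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
EXPLAIN ANALYZE SELECT id, a, b, c, d FROM tuples;  -- same scan, no aggregates

The difference between the two reported total runtimes gives a rough upper bound on what the aggregation itself costs on top of the sequential scan.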

To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.

There are several ways to do that:

  • Cache data (in memory!) for subsequent access, either via the db engine's own capabilities or with tools like memcached
  • Reduce the size of your stored data (a small sketch of this follows the list)
  • Optimize the use of indices. Sometimes this can mean skipping index use altogether (after all, it's disk access, too). For MySQL, I seem to remember that it's recommended to skip indices if you expect the query to fetch more than 10% of all the data in the table.
  • If your query makes good use of indices, I know that for MySQL databases it helps to put indices and data on separate physical disks. However, I don't know whether that's applicable for Postgres.
  • There also might be more sophisticated problems such as swapping rows to disk if for some reason the result set can't be completely processed in memory. But I would leave that kind of research until I run into serious performance problems that I can't find another way to fix, as it requires knowledge about a lot of little under-the-hood details in your process.
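
As one illustration of the "reduce the size of your stored data" point above, and assuming the survey answers fit into 16-bit integers (an assumption about your data, not something stated in the question):

-- smallint halves the width of the four answer columns; the per-row header
-- overhead stays the same, so the saving is modest, but there is less to read.
CREATE TABLE tuples_small (
    id integer PRIMARY KEY,
    a smallint, b smallint, c smallint, d smallint
);
INSERT INTO tuples_small SELECT id, a, b, c, d FROM tuples;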

Update:

I just realized that you seem to have no use for indices for the above query and most likely aren't using any, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem, but disk access is. I'll leave the index stuff in anyway; it might still have some use.

Postgres is doing a lot more than it looks like (maintaining data consistency for a start!)

If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.

(Note, I have not used materialized views in Postgres, they look a little hacky, but might suit your situation).

Materialized Views
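
Postgres 8.3 has no built-in materialized views, so the usual workaround is a hand-maintained summary table refreshed by a function run from cron or after data loads. A minimal sketch, with table and function names of my own choosing:

-- One-row summary table built from the live data.
CREATE TABLE tuples_summary AS
    SELECT count(id) AS n, avg(a) AS avg_a, avg(b) AS avg_b,
           avg(c) AS avg_c, avg(d) AS avg_d
    FROM tuples;

-- Re-run this whenever the base table changes (or on a schedule).
CREATE OR REPLACE FUNCTION refresh_tuples_summary() RETURNS void AS $$
    DELETE FROM tuples_summary;
    INSERT INTO tuples_summary
        SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
$$ LANGUAGE SQL;

Reading from tuples_summary is then a single-row fetch, at the price of the result only being as fresh as the last refresh.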

Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.

I'd consider 200 ms for something like this to be pretty good. A quick test on my Oracle server, with the same table structure, about 500k rows and no indexes, takes about 1-1.5 seconds, which is almost all just Oracle sucking the data off disk.

The real question is, is 200ms fast enough?

-------------- More --------------------

I was interested in solving this using materialized views, since I've never really played with them. This is in Oracle.

First I created an MV which refreshes every minute.

create materialized view mv_so_x 
build immediate 
refresh complete 
START WITH SYSDATE NEXT SYSDATE + 1/24/60
 as select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;

While it's refreshing, no rows are returned

SQL> select * from mv_so_x;

no rows selected

Elapsed: 00:00:00.00

Once it refreshes, it's MUCH faster than doing the raw query

SQL> select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:05.74
SQL> select * from mv_so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:00.00
SQL> 

If we insert into the base table, the result is not immediately visible via the MV.

SQL> insert into so_x values (1,2,3,4,5);

1 row created.

Elapsed: 00:00:00.00
SQL> commit;

Commit complete.

Elapsed: 00:00:00.00
SQL> select * from mv_so_x;

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899459 7495.38839 22.2905454 5.00276131 2.13432836

Elapsed: 00:00:00.00
SQL> 

But wait a minute or so, and the MV will update behind the scenes, and the result is returned as fast as you could want.

SQL> /

  COUNT(*)     AVG(A)     AVG(B)     AVG(C)     AVG(D)
---------- ---------- ---------- ---------- ----------
   1899460 7495.35823 22.2905352 5.00276078 2.17647059

Elapsed: 00:00:00.00
SQL> 

This isn't ideal. For a start, it's not realtime; inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (this can be tuned to whatever time frame, or run on demand). But this does show how much faster an MV can make things seem to the end user, if you can live with values that aren't quite up-to-the-second accurate.

I retested with MySQL specifying ENGINE = MEMORY and it doesn't change a thing (still 200 ms). Sqlite3 using an in-memory db gives similar timings as well (250 ms).
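
For what it's worth, the in-memory MySQL variant was along these lines (the copy-table name is illustrative; the point is simply that a MEMORY table keeps the data in RAM and rules out disk I/O for the query itself):

-- Copy the data into a MEMORY engine table, then aggregate from that.
CREATE TABLE tuples_mem ENGINE = MEMORY
    AS SELECT id, a, b, c, d FROM tuples;
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples_mem;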

The math here looks correct (at least the size, as that's how big the sqlite db is :-)

I'm just not buying the disk-causes-slowness argument as there is every indication the tables are in memory (the postgres guys all warn against trying too hard to pin tables to memory as they swear the OS will do it better than the programmer)

To clarify the timings, the Java code is not reading from disk, which makes it a totally unfair comparison if Postgres is reading from the disk and calculating a complicated query. But that's really beside the point; the DB should be smart enough to bring a small table into memory and precompile a stored procedure, IMHO.

UPDATE (in response to the first comment below):

I'm not sure how I'd test the query without using an aggregation function in a way that would be fair, since if I select all of the rows it'll spend tons of time serializing and formatting everything. I'm not saying that the slowness is due to the aggregation function; it could still just be overhead from concurrency, integrity, and friends. I just don't know how to isolate the aggregation as the sole independent variable.

Those are very detailed answers, but they mostly beg the question: how do I get these benefits without leaving Postgres, given that the data easily fits into memory, requires concurrent reads but no writes, and is queried with the same query over and over again?

Is it possible to precompile the query and optimization plan? I would have thought the stored procedure would do this, but it doesn't really help.

To avoid disk access it's necessary to cache the whole table in memory; can I force Postgres to do that? I think it's already doing this though, since the query executes in just 200 ms after repeated runs.

Can I tell Postgres that the table is read only, so it can optimize any locking code?

I think it's possible to estimate the query construction costs with an empty table (timings range from 20-60 ms)

I still can't see why the Java/Python tests are invalid. Postgres just isn't doing that much more work (though I still haven't addressed the concurrency aspect, just the caching and query construction)

UPDATE: I don't think it's fair to compare the SELECTs as suggested, by pulling 350,000 rows through the driver and serialization steps into Python to run the aggregation, nor even to omit the aggregation, since the overhead of formatting and displaying is hard to separate from the timing. If both engines are operating on in-memory data it should be an apples-to-apples comparison; I'm just not sure how to guarantee that's already happening.

I can't figure out how to add comments; maybe I don't have enough reputation?

I'm an MS-SQL guy myself, and we'd use DBCC PINTABLE to keep a table cached, and SET STATISTICS IO to see that it's reading from cache and not disk.

I can't find anything on Postgres to mimic PINTABLE, but pg_buffercache seems to give details on what is in the cache - you may want to check that, and see if your table is actually being cached.
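
A rough check along those lines, assuming the pg_buffercache contrib module is installed (this is a trimmed-down version of the kind of query its documentation shows):

-- Count how many 8 kB shared buffers currently hold blocks of the table.
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE c.relname = 'tuples'
GROUP BY c.relname;

Note that this only reflects Postgres's own shared buffers; the table may additionally sit in the OS page cache, which this view cannot see.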

A quick back-of-the-envelope calculation makes me suspect that you're paging from disk. Assuming Postgres uses 4-byte integers, you have (6 * 4) bytes per row, so your table is a minimum of (24 * 350,000) bytes ~ 8.4 MB. Assuming 40 MB/s sustained throughput on your HDD, you're looking at right around 200 ms to read the data (which, as pointed out, should be where almost all of the time is being spent).

Unless I screwed up my math somewhere, I don't see how it's possible that you are able to read 8MB into your Java app and process it in the times you're showing - unless that file is already cached by either the drive or your OS.

I don't think that your results are all that surprising -- if anything it is that Postgres is so fast.

Does the Postgres query run faster a second time once it has had a chance to cache the data? To be a little fairer, your test for Java and Python should cover the cost of acquiring the data in the first place (ideally loading it off disk).

If this performance level is a problem for your application in practice, but you need an RDBMS for other reasons, then you could look at memcached. You would then have faster cached access to raw data and could do the calculations in code.

One other thing that an RDBMS generally does for you is to provide concurrency by protecting you from simultaneous access by another process. This is done by placing locks, and there's some overhead from that.

If you're dealing with entirely static data that never changes, and especially if you're in a basically "single user" scenario, then using a relational database doesn't necessarily gain you much benefit.

Are you using TCP to access Postgres? In that case Nagle is messing with your timing.

You need to increase Postgres's caches to the point where the whole working set fits into memory before you can expect to see performance comparable to an in-memory program.

Thanks for the Oracle timings, that's the kind of stuff I'm looking for (disappointing though :-)

Materialized views are probably worth considering as I think I can precompute the most interesting forms of this query for most users.

I don't think query round-trip time should be very high, as I'm running the queries on the same machine that runs Postgres, so it can't add much latency?

I've also done some checking into the cache sizes, and it seems Postgres relies on the OS to handle caching. They specifically mention BSD as the ideal OS for this, so I think Mac OS ought to be pretty smart about bringing the table into memory. Unless someone has more specific params in mind, I think more specific caching is out of my control.
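
If it helps, the two settings usually pointed at here can at least be inspected from any session (just to confirm what the server thinks it has to work with, not a tuning recommendation):

SHOW shared_buffers;        -- Postgres's own buffer cache
SHOW effective_cache_size;  -- the planner's assumption about OS + Postgres caching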

In the end I can probably put up with 200 ms response times, but knowing that 7 ms is a possible target makes me feel unsatisfied, as even 20-50 ms would enable more users to have more up-to-date queries and get rid of a lot of caching and precomputing hacks.

I just checked the timings using MySQL 5 and they are slightly worse than Postgres. So barring some major caching breakthroughs, I guess this is what I can expect going the relational db route.

I wish I could upvote some of your answers, but I don't have enough points yet.
