
Fastest way to subset - data.table vs. MySQL

Original post 2011-07-06 01:30:30 8 2 mysql/ r/ rmysql/ data.table

I'm an R user, and I frequently find that I need to write functions that require subsetting large datasets (10s of millions of rows). When I apply such functions over a large number of observations, it can get very time-consuming if I'm not careful about how I implement them.

To do this, I have sometimes used the data.table package, which is much faster than subsetting data frames. Recently, I've started experimenting with packages like RMySQL: pushing some tables to MySQL and using the package to run SQL queries and return the results.
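For readers unfamiliar with that workflow, here is a minimal sketch of what "push a table to MySQL, then query it" might look like through DBI and RMySQL. The connection details (`dbname`, `host`, `user`, `password`), the table name `obs`, the column `id`, and the helper name `subset_via_mysql` are all placeholders invented for illustration, and a running MySQL server is assumed.

```r
# Sketch only: connection settings, table name, and column name are
# hypothetical, and a reachable MySQL server is assumed.
subset_via_mysql <- function(df, target_id,
                             dbname = "test", host = "localhost",
                             user = "r_user", password = "") {
  con <- DBI::dbConnect(RMySQL::MySQL(), dbname = dbname, host = host,
                        user = user, password = password)
  # Close the connection however the function exits.
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Push the data frame to the server, then index the lookup column.
  DBI::dbWriteTable(con, "obs", df, overwrite = TRUE, row.names = FALSE)
  DBI::dbExecute(con, "CREATE INDEX idx_obs_id ON obs (id)")

  # Run the subset on the server and pull back only the matching rows.
  sql <- sprintf("SELECT * FROM obs WHERE id = %d", as.integer(target_id))
  DBI::dbGetQuery(con, sql)
}
```

In real use the upload and index creation would be done once rather than on every call, since both are expensive for large tables.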

I have found mixed performance improvements. For smaller datasets (millions of rows), it seems that loading the data into a data.table and setting the right keys makes for faster subsetting. For larger datasets (10s to 100s of millions of rows), it appears that sending a query to MySQL is faster.

I was wondering if anyone has any insight into which technique should return simple subsetting or aggregation queries faster, and whether this should depend on the size of the data. I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.
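To make the key/index analogy concrete, the following is a small sketch of a keyed subset and a grouped aggregation in data.table. The data are synthetic and the column names `id`, `grp`, and `val` are made up for the example; it is not the asker's actual code.

```r
library(data.table)

# Synthetic data; the column names are invented for this sketch.
set.seed(1)
n  <- 1e6
DT <- data.table(
  id  = sample(1e5, n, replace = TRUE),
  grp = sample(letters, n, replace = TRUE),
  val = runif(n)
)

# setkey() sorts the table by `id` in place and marks it as keyed,
# loosely comparable to building a clustered index in a database.
setkey(DT, id)

# Keyed subset: a binary search on the sorted key rather than a scan
# of every row.
rows_42 <- DT[.(42L)]

# Simple aggregation, comparable to
#   SELECT grp, AVG(val) FROM t GROUP BY grp
avg_by_grp <- DT[, .(mean_val = mean(val)), by = grp]
```

One difference from a typical database index is that the key physically reorders the table, so sorting is paid for once in `setkey()` and later lookups on that column benefit from it.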

2 answers

If the data fits in RAM, data.table is faster. If you provide an example, it will probably become evident quickly that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?

SQL has a lower bound because it is a row store. If the data fits in RAM (and with 64-bit that is quite a lot) then data.table is faster not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 for column operations). Use data.table correctly and it should be faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower results with data.table, then chances are very high that you're using data.table incorrectly (or there's a performance bug that we need to fix). So please post some tests after reading the data.table wiki.
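As an illustration of the kind of test being asked for, here is a rough sketch that times a plain data.frame vector scan against a keyed data.table lookup on the same synthetic data. The column names, sizes, and `target` value are assumptions made for the example, and the timings will vary with hardware and package version, so no expected numbers are given.

```r
library(data.table)

# Synthetic data; names and sizes are invented for this sketch.
set.seed(42)
n  <- 2e6
df <- data.frame(
  id  = sample(1e5, n, replace = TRUE),
  val = runif(n)
)
DT <- as.data.table(df)
setkey(DT, id)   # sort by id so lookups can use binary search

target <- 12345L

# Vector scan: compares every element of df$id with the target.
t_scan <- system.time(res_df <- df[df$id == target, ])

# Keyed lookup: binary search on the sorted key column;
# nomatch = NULL drops the row that would otherwise be NA-filled
# if the target were absent.
t_key <- system.time(res_dt <- DT[.(target), nomatch = NULL])

cat("vector scan :", t_scan[["elapsed"]], "s\n")
cat("keyed lookup:", t_key[["elapsed"]], "s\n")
```

A single lookup like this is often too quick to time reliably; repeating it over many targets, which is closer to the use case in the question, gives a more meaningful comparison.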

I am not an R user, but I know a little about databases. I believe that MySQL (or any other reputable RDBMS) will actually perform your subsetting operations faster (by, like, an order of magnitude, usually), barring any additional computation involved in the subsetting process.

I suspect your performance lag on small datasets is related to the expense of the connection and the initial push of the data to MySQL. There is likely a point at which the connection overhead and data transfer time add more to the cost of your operation than MySQL is saving you.

However, for datasets larger than a certain minimum, it seems likely that this cost is compensated for by the sheer speed of the database.

My understanding is that SQL can achieve most fetching and sorting operations much, much more quickly than iterative operations in code. But one must factor in the cost of the connection and (in this case) the initial transfer of data over the network.

I will be interested to hear what others have to say.