
Memory, disk and databases fetched data

Let's say that I am going to extract a big dataset from a relational db. However, I do not want to fill more than 100MB of memory (this is an arbitrary limit). Also, I want to perform certain operations on this dataset.

Normally, in a language like Python, I would just put all the fetched data in memory, but I would like to avoid that. So I will probably have to introduce a middle step where I write the queried data to disk and then process it chunk by chunk.
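
For illustration, here is a minimal sketch of that chunk-by-chunk idea using a plain DB-API 2.0 cursor and fetchmany(); the database file, table name, and process_chunk() function are placeholders, and the same calls exist in other DB-API drivers (psycopg2, MySQLdb, ...), though some of them buffer the full result set client-side unless you ask for a server-side cursor:

```python
# Sketch: stream a large query result in fixed-size chunks via fetchmany().
# "example.db", "big_table" and process_chunk() are placeholders.
import sqlite3  # any DB-API 2.0 driver follows the same pattern

CHUNK_SIZE = 10_000  # tune so one chunk stays well under the memory budget

def process_chunk(rows):
    # Placeholder: aggregate, transform, or write the rows somewhere else.
    print(f"processing {len(rows)} rows")

conn = sqlite3.connect("example.db")          # swap in your real connection
cur = conn.cursor()
cur.execute("SELECT id, payload FROM big_table")

while True:
    rows = cur.fetchmany(CHUNK_SIZE)          # at most CHUNK_SIZE rows held at once
    if not rows:
        break
    process_chunk(rows)

cur.close()
conn.close()
```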

What would be the best way to handle this scenario?

Something like this happened to me recently. A database table without a unique index (it has one now) was getting the same data inserted over and over again, up to 30 times. The table was about 55 million rows.

I wrote a Python program to find one row and delete all duplicates. mysqldb crashed while trying to create the query, even before the fetchone call.
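
As an aside, a server-side cursor is the usual way to keep MySQLdb from buffering an entire result set on the client, which is roughly the failure mode described above. A rough sketch (the connection details, table, and columns are invented):

```python
# Sketch: stream rows from MySQL instead of buffering them all client-side.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()

# With the default cursor, execute() pulls the whole result set into client
# memory; SSCursor fetches rows from the server as you iterate.
cur.execute("SELECT id, col_a, col_b FROM big_table")

for row in cur:        # rows arrive one at a time
    pass               # inspect / collect duplicate keys here

cur.close()
conn.close()
```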

However, I was able to extract the data into a spreadsheet, filter it using Python's CSV library, and replace the data in the table. It was a mess.
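
The filtering step only takes a few lines of the csv module. A minimal sketch (not the original script), assuming the export lives in export.csv and the first two columns are the ones that should have been unique:

```python
# Sketch: keep the first occurrence of each key, drop later duplicates.
import csv

seen = set()
with open("export.csv", newline="") as src, \
     open("deduplicated.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        key = tuple(row[:2])      # columns that should have been unique
        if key in seen:
            continue              # skip the duplicate row
        seen.add(key)
        writer.writerow(row)
```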

It would be helpful to know the database brand/type in question and the platform you are using, but the platform is a little less important.

Edit:

As a rule, I have found that sometimes creating data to be batch loaded can be a lot faster than updating a table one row at a time. I proved this empirically today by cutting in some changes to calculate and print tax bills: instead of updating a table in a transaction block (one row at a time), the program prints a delimited "report" (data to be loaded into MySQL) and batch loads it after the bills have been calculated and printed. The speed increase was quite noticeable.
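
A rough sketch of that report-then-batch-load pattern, with an invented file name, table, and columns; LOAD DATA LOCAL INFILE is MySQL's bulk-load path:

```python
# Sketch: write a delimited "report", then bulk load it in one statement.
import csv
import MySQLdb

# Step 1: dump the computed results as a tab-delimited file (made-up data).
bills = [(1, "2020-01-31", 123.45), (2, "2020-01-31", 67.89)]
with open("/tmp/tax_bills.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(bills)

# Step 2: batch load the file instead of issuing one UPDATE per row.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/tax_bills.tsv'
    INTO TABLE tax_bills
    FIELDS TERMINATED BY '\\t'
    (account_id, bill_date, amount)
""")
conn.commit()
cur.close()
conn.close()
```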
