
Memory, disk and databases fetched data

Let's say that I am going to extract a big dataset from a relational db. However, I do not want to fill more than 100MB of memory (this is an arbitrary limit). Also, I want to perform certain operations on this dataset.

Normally, in a language like Python, I would just put all the fetched data in memory, but I would like to avoid that. So I will probably have to introduce a middle step where I write the queried data to disk and then process it chunk by chunk.
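
For illustration, here is a minimal sketch of that chunk-by-chunk idea using a plain DB-API 2.0 cursor and fetchmany(); the database file, table name, and process_chunk() function are placeholders, and the same calls exist in other DB-API drivers (psycopg2, MySQLdb, ...), though some of them buffer the full result set client-side unless you ask for a server-side cursor:

```python
# Sketch: stream a large query result in fixed-size chunks via fetchmany().
# "example.db", "big_table" and process_chunk() are placeholders.
import sqlite3  # any DB-API 2.0 driver follows the same pattern

CHUNK_SIZE = 10_000  # tune so one chunk stays well under the memory budget

def process_chunk(rows):
    # Placeholder: aggregate, transform, or write the rows somewhere else.
    print(f"processing {len(rows)} rows")

conn = sqlite3.connect("example.db")          # swap in your real connection
cur = conn.cursor()
cur.execute("SELECT id, payload FROM big_table")

while True:
    rows = cur.fetchmany(CHUNK_SIZE)          # at most CHUNK_SIZE rows held at once
    if not rows:
        break
    process_chunk(rows)

cur.close()
conn.close()
```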

What would be the best way to handle this scenario?

Something like this happened to me recently. A database table without a unique index (it has one now) was getting the same data inserted over and over again, up to 30 times. The table was about 55 million rows.

I wrote a Python program to find one row and delete all duplicates. mysqldb crashed while trying to create the query, even before the fetchone call.
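
As an aside, a server-side cursor is the usual way to keep MySQLdb from buffering an entire result set on the client, which is roughly the failure mode described above. A rough sketch (the connection details, table, and columns are invented):

```python
# Sketch: stream rows from MySQL instead of buffering them all client-side.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()

# With the default cursor, execute() pulls the whole result set into client
# memory; SSCursor fetches rows from the server as you iterate.
cur.execute("SELECT id, col_a, col_b FROM big_table")

for row in cur:        # rows arrive one at a time
    pass               # inspect / collect duplicate keys here

cur.close()
conn.close()
```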

However, I was able to extract the data into a spreadsheet, filter it using Python's CSV library, and replace the data in the table. It was a mess.
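
The filtering step only takes a few lines of the csv module. A minimal sketch (not the original script), assuming the export lives in export.csv and the first two columns are the ones that should have been unique:

```python
# Sketch: keep the first occurrence of each key, drop later duplicates.
import csv

seen = set()
with open("export.csv", newline="") as src, \
     open("deduplicated.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        key = tuple(row[:2])      # columns that should have been unique
        if key in seen:
            continue              # skip the duplicate row
        seen.add(key)
        writer.writerow(row)
```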

It would be helpful to know the database brand/type in question and the platform you are using, but the platform is a little less important.

Edit:

As a rule, I have found that sometimes creating data to be batch loaded can be a lot faster than updating a table one row at a time. I proved this empirically today by cutting in some changes to calculate and print tax bills: instead of updating a table in a transaction block (one row at a time), the program prints a delimited "report" (data to be loaded into MySQL) and batch loads it after the bills have been calculated and printed. The speed increase was quite noticeable.
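
A rough sketch of that report-then-batch-load pattern, with an invented file name, table, and columns; LOAD DATA LOCAL INFILE is MySQL's bulk-load path:

```python
# Sketch: write a delimited "report", then bulk load it in one statement.
import csv
import MySQLdb

# Step 1: dump the computed results as a tab-delimited file (made-up data).
bills = [(1, "2020-01-31", 123.45), (2, "2020-01-31", 67.89)]
with open("/tmp/tax_bills.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(bills)

# Step 2: batch load the file instead of issuing one UPDATE per row.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/tax_bills.tsv'
    INTO TABLE tax_bills
    FIELDS TERMINATED BY '\\t'
    (account_id, bill_date, amount)
""")
conn.commit()
cur.close()
conn.close()
```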
