
Performance of Large Data Structures in Python

I'm looking for some help understanding the performance characteristics of large lists, dicts, or arrays in Python. I have about 1M key-value pairs that I need to store temporarily (this will grow to maybe 10M over the next year). The keys are database IDs ranging from 0 to about 1.1M (with some gaps) and the values are floats.

I'm calculating pagerank, so my process is to initialize each ID with a value of 1, then look it up in memory and update it about ten times before saving it back to the database.
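The process above can be sketched like this (scaled down to 100k IDs, and with a simple damping expression standing in as a placeholder for the real pagerank update):

```python
# Scaled-down sketch of the process described above (100k IDs instead of ~1M);
# the update rule here is a placeholder, not an actual pagerank step.
scores = {db_id: 1.0 for db_id in range(100_000)}  # initialize each ID to 1

for _ in range(10):                                  # roughly ten in-memory passes
    for db_id in scores:
        scores[db_id] = scores[db_id] * 0.85 + 0.15  # placeholder update

# scores would then be written back to the database
```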

  1. I'm theorizing that lists or arrays will be fastest if I use the database ID as the index of the array/list. This will create a gappy data structure, but I don't understand how fast lookups or updates would be. I also don't yet understand whether there's a big gain from using arrays instead of lists.

  2. Using a dict for this is very natural, with key-value pairs, but I get the impression that building the dict the first time would be very slow and memory-intensive as it grows to accommodate all the entries.

  3. I also read that SQLite might be a good solution for this using the :memory: flag, but I haven't dug into that too much yet.
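For reference, the `:memory:` approach from point 3 looks something like this with the standard-library `sqlite3` module (the table name and column layout here are just an illustration, and only a small sample of rows is inserted):

```python
import sqlite3

# Minimal sketch of an in-memory SQLite store for ID -> float pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ranks (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany("INSERT INTO ranks VALUES (?, ?)",
                 ((i, 1.0) for i in range(1000)))  # small sample, not 1M rows

# Update one value in place, then read it back.
conn.execute("UPDATE ranks SET score = score * 0.85 + 0.15 WHERE id = ?", (42,))
(score,) = conn.execute("SELECT score FROM ranks WHERE id = ?", (42,)).fetchone()
```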

Anyway, just looking for some guidance here. Any thoughts would be much appreciated as I'm digging in.

Just start with a dictionary. Even if you are running on WinXP, 10 million keys shouldn't be a problem. But I hope for your sake that you aren't :)

A dictionary will be easier to code and probably faster to build and update, especially if you are updating the values in random order.
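A quick sketch of what that looks like, scaled down to 100k IDs (the `+= 0.5` update is just an arbitrary stand-in):

```python
import random

# Build the dict in one pass, then update values in random order;
# dict updates cost the same regardless of key order.
ranks = dict.fromkeys(range(100_000), 1.0)

order = list(ranks)
random.shuffle(order)
for db_id in order:
    ranks[db_id] += 0.5  # arbitrary placeholder update
```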

It's often best to start by coding a prototype and use it to identify performance issues. Your bottleneck will most likely be wherever you are requesting the data from, not inserting it into or retrieving it from a dictionary.

Looking up data takes O(1) time in a dictionary thanks to the built-in hashing of keys. Of course, for a large amount of data there will be collisions that take linear time to resolve, but a dict with 10M items should work fine. Do not search for data in long lists, because that takes linear (O(n)) time.
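The difference is easy to see with a small timing experiment; searching a list for an element near the end scans the whole list, while the dict lookup hashes straight to it:

```python
from timeit import timeit

n = 100_000
as_list = list(range(n))
as_dict = dict.fromkeys(as_list, 1.0)
target = n - 1  # worst case for the linear list scan

t_list = timeit(lambda: target in as_list, number=100)  # O(n) membership test
t_dict = timeit(lambda: target in as_dict, number=100)  # O(1) hash lookup
# t_dict should come out orders of magnitude smaller than t_list
```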

However, consider using numpy, depending on what you plan to do with your data. If you only need to store and retrieve values, dicts are perfect, but calculations over tons of data can be greatly accelerated by numpy's vectorization instead of Python loops.
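For this particular problem, the gappy-ID layout maps naturally onto a dense numpy array indexed by database ID, where a whole update pass becomes one vectorized expression (again with a placeholder update rule):

```python
import numpy as np

# One dense float array indexed directly by database ID; the gaps simply
# hold the default value. Each update pass is a single array expression.
ranks = np.ones(1_100_000)       # IDs 0 .. ~1.1M, all initialized to 1.0

for _ in range(10):
    ranks = ranks * 0.85 + 0.15  # vectorized over the whole array at once
```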

SQL comes into the picture when you need to do more complicated queries (searching for multiple keys or defining conditions to match). For simple key-value pairs, SQL seems to be overkill.

Well, in general, if you have too much data to keep in memory, you need to use some kind of external storage; and if all your data does fit in memory, you don't need to do anything fancy.

The biggest problem you're likely to have is having more data than your operating system will allow in a single process image; in that case, again, you will need external storage.

In both cases this comes down to: use a database, whether SQL or not. If it's a SQL database, you might like to use an ORM to make that easier.

However, until you hit this problem, just store everything in memory and serialise to disk. I suggest using cPickle or an ORM + sqlite.
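The pickle route is only a few lines (in Python 3, `cPickle` is simply the standard `pickle` module; the file path below is just a throwaway temp file for illustration):

```python
import os
import pickle
import tempfile

# Keep everything in a plain dict in memory, serialise to disk with pickle.
ranks = {i: 1.0 for i in range(1000)}  # small sample, not the full ~1M

path = os.path.join(tempfile.mkdtemp(), "ranks.pkl")
with open(path, "wb") as f:
    pickle.dump(ranks, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

assert restored == ranks  # round-trips losslessly
```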
