
Way to compute large lists

Suppose I need to count collisions for various hashing schemes. The number of elements in the input sequence is 10^8 or more. I don't know how to compute this analytically, so I'm using brute force.

Obviously I can't keep this list of hashes in RAM. What is the best way to write code for my needs? Should I dump it into a database or something? Which libraries would you recommend?

Thank you!

I'd suggest keeping a set of files, each one named with a prefix of the hashes contained within it (for example, if you use a prefix length of 6, then the file named ffa23b.txt might contain the hashes ffa23b11d4334, ffa23b712f3, et cetera). Each time you read a hash, you append it to the file whose name matches the first N characters of the hash. A sketch of this bucketing idea follows.
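A minimal sketch of the approach (the prefix length, directory name and helper functions below are illustrative assumptions, not part of the original answer):

import os

PREFIX_LEN = 6               # hypothetical prefix length, matching the example above
OUT_DIR = 'hash_buckets'     # hypothetical directory holding one file per prefix

if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)

def append_hash(h):
    # The hash 'ffa23b11d4334' ends up in 'hash_buckets/ffa23b.txt'.
    path = os.path.join(OUT_DIR, h[:PREFIX_LEN] + '.txt')
    with open(path, 'a') as f:
        f.write(h + '\n')

def count_collisions_in_bucket(path):
    # Each bucket file is small enough to load into memory, so duplicates
    # can be counted with an ordinary dictionary.
    counts = {}
    with open(path) as f:
        for line in f:
            h = line.strip()
            counts[h] = counts.get(h, 0) + 1
    return sum(c - 1 for c in counts.values() if c > 1)

Each bucket can then be processed independently, so the full set of hashes never has to fit in RAM at once.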

You can also use Bloom filters to quickly rule out a large fraction of the hashes as unique, without having to store all of the hashes in memory. That way, you only have to fall back to searching through a given prefix file when the check against the Bloom filter says you might have actually seen the hash before, which will happen rarely.
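A hedged sketch of a tiny Bloom filter built only from the standard library (the bit-array size and number of hash functions are illustrative tuning parameters, not values from the original answer):

import hashlib

class BloomFilter(object):
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from independent digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(('%d:%s' % (i, item)).encode('utf-8')).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely never seen before"; True means "possibly seen",
        # and only then do you search the corresponding prefix file.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

Dedicated libraries such as pybloom exist as well, but a hand-rolled filter like this avoids an extra dependency.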

Short answer: if you have some gigabytes of RAM, use Python dictionaries; it's the easiest way to implement (and probably the fastest to run). You can do something like this:

>>> mydict = {}
>>> for i in some_iterator:
        mydict[i] = ''

And then check whether a key exists in the mapping:

>>> 0 in mydict
True

>>> 123456789 in mydict
False

Long answer: you can use a persistent key-value store, like GDBM (it looks like Berkeley DB) or another kind of database, but this approach will be way slower than using just Python dictionaries; on the other hand, with this approach you'll have persistence (if you need it).

You can understand GDBM as a dictionary (key-value store) that is persisted in a single file. You can use it as follows:

>>> import gdbm
>>> kv = gdbm.open('my.db', 'cf')

Then the file my.db will be created (see the Python GDBM documentation to understand what cf means).

But it has some limitations, such as supporting only strings as keys and values:

>>> kv[0] = 0
Traceback (most recent call last):
[...]
TypeError: gdbm mappings have string indices only

>>> kv['0'] = 0
Traceback (most recent call last):
[...]
TypeError: gdbm mappings have string elements only

>>> kv['0'] = '0'

You can populate a gdbm database with all your keys, each mapped to a dummy value:

>>> for i in some_iterator:
        kv[str(i)] = ''

And then check whether a key exists in the mapping:

>>> '0' in kv
True

>>> '123456789' in kv
False
