
Fastest Way to search in Gigabytes of data?

I have a csv file of size 8-12 GB, and I would like to be able to search the first column of the file and retrieve the whole row if there is a match. I would like to do the search for a set of more than 100K keys each time and retrieve the corresponding records for them.

There are a few approaches I could choose from:

1) use a simple grep for each key in the file ==> 100K grep commands

2) make an SQL-based database and index the first column, then either: a) search for each key with a separate SELECT query, or b) make a temporary table, insert all the keys into it, and do a set-membership join (a sketch of this variant appears after the list).

3) build a hash table, such as a Python dictionary, and look up each key in it. But then I need to load it into memory every time I run a batch of queries (I don't want it to occupy memory permanently).
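A minimal sqlite3 sketch of option 2b, for illustration only: the file and database names (foo.csv, data.db), the table and column names (records, wanted, key, line) and the keys set are all placeholders, and the file is assumed to be ';'-separated with the key in the first column.

import csv
import sqlite3

keys = {'key1', 'key2'}  # hypothetical set of ~100K keys to look up

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (key TEXT, line TEXT)')

# One-time load of the CSV; storing the whole line makes it easy to return full rows.
with open('foo.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    conn.executemany(
        'INSERT INTO records (key, line) VALUES (?, ?)',
        ((row[0], ';'.join(row)) for row in reader),
    )
# Index after the bulk insert, which is usually faster than indexing first.
conn.execute('CREATE INDEX IF NOT EXISTS idx_records_key ON records (key)')
conn.commit()

# Variant 2b: put the keys in a temporary table and retrieve all matches in one join.
conn.execute('CREATE TEMP TABLE wanted (key TEXT PRIMARY KEY)')
conn.executemany('INSERT INTO wanted (key) VALUES (?)', ((k,) for k in keys))
rows = conn.execute(
    'SELECT r.line FROM records r JOIN wanted w ON r.key = w.key'
).fetchall()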

I'm not sure which method is more efficient. Are there better options that I'm not aware of?

You can read the csv in chunks and iterate over them using pandas. Perhaps this solution can work for you: How to read a 6 GB csv file with pandas
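A minimal sketch of this chunked approach, assuming a ';'-separated file with no header row; the file name foo.csv, the chunk size, and the keys set are placeholders. With chunksize, read_csv returns an iterator of DataFrames, so only one chunk is in memory at a time.

import pandas as pd

keys = {'key1', 'key2'}  # hypothetical keys to look up in the first column

matches = []
for chunk in pd.read_csv('foo.csv', sep=';', header=None, dtype=str, chunksize=1_000_000):
    # Keep only the rows whose first column is one of the wanted keys.
    matches.append(chunk[chunk[0].isin(keys)])

result = pd.concat(matches, ignore_index=True)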

The fastest solution (if you have plenty of RAM) would be to just mmap the whole file.
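A minimal sketch of the mmap approach, assuming a 64-bit Python so the whole file fits in the address space and a ';'-separated file; note that an mmap yields bytes, so the (placeholder) keys are bytes as well.

import mmap

keys = {b'key1', b'key2'}  # hypothetical keys, as bytes because mmap returns bytes
sep = b';'

with open('foo.csv', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b''):
            first = line.split(sep, 1)[0]  # the first column
            if first in keys:
                pass  # do something with the matching line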

What would certainly work is to read the file one line at a time:

# keys is an iterable of keys; make it a set so membership tests stay fast with ~100K keys.
sep = ';'  # The separator used in the CSV.
with open('foo.csv') as f:
    for line in f:
        to = line.find(sep)        # position of the first separator
        if line[:to] in keys:      # compare the first column against the keys
            pass  # do something with the matching line
