
Quickest and most efficient way to search large sorted text file

I have a large static text/csv file, which contains approx 100k rows (2MB). It's essentially a dictionary, and I need to perform regular lookups on this data in Python.

The format of the file is:

    key         value1       value2     
    alpha       x1           x2
    alpha beta  y1           y2
    gamma       z1           z2  
    ...
  • The keys can be multi-word strings.
  • The list is sorted in alphabetical order by the key.
  • The values are strings.

This is part of a web application where every user will be looking up 100-300 keys at a time, and will expect to get both value1 and value2 for each of those keys. There will be up to 100 users on the application, each looking up those 100-300 keys over the same data.

I just need to return the first exact match. For example, if the user searched for the keys [alpha, gamma], I just need to return [('x1','x2'), ('z1','z2')], which represents the first exact match of 'alpha' and 'gamma'.

I've been reading about the options I have, and I'd really love your input on which of the following approaches is best for my use case.

  1. Read the file once into an ordered set, and perform the 200 or so lookups. However, for every user using the application (~100), the file will be loaded into memory.

  2. Read the file once into a list, and use binary search (eg bisect). Similar problem as 1): the file will be loaded into memory for every user who needs to do a search. (See the bisect sketch after this list.)

  3. Don't read the entire file into memory, and just read the file one line at a time. I can split the .csv into 26 files by each letter (a.csv, b.csv, ...) to speed this up a bit.

  4. Whoosh is a search library that caught my eye since it creates an index once. However, I'm not sure if it's applicable for my use case at all, as it looks like a full text search and I can't limit it to just looking up the first column. If this specific library is not an option, is there any other way I can create a reusable index in Python to support these kinds of lookups?
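
For reference, here is a minimal sketch of what option 2 could look like, assuming whitespace-delimited columns and a hypothetical file name dictionary.txt; bisect_left returns the leftmost position, so it finds the first exact match:

    import bisect

    # Load the sorted file once into parallel lists of keys and value pairs.
    keys, values = [], []
    with open("dictionary.txt") as f:          # hypothetical file name
        next(f)                                # skip the header row
        for line in f:
            *key_words, value1, value2 = line.split()
            keys.append(" ".join(key_words))   # keys may be multi-word
            values.append((value1, value2))

    def lookup(key):
        """Return (value1, value2) for the first exact match, or None."""
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        return None

    print([lookup(k) for k in ["alpha", "gamma"]])  # [('x1', 'x2'), ('z1', 'z2')]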

I'm really open to ideas and I'm in no way restricted to the four options above!

Thank you :)

How about something similar to approach #2? You could still read the file into memory, but instead of storing it in a list and using binary search to look up keys, you could store the file in a hash map.

The benefit of doing this is to take advantage of a hash map's average lookup time of O(1), with a worst case of O(n). The time complexity benefit and justification can be found here and here. Since you're only looking up keys, having constant lookup time would be a great way to search through the file. This method would also be faster than binary search's average O(log n) search time.

You could store your file as:

    table = {
        key1: (value1, value2),
        key2: (value1, value2),
        key3: (value1, value2)
    }

Note this method is only viable if your keys are all distinct, with no duplicate keys.
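
A minimal sketch of this approach, under the same assumptions as the sketch in the question (whitespace-delimited columns, hypothetical file name dictionary.txt). The dict is built once, so keeping it in a module-level variable lets a single server process reuse it for every user's lookups; setdefault keeps only the first occurrence of a key, which matches the "first exact match" requirement:

    # Build the lookup table once (e.g. at application start-up) and reuse it.
    table = {}
    with open("dictionary.txt") as f:            # hypothetical file name
        next(f)                                  # skip the header row
        for line in f:
            *key_words, value1, value2 = line.split()
            key = " ".join(key_words)            # keys may be multi-word
            table.setdefault(key, (value1, value2))  # keep the first match only

    def lookup_all(keys):
        """Return (value1, value2) pairs for every requested key that exists."""
        return [table[k] for k in keys if k in table]

    print(lookup_all(["alpha", "gamma"]))        # [('x1', 'x2'), ('z1', 'z2')]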

