
Quickest and most efficient way to search large sorted text file

I have a large static text/csv file, which contains approx 100k rows (2MB). It's essentially a dictionary, and I need to perform regular lookups on this data in Python.

The format of the file is:

    key         value1       value2     
    alpha       x1           x2
    alpha beta  y1           y2
    gamma       z1           z2  
    ...
  • The keys can be multi-word strings.
  • The list is sorted in alphabetical order by the key
  • The values are strings

This is part of a web application where every user will be looking up 100-300 keys at a time, and will expect to get both value 1 and value 2 for each of those keys. There will be up to 100 users on the application each looking up those 100-300 keys over the same data.

I just need to return the first exact match. For example, if the user searched for the keys [alpha, gamma], I just need to return [('x1','x2'), ('z1','z2')], which represents the first exact match of 'alpha' and 'gamma'.

I've been reading about the options I have, and I'd really love your input on which of the following approaches is best for my use case.

  1. Read the file once into an ordered set, and perform the 200 or so lookups. However, for every user using the application (~100), the file will be loaded into memory.

  2. Read the file once into a list, and use binary search (e.g. bisect) to find each key — a minimal sketch of what I mean is shown after this list. Similar problem as 1): the file will be loaded into memory for every user who needs to do a search.

  3. Don't read the entire file into memory, and just read the file one line at a time. I can split the .csv into 26 files by each letter (a.csv, b.csv, ...) to speed this up a bit.

  4. Whoosh is a search library that caught my eye since it created an index once. However, I'm not sure if it's applicable for my use case at all as it looks like a full text search and I can't limit to just looking up the first column. If this specific library is not an option, is there any other way I can create a reusable index in Python to support these kinds of lookups?
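
For reference, here is a minimal sketch of what I mean by option 2. I'm assuming the file is tab-delimited with a single header row; the file name dictionary.tsv and the loading details are just placeholders.

import bisect
import csv

def load_table(path):
    # Read the whole file once; the rows are already sorted by key.
    keys, values = [], []
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)                          # skip the header row
        for key, value1, value2 in reader:
            keys.append(key)
            values.append((value1, value2))
    return keys, values

def lookup(keys, values, key):
    i = bisect.bisect_left(keys, key)         # binary search: O(log n)
    if i < len(keys) and keys[i] == key:      # first exact match only
        return values[i]
    return None

# keys, values = load_table('dictionary.tsv')
# lookup(keys, values, 'alpha')  # -> ('x1', 'x2')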

I'm really open to ideas and I'm in no way restricted to the four options above!

Thank you :)

How about something similar to approach #2? You could still read the file into memory, but instead of storing it in a list and using binary search to look up keys, you could store the rows in a hash map (a Python dict).

The benefit of doing this is that you take advantage of a hash map's average lookup time of O(1), with a worst case of O(n). The time complexity benefit and justification can be found here and here. Since you're only looking up keys, constant lookup time is a great fit for searching through the file. This method is also faster than binary search's average O(log n) lookup time.

You could store your file as

table = {
    key1: (value1, value2),
    key2: (value1, value2),
    key3: (value1, value2)
}

Note that this method is only viable if your keys are all distinct (no duplicates).
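
As a minimal sketch of how this could look in practice (assuming a tab-delimited file with a header row; the file name dictionary.tsv and the lru_cache-based loading are my own assumptions, not part of your setup):

import csv
from functools import lru_cache

@lru_cache(maxsize=1)
def load_table(path='dictionary.tsv'):
    # Read the file once per process and cache the resulting dict,
    # so all requests in that process share one in-memory copy.
    table = {}
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)                                  # skip the header row
        for key, value1, value2 in reader:
            table.setdefault(key, (value1, value2))   # keep the first match
    return table

def lookup_keys(wanted):
    table = load_table()
    return [table[k] for k in wanted if k in table]   # O(1) average per key

# lookup_keys(['alpha', 'gamma'])  # -> [('x1', 'x2'), ('z1', 'z2')]

Using setdefault keeps the first occurrence of a key, which matches the "first exact match" requirement even if a few duplicate keys do slip in.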
