Efficiently get all lines starting with given string for a large text file

Question

I have a large text file with around 700k lines.

For a given string, I would like to efficiently find all lines in the file that start with that string. I will be querying it repeatedly, so each query should be fast; I am not too concerned about a longer initial setup time.

I'm guessing that I could do this more efficiently by transforming the file so that the lines are already in alphabetical order? If so what's a good way to do this? Or is there a different data structure I could consider?

Once the data has been prepared, what is an efficient way to search?

I would be comfortable doing something basic with regular expressions, or reading line by line and testing the start of each line, but both of these solutions seem inefficient. It seems like there should be a well-understood algorithm for this kind of thing?

Answer 1

There are two questions I need to ask before giving you the best solution:

Is the text in lexicographical order?
If not, how accurate is the alphabetical ordering? (That is, how many leading characters of a line can be relied on to be sorted correctly?)

If your file is in lexicographical order, you're in luck. You'll be able to use a modification of a binary search to narrow down the lines that start with your given string.

If your file is only in alphabetical order up to a certain number of characters, you can narrow it down with the same binary search, but only for as many characters as the ordering is accurate. After that, you'll sadly need to check the remaining candidate lines one by one.
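As a rough sketch of that second case, assuming the lines are reliably sorted on their first k characters only: binary-search on those k characters, then test each remaining candidate with startswith. The names build_keys and prefix_matches_partial, and the parameter k, are made up for this illustration.

```python
import bisect


def build_keys(lines, k):
    """One-time setup: keep the first k characters of every line.

    Assumption: `lines` is sorted correctly on these k characters only.
    """
    return [line[:k] for line in lines]


def prefix_matches_partial(lines, keys, k, prefix):
    """Binary-search on the reliable part, then check candidates one by one."""
    q = prefix[:k]  # the part of the query the sort order can be trusted for
    i = bisect.bisect_left(keys, q)  # first key that is >= q
    matches = []
    # All keys that start with q form one contiguous run beginning at i.
    while i < len(keys) and keys[i].startswith(q):
        if lines[i].startswith(prefix):
            matches.append(lines[i])
        i += 1
    return matches
```

The binary search costs O(log n); the scan afterwards is linear in the number of lines that share the first k characters with the query, so a larger reliable k means less one-by-one checking.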

Here is my best attempt at code for the fully sorted case:

# Example data; replace with your own lines. They must be sorted lexicographically.
lines = sorted(["apple", "apply", "banana", "band", "bandit", "cat"])
givenstring = "ban"

# Finding the first instance: the first line that is >= givenstring.
low = 0
high = len(lines)
while low < high:
    mid = (low + high) // 2
    if lines[mid] < givenstring:
        low = mid + 1
    else:
        high = mid
firstinstance = low

# Finding the last instance: in a sorted list the matching lines are
# contiguous, so search for the first line after them that does not match.
high = len(lines)
while low < high:
    mid = (low + high) // 2
    if lines[mid].startswith(givenstring):
        low = mid + 1
    else:
        high = mid
lastinstance = low - 1  # smaller than firstinstance if nothing matches

# The matching lines are lines[firstinstance:lastinstance + 1].
print(firstinstance)
print(lastinstance)
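
Since you want to query repeatedly, the same search can be wrapped in a function and built on Python's standard bisect module. This is only a sketch: the function names and the path argument are made up for the example, and it assumes no line contains U+10FFFF (the highest Unicode code point), which is used as an upper-bound sentinel.

```python
import bisect

# Assumption: no line contains U+10FFFF, so prefix + SENTINEL sorts after
# every line that starts with prefix.
SENTINEL = "\U0010FFFF"


def load_sorted_lines(path):
    """One-time setup: read the file and sort its lines by code point."""
    with open(path, encoding="utf-8") as f:
        return sorted(line.rstrip("\n") for line in f)


def lines_with_prefix(sorted_lines, prefix):
    """Return every line in sorted_lines that starts with prefix."""
    start = bisect.bisect_left(sorted_lines, prefix)
    end = bisect.bisect_right(sorted_lines, prefix + SENTINEL, lo=start)
    return sorted_lines[start:end]


if __name__ == "__main__":
    example = sorted(["apple", "apply", "banana", "band", "bandit", "cat"])
    print(lines_with_prefix(example, "ban"))
```

Sorting 700k lines once is the expensive part; each query afterwards is two binary searches plus the cost of slicing out the matches.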
