What is the fastest way to search strings in a file?

I have a very large file (100 MB) of strings, one per line, and I am looking for a performant way to check whether a given string is present in the file. The input string must match a whole line.

My idea is that a program loads the file, and after that, it can be queried if the string exists or not.

void loadfile(string filename);
bool stringAvailable(string str);

The loadfile() function does not need to be fast, since it is called occasionally. But stringAvailable() needs to be as performant as possible.

So far I have tried:

1. Letting the Linux command line tools do the job for me:

system("cat lookup | grep \"^example$\"");

=> Not very fast.

2. Having a MySQL database instead of a file (I tried MyISAM and InnoDB) and query it like SELECT count(*) FROM lookup WHERE str = 'xyz'

=> Very fast, but it could still be faster. Also, it would be better to have a program that does not depend on a DBMS.

3. Having an array of strings (string[] ary) and comparing all values in a for() loop.

=> Not very fast. I guess it can be optimized with hash tables, which I am currently experimenting with.

Are there other possibilities to make the process even more performant?

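Store all the lines from the file in a std::unordered_set. A lookup is then a single hash computation plus a string comparison on average, regardless of how many lines the file has.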

#include <iostream>
#include <unordered_set>
#include <string>

int main(int argc, char **argv)
{
    std::unordered_set<std::string> lines;
    lines.insert("line 1");
    lines.insert("line 2");

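    // Guard against a missing argument before reading argv[1].
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <string>\n";
        return 1;
    }

    std::string needle = argv[1];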
    if (lines.find(needle) != lines.end())
        std::cout << "found\n";
    else
        std::cout << "NOT found\n";

    return 0;
}
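
To connect this with the two functions from the question, here is a minimal sketch that fills the set from a file. The global g_lines and the trailing '\r' handling are my own assumptions for illustration, not something the question specifies.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Holds every line of the lookup file; filled once by loadfile().
// (Hypothetical global, used here only to keep the sketch short.)
static std::unordered_set<std::string> g_lines;

// Reads the file line by line and stores each line in the set.
void loadfile(const std::string &filename)
{
    std::ifstream in(filename);
    if (!in)
        throw std::runtime_error("cannot open " + filename);

    g_lines.clear();
    std::string line;
    while (std::getline(in, line)) {
        // Assumption: drop a trailing '\r' so CRLF files still match.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        g_lines.insert(std::move(line));
    }
}

// Whole-line lookup; average cost is one hash plus one string comparison.
bool stringAvailable(const std::string &str)
{
    return g_lines.find(str) != g_lines.end();
}
```

If you know roughly how many lines to expect, calling g_lines.reserve() before the loop should reduce rehashing during the load.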

First of all, load the file into memory. I'm guessing you have enough.

Then I would try a linear search within memory. Look for the first character of your string; when you find it, check whether the following characters match the rest of the string. If they do not match, continue searching for the first character from the next position, and so on.
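
A rough sketch of that linear search, leaving the character-by-character scan to std::string::find and adding a line-boundary check, because the question wants whole lines to match. The helper names readAll and containsLine are made up for this example, and it assumes '\n' line endings and a non-empty search string.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Reads the whole file into one string (hypothetical helper).
std::string readAll(const std::string &filename)
{
    std::ifstream in(filename, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + filename);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Returns true if `needle` occurs in `haystack` as a complete line.
// Assumes '\n' line endings and a non-empty needle.
bool containsLine(const std::string &haystack, const std::string &needle)
{
    std::size_t pos = 0;
    while ((pos = haystack.find(needle, pos)) != std::string::npos) {
        std::size_t end = pos + needle.size();
        bool startsLine = (pos == 0) || haystack[pos - 1] == '\n';
        bool endsLine = (end == haystack.size()) || haystack[end] == '\n';
        if (startsLine && endsLine)
            return true;
        ++pos;  // matched only part of a line: keep scanning
    }
    return false;
}
```

Every query still walks the buffer, so the cost grows with the file size.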

Does the file follow a pattern, or is it sorted in some way? If so, you may be able to optimize even further.
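
For example, if the lines are sorted (or you sort them once while loading), a binary search needs only O(log n) string comparisons per query. A minimal sketch, with loadSorted and containsSorted as hypothetical names of my own:

```cpp
#include <algorithm>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Loads all lines and sorts them once so that lookups can use binary search.
std::vector<std::string> loadSorted(const std::string &filename)
{
    std::ifstream in(filename);
    if (!in)
        throw std::runtime_error("cannot open " + filename);

    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);

    // Can be skipped if the file is already sorted in the same byte order.
    std::sort(lines.begin(), lines.end());
    return lines;
}

// O(log n) string comparisons per query on the sorted vector.
bool containsSorted(const std::vector<std::string> &lines, const std::string &str)
{
    return std::binary_search(lines.begin(), lines.end(), str);
}
```

A sorted vector usually takes less memory than a hash set of the same strings, so it may be worth measuring both on your data.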

Also, try passing the strings by const reference, like this:

void loadfile(const string &filename);
bool stringAvailable(const string &str);

It might avoid unnecessary copies.
