
Search a large file for data in C/C++

I have a log file which has a format of this kind:

DATE-TIME ### attribute1  ### attribute2 ###attribute3

I have to search this log file for an input attribute (entered from the command line) and output the lines that match the entered attribute.
A naive approach may be something like this:

scan the entire file line by line
search for the attribute
print if found, else ignore.
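
A minimal sketch of such a naive scan in C++ (the log file and attribute are taken from the command line; the whole-line substring match is a simplification):

// naive_scan.cpp -- minimal sketch of the line-by-line approach.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <logfile> <attribute>\n";
        return 1;
    }
    std::ifstream log(argv[1]);
    if (!log) {
        std::cerr << "cannot open " << argv[1] << "\n";
        return 1;
    }
    const std::string attribute = argv[2];
    std::string line;
    while (std::getline(log, line)) {
        // Print any line that contains the attribute anywhere.
        if (line.find(attribute) != std::string::npos)
            std::cout << line << '\n';
    }
    return 0;
}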

This approach is slow, as it would require O(n) comparisons, where n is the number of lines, which may be very large.
Another approach may be to use a hash table, but keeping such an in-memory hash table for a big file may not be possible.
So, what is the best feasible solution? How can I possibly index the entire file on various attributes?


The log file may be about 100K lines, much like the system log files on Linux. In one invocation, a user may search for multiple attributes, and each attribute is not known until the search for the previous one has completed, as in an interactive console.

Thanks,

If you only search once, you can't do better than O(n).

If a hash index is too big to fit in memory, use an on-disk hash like dbm or gdbm.
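
For illustration, a rough sketch of keeping attribute-to-offset records in GNU gdbm; the index file name and the comma-separated offset encoding are my own choices, and error handling is minimal:

// gdbm_index.cpp -- sketch of an on-disk hash mapping an attribute value
// to a list of file offsets where it occurs in the log.
#include <gdbm.h>
#include <cstdlib>
#include <string>

// Append one file offset (stored as text) to the record kept under `attr`.
void add_offset(GDBM_FILE dbf, const std::string& attr, long file_offset) {
    datum key;
    key.dptr  = const_cast<char*>(attr.c_str());
    key.dsize = static_cast<int>(attr.size());

    std::string offsets;
    datum old = gdbm_fetch(dbf, key);          // malloc'd copy, or dptr == NULL if absent
    if (old.dptr != nullptr) {
        offsets.assign(old.dptr, old.dsize);
        std::free(old.dptr);
        offsets += ',';
    }
    offsets += std::to_string(file_offset);

    datum value;
    value.dptr  = const_cast<char*>(offsets.c_str());
    value.dsize = static_cast<int>(offsets.size());
    gdbm_store(dbf, key, value, GDBM_REPLACE); // overwrite with the extended list
}

int main() {
    // "log.idx" is an arbitrary file name for this sketch.
    GDBM_FILE dbf = gdbm_open("log.idx", 0, GDBM_WRCREAT, 0644, nullptr);
    if (dbf == nullptr) return 1;
    add_offset(dbf, "attribute1", 0);
    add_offset(dbf, "attribute1", 4096);
    gdbm_close(dbf);
    return 0;
}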

EDIT: I'd like to point out that the Berkeley DB tool that KeithB suggested falls into this category of on-disk hashes. Berkeley DB is not a SQL-based relational database.

You could use Berkeley DB to index this file. Basically, go through the entire file once, and for each attribute found, store the file position of the attribute. Berkeley DB uses an efficient B-tree implementation, and does not require storing the entire index in memory, just the parts that are needed. I've done this before with good success.
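
Along those lines, a rough sketch with the Berkeley DB C API; the database name "log.db", the single offset per key, and the hard-coded values are simplifications for illustration only:

// bdb_index.cpp -- sketch of indexing "attribute -> file offset" with Berkeley DB.
#include <db.h>
#include <cstdio>
#include <cstring>
#include <string>

int main() {
    DB* dbp = nullptr;
    if (db_create(&dbp, nullptr, 0) != 0) return 1;

    // Open (or create) an on-disk B-tree keyed by attribute value.
    if (dbp->open(dbp, nullptr, "log.db", nullptr, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    std::string attr = "attribute1";
    long offset = 12345;                           // byte position of the line in the log

    DBT key, data;
    std::memset(&key, 0, sizeof(key));
    std::memset(&data, 0, sizeof(data));
    key.data = const_cast<char*>(attr.c_str());
    key.size = static_cast<u_int32_t>(attr.size());
    data.data = &offset;
    data.size = sizeof(offset);

    dbp->put(dbp, nullptr, &key, &data, 0);        // store attribute -> offset

    // Later: look the attribute up and seek to the stored offset in the log file.
    DBT result;
    std::memset(&result, 0, sizeof(result));
    if (dbp->get(dbp, nullptr, &key, &result, 0) == 0) {
        long found;
        std::memcpy(&found, result.data, sizeof(found));
        std::printf("found at offset %ld\n", found);
    }

    dbp->close(dbp, 0);
    return 0;
}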

You can reduce the size of the hash table by only storing hash values and file offsets in it. If the attributes only have a fixed, relatively small number of values, you are more likely to be able to fit the whole hash table in memory. You assign an id to each possible value of the attribute, and then for each id value store a big list of file offsets.
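
A simplified sketch of that idea, using the attribute value itself as the key of an in-memory map of file offsets (the "###" splitting follows the format from the question, and the query value "attribute2" is illustrative):

// offset_index.cpp -- in-memory index storing only attribute values and offsets.
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Strip leading/trailing blanks from a field.
static std::string trim(const std::string& s) {
    std::size_t b = s.find_first_not_of(' ');
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(' ');
    return s.substr(b, e - b + 1);
}

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    std::ifstream log(argv[1]);
    if (!log) return 1;

    // attribute value -> byte offsets of the lines containing it
    std::unordered_map<std::string, std::vector<std::streamoff>> index;

    std::string line;
    std::streamoff offset = log.tellg();
    while (std::getline(log, line)) {
        std::size_t pos = 0, next;
        while ((next = line.find("###", pos)) != std::string::npos) {
            pos = next + 3;
            std::size_t stop = line.find("###", pos);
            std::string field = line.substr(
                pos, stop == std::string::npos ? std::string::npos : stop - pos);
            index[trim(field)].push_back(offset);
        }
        offset = log.tellg();
    }

    // To answer a query, seek back to each stored offset and re-read the line.
    for (std::streamoff off : index["attribute2"]) {
        log.clear();
        log.seekg(off);
        std::getline(log, line);
        std::cout << line << '\n';
    }
    return 0;
}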

Of course the hash table is only going to be helpful if, within the same run of the program, you do several different searches.

The obvious solution would be to stuff the data in a database, but I assume that the OP is smart enough to have realized that already and has other reasons for specifically requesting a non-database solution to the problem.

Sounds like a job for a database system.

How can I possibly index the entire file on various attributes?

You are really left with implementing a DB solution under the hood. Your best bets are probably some offline search algorithms and maintaining an index file.

You may also find Map-Reduce of interest.

There is no efficient method for directly searching a text file of variable-length lines. Most efficient methods require overhead to transform the data into a form better suited to faster searching. This overhead may not be worthwhile for infrequent searches.

Infrequent Searching

If the file is only searched once or not very frequently, I suggest a brute-force line-by-line search. Other methods may waste time setting up the data for a faster search, for example when your program only needs to find the first line containing one or more attributes.

Frequent searching

The program is required to search the file many times. In this case, you may want to set up some data structures for better searching. One technique is to create index files. These files contain the file positions of attributes, sorted by attribute. Something like [attribute][first occurrence][second occurrence], etc.
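
A small sketch of dumping such an index to disk; the "log.idx" file name, the whitespace-separated layout, and the sample data are only illustrative:

// write_index.cpp -- write an attribute -> offsets index as a text file,
// one attribute per line followed by its occurrences.
#include <fstream>
#include <map>
#include <string>
#include <vector>

void write_index(const std::map<std::string, std::vector<long>>& index,
                 const std::string& path) {
    std::ofstream out(path);
    for (const auto& entry : index) {
        out << entry.first;                  // attribute value (std::map keeps keys sorted)
        for (long off : entry.second)
            out << ' ' << off;               // first occurrence, second occurrence, ...
        out << '\n';
    }
}

int main() {
    std::map<std::string, std::vector<long>> index = {
        {"attribute1", {0, 4096}},
        {"attribute2", {128}},
    };
    write_index(index, "log.idx");
    return 0;
}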

Another alternative is to have your program convert the file into a format that a better tool can use, such as a spreadsheet or a database. One example is Comma-Separated Values (CSV); some tools prefer the values separated by a '|'.
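
For example, a small sketch that rewrites the "###"-separated lines as comma-separated values (CSV quoting and escaping rules are deliberately ignored here):

// to_csv.cpp -- convert the "###"-separated log format into CSV on stdout.
#include <iostream>
#include <string>

// Replace every "###" separator with a comma and strip surrounding spaces.
std::string to_csv(const std::string& line) {
    std::string out;
    std::size_t pos = 0;
    while (pos <= line.size()) {
        std::size_t next = line.find("###", pos);
        std::string field = line.substr(
            pos, next == std::string::npos ? std::string::npos : next - pos);
        // trim leading/trailing blanks from the field
        std::size_t b = field.find_first_not_of(' ');
        std::size_t e = field.find_last_not_of(' ');
        if (!out.empty()) out += ',';
        out += (b == std::string::npos) ? "" : field.substr(b, e - b + 1);
        if (next == std::string::npos) break;
        pos = next + 3;
    }
    return out;
}

int main() {
    std::string line;
    while (std::getline(std::cin, line))
        std::cout << to_csv(line) << '\n';
    return 0;
}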

Changing the Generator

Better yet, the program that generates the log file could be changed to produce spreadsheet- or database-friendly log files. I did this with an embedded system: I fed the data into a spreadsheet and used spreadsheet functions to analyze the data. Much easier than writing a program to search (and analyze) the log file.

Do the search in a separate thread. It may take a while, but I think the trade-off of building and then searching a hash table is not worth it in this case.

If the user were repeatedly searching for different attributes it might be worthwhile, but I doubt it.

For the searching, try boost::regex or QRegExp.
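
For instance, a sketch using the standard library's std::regex, whose interface closely mirrors boost::regex; the pattern that anchors the attribute to a "###" field is only an example and assumes the attribute contains no regex metacharacters:

// regex_search.cpp -- print lines where the attribute appears as its own field.
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    // Match the attribute only when it appears as a "###"-delimited field.
    // (A real tool would escape regex metacharacters in argv[1] first.)
    std::regex pattern("###\\s*" + std::string(argv[1]) + "\\s*(###|$)");
    std::string line;
    while (std::getline(std::cin, line)) {
        if (std::regex_search(line, pattern))
            std::cout << line << '\n';
    }
    return 0;
}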

For just searching a log file, I think you might be over-engineering this with a database or hash table.

I would build an index using a b-tree or hash table. I would store the index on disk. If you are in control of updates to this file you can also update the index when the file is updated. If you are not in control of updates then use a hash of the file to determine if it needs to be regenerated.

After thinking about this I realized that I commonly search log files with 600k+ lines (100M) using grep on the command line. It is quite fast. So, I don't think 100k is a problem unless your files have a lot more data.

You might also want to look into using log rotation if the log files are getting large.

You could try a straightforward linear search of the file, but launch a worker thread to do your searching. That way, you can spawn multiple worker threads and start them at different points in the file. On a machine with multiple processors/cores, this parallelism should produce a performance boost over a plain, single-threaded linear search.
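
A sketch of that approach: each worker thread scans its own byte range of the file and skips the partial line at its starting boundary, so every line is reported by exactly one thread. The chunking and locking choices here are illustrative:

// parallel_scan.cpp -- split the file into byte ranges, one worker thread per range.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex out_mutex;   // serializes printing of matching lines

void scan_range(const std::string& path, const std::string& attr,
                std::streamoff begin, std::streamoff end) {
    std::ifstream in(path);
    in.seekg(begin);
    std::string line;
    std::streamoff pos = begin;
    if (begin != 0) {
        std::getline(in, line);                               // skip the partial line
        pos = static_cast<std::streamoff>(in.tellg());
    }
    // Process lines that start at or before `end`; the next thread skips them.
    while (pos != -1 && pos <= end && std::getline(in, line)) {
        if (line.find(attr) != std::string::npos) {
            std::lock_guard<std::mutex> lock(out_mutex);
            std::cout << line << '\n';
        }
        pos = static_cast<std::streamoff>(in.tellg());
    }
}

int main(int argc, char* argv[]) {
    if (argc < 3) return 1;
    const std::string path = argv[1], attr = argv[2];

    std::ifstream probe(path, std::ios::ate);                 // open at end to get the size
    if (!probe) return 1;
    const std::streamoff size = probe.tellg();

    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(scan_range, path, attr,
                             size * i / n, size * (i + 1) / n);
    for (auto& t : workers) t.join();
    return 0;
}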

After you do a search, you may want to store the input parameters and the results of that search in a data file. That way, if the search is repeated in the future, you can use this cached data and will only have to search through the part of the file modified since the last search.

Don't worry. Just scan it.

Seriously. You say this file is 100K lines, which is almost exactly the same size as the /var/log/messages on the computer I'm typing this on. Well, this is a netbook, i.e. very slow by modern standards -- and yet a straightforward grep of my /var/log/messages is so quick that it is dwarfed by the amount of time taken to print the results to the screen.

So I really don't think you have anything to worry about -- particularly not if this is an interactive process that only runs searches when a human demands it, not some daemon that might be constantly searching for things in the background. You've quite possibly already wasted more time worrying about this problem than you could ever hope to save by implementing an index!

Come on, seriously guys. A database for log files?

Log files change all the time, or at the least every day. So what you probably really want is some kind of log file rotation, for which there are many premade tools, and failing that it could be written in a few hours if you know even a little Perl. You probably don't really need C++ for this, either; it will just make development slower and the end result buggier.
