
C++ and reading large txt files

I have a lot of txt files, around 10GB. What should I use in my program to merge them into one file without duplicates? I want to make sure each line in my output file will be unique.

I was thinking about building some kind of hash tree and using MPI. I want it to be efficient.

  1. Build a table of files, so that every filename gets a simple number (a std::vector<std::string> works just fine for that).
  2. For each file in the table: open it, then do the following:
  3. Read a line. Hash the line.
  4. Have a std::map that maps line hashes (step 3) to std::pair<uint32_t filenumber, size_t byte_start_of_line> (use a std::multimap if several different lines with the same hash must all be remembered). If your new line's hash is already in the map, open the specified file, seek to the specified position, and check whether your new line and the old line are identical or just share the same hash.
  5. If identical, skip; if different or not yet present: add a new entry to the map and write the line to the output file.
  6. Read the next line (i.e., go to step 3).
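The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the names LineLocation, read_line_at and merge_unique are made up for this example, std::hash<std::string> stands in for whatever hash you prefer, and a std::unordered_multimap is used instead of std::map so that two different lines with the same hash can both be recorded. Files are opened in binary mode so that the byte offsets counted by hand match what seekg expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper type: where a previously written line lives.
struct LineLocation {
    std::uint32_t file_number;   // index into the filename table (step 1)
    std::streamoff byte_start;   // offset of the line's first byte in that file
};

// Step 4: re-read the stored line so it can be compared with the new one.
std::string read_line_at(const std::vector<std::string>& files,
                         const LineLocation& loc) {
    std::ifstream in(files[loc.file_number], std::ios::binary);
    in.seekg(loc.byte_start);
    std::string line;
    std::getline(in, line);
    return line;
}

// Steps 2-6: merge all files into output_path, writing each distinct line once.
// Returns the number of lines written.
std::size_t merge_unique(const std::vector<std::string>& files,
                         const std::string& output_path) {
    std::unordered_multimap<std::size_t, LineLocation> seen;
    std::ofstream out(output_path, std::ios::binary);
    std::size_t written = 0;

    for (std::uint32_t n = 0; n < files.size(); ++n) {
        std::ifstream in(files[n], std::ios::binary);
        std::string line;
        std::streamoff start = 0;   // offset of the line about to be read
        while (std::getline(in, line)) {
            const std::size_t h = std::hash<std::string>{}(line);

            // Compare against every earlier line that has the same hash.
            bool duplicate = false;
            const auto range = seen.equal_range(h);
            for (auto it = range.first; it != range.second && !duplicate; ++it)
                duplicate = (read_line_at(files, it->second) == line);

            if (!duplicate) {
                seen.emplace(h, LineLocation{n, start});
                out << line << '\n';
                ++written;
            }
            // getline consumed the line plus its '\n' delimiter.
            start += static_cast<std::streamoff>(line.size()) + 1;
        }
    }
    return written;
}
```

Calling merge_unique({"a.txt", "b.txt"}, "merged.txt") would write the distinct lines in the order they are first seen. Every hash match costs an extra open and seek, so inputs with many repeated lines will be noticeably slower than inputs that are mostly unique.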

This only takes the RAM needed for the longest line, plus the filenames and file numbers with their overhead, plus the space for the map, which should be far less than the actual lines unless the lines are very short. Since 10GB isn't really much text, hash collisions are relatively unlikely with a 64-bit hash, so you might as well skip the "check with the existing file" part if you don't need certainty but only a sufficiently high probability that all lines are in your output.
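To get a feeling for how unlikely a collision is, the usual birthday-bound approximation can be evaluated directly. The sketch below assumes a uniformly distributed hash, which std::hash does not guarantee, and the function name collision_probability is made up for this example.

```cpp
#include <cmath>

// Birthday-bound approximation: probability that at least two of n distinct
// lines share a hash value, assuming a uniformly distributed hash of the
// given width in bits.  p ~= 1 - exp(-n(n-1) / (2 * 2^bits))
double collision_probability(double n, int bits) {
    const double buckets = std::ldexp(1.0, bits);   // 2^bits
    // expm1 keeps precision when the exponent is very close to zero.
    return -std::expm1(-n * (n - 1.0) / (2.0 * buckets));
}
```

As a purely hypothetical figure, 10GB at an average of 100 bytes per line is about 10^8 lines. Under that assumption collision_probability(1e8, 64) comes out at roughly 3 in 10,000, while a 32-bit hash would collide almost certainly, so the shortcut is only reasonable with a hash of at least 64 bits.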

If you don't have requirements to keep the memory usage low, you could just read all the lines from all the files into a std::set or std::unordered_set. An unordered_set is, as the name implies, not ordered in any particular way, while a set is (lexicographical sort order). I've chosen a std::set here, but you can try a std::unordered_set to see if that speeds things up a little.

Example:

#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>

int cppmain(std::string_view program, std::vector<std::string_view> args) {
    if(args.empty()) {
        std::cerr << "USAGE: " << program << " files...\n";
        return 1;
    }

    std::set<std::string> result;   // to store all the unique lines

    // loop over all the filenames the user supplied
    for(auto& filename : args) {

        // try to open the file (data() is null-terminated here because
        // every view refers to a complete argv string)
        if(std::ifstream ifs(filename.data()); ifs) {
            std::string line;

            // read all lines and put them in the set:
            while(std::getline(ifs, line)) result.insert(line);
        } else {
            std::cerr << filename << ": " << std::strerror(errno) << '\n';
            return 1;
        }
    }

    for(auto line : result) {
        // ... manipulate the unique line here ...

        std::cout << line << '\n'; // and print the result
    }
    return 0;
}

int main(int argc, char* argv[]) {
    return cppmain(argv[0], {argv + 1, argv + argc});
}
