I have a lot of txt files, around 10GB in total. What should I use in my program to merge them into one file without duplicates? I want to make sure each line in my output file is unique.
I was thinking about building some kind of hash tree and using MPI. I want it to be efficient.
std::vector<std::string> works just fine for that).
std::map that maps line hashes (step 3) to std::pair<uint32_t filenumber, size_t byte_start_of_line>. If your new line hash is already in the hash table, open the specified file, seek to the specified position, and check whether your new line and the old line are identical or just share the same hash.
This only takes the RAM needed for the longest line, plus enough RAM for the filenames and file numbers plus overhead, plus the space for the map, which should be far less than the actual lines. Since 10GB isn't really much text, hash collisions are relatively unlikely, so you can skip the "check with the existing file" part if you don't need certainty, only a sufficiently high probability that all lines are in your output.
If you don't have requirements to keep the memory usage low, you could just read all the lines from all the files into a std::set or std::unordered_set. An unordered_set is, as the name implies, not ordered in any particular way, while a set is (lexicographical sort order). I've chosen a std::set here, but you can try with a std::unordered_set to see if that speeds things up a little.
Example:
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <string_view>
#include <vector>

int cppmain(std::string_view program, std::vector<std::string_view> args) {
    if(args.empty()) {
        std::cerr << "USAGE: " << program << " files...\n";
        return 1;
    }

    std::set<std::string> result; // to store all the unique lines

    // loop over all the filenames the user supplied
    for(auto& filename : args) {
        // try to open the file
        // (filename.data() is null-terminated here because it comes from argv)
        if(std::ifstream ifs(filename.data()); ifs) {
            std::string line;
            // read all lines and put them in the set:
            while(std::getline(ifs, line)) result.insert(line);
        } else {
            std::cerr << filename << ": " << std::strerror(errno) << '\n';
            return 1;
        }
    }

    // iterate by const reference to avoid copying every line
    for(const auto& line : result) {
        // ... manipulate the unique line here ...
        std::cout << line << '\n'; // and print the result
    }
    return 0;
}

int main(int argc, char* argv[]) {
    return cppmain(argv[0], {argv + 1, argv + argc});
}