
TCLAP makes multithreaded program slower

TCLAP is a C++ templatized header-only library for parsing command-line arguments.

I'm using TCLAP to process command-line arguments in a multi-threaded program: the arguments are read in the main function, then multiple threads are initiated to work on the task defined by the arguments (some parameters for an NLP task).

I've started displaying the number of words per second processed by the threads, and I've found that if I hard-code the arguments in main instead of reading them from the command line using TCLAP, the throughput is 6 times higher!

I'm using gcc with the -O2 flag, which gives about a 10x speedup over an unoptimized build (when TCLAP is not being used), so it seems that using TCLAP somehow negates part of the benefit of compiler optimization.

Here's what the main function, the only place I use TCLAP, looks like:

int main(int argc, char** argv)
{
    uint32_t mincount;
    uint32_t dim;
    uint32_t contexthalfwidth;
    uint32_t negsamples;
    uint32_t numthreads;
    uint32_t randomseed;
    string corpus_fname;
    string output_basefname;
    string vocab_fname;

    Eigen::initParallel();

    try {
        TCLAP::CmdLine cmd("Driver for various word embedding models", ' ', "0.1");
        TCLAP::ValueArg<uint32_t> dimArg("d", "dimension", "dimension of word representations", false, 300, "uint32_t");
        TCLAP::ValueArg<uint32_t> mincountArg("m", "mincount", "required minimum occurrence count to be added to vocabulary", false, 5, "uint32_t");
        TCLAP::ValueArg<uint32_t> contexthalfwidthArg("c", "contexthalfwidth", "half window size of a context frame", false, 15, "uint32_t");
        TCLAP::ValueArg<uint32_t> numthreadsArg("t", "numthreads", "number of threads", false, 12, "uint32_t");
        TCLAP::ValueArg<uint32_t> negsamplesArg("n", "negsamples", "number of negative samples for skipgram model", false, 15, "uint32_t");
        TCLAP::ValueArg<uint32_t> randomseedArg("s", "randomseed", "seed for random number generator", false, 2014, "uint32_t");
        TCLAP::UnlabeledValueArg<string> corpus_fnameArg("corpusfname", "file containing the training corpus, one paragraph or sentence per line", true, "corpus", "corpusfname");
        TCLAP::UnlabeledValueArg<string> output_basefnameArg("outputbasefname", "base filename for the learnt word embeddings", true, "wordreps-", "outputbasefname");
        TCLAP::ValueArg<string> vocab_fnameArg("v", "vocabfname", "filename for the vocabulary and word counts", false, "wordsandcounts.txt", "filename");

        // Register every argument with the parser before calling parse().
        cmd.add(dimArg);
        cmd.add(mincountArg);
        cmd.add(contexthalfwidthArg);
        cmd.add(numthreadsArg);
        cmd.add(negsamplesArg);   // was missing: -n was never registered with the parser
        cmd.add(randomseedArg);
        cmd.add(corpus_fnameArg);
        cmd.add(output_basefnameArg);
        cmd.add(vocab_fnameArg);
        cmd.parse(argc, argv);

        mincount = mincountArg.getValue();
        dim = dimArg.getValue();
        contexthalfwidth = contexthalfwidthArg.getValue();
        negsamples = negsamplesArg.getValue();
        numthreads = numthreadsArg.getValue();
        randomseed = randomseedArg.getValue();
        corpus_fname = corpus_fnameArg.getValue();
        output_basefname = output_basefnameArg.getValue();
        vocab_fname = vocab_fnameArg.getValue();
    }
    catch (TCLAP::ArgException &e) {
        // Report the error and stop rather than continuing with uninitialized values.
        std::cerr << "error: " << e.error() << " for arg " << e.argId() << std::endl;
        return 1;
    }

    // Hard-coded alternative used for the comparison:
    /*
    uint32_t mincount = 5;
    uint32_t dim = 50;
    uint32_t contexthalfwidth = 15;
    uint32_t negsamples = 15;
    uint32_t numthreads = 10;
    uint32_t randomseed = 2014;
    string corpus_fname = "imdbtrain.txt";
    string output_basefname = "wordreps-";
    string vocab_fname = "wordsandcounts.txt";
    */

    string test_fname = "imdbtest.txt";
    string output_fname = "parreps.txt";
    string countmat_fname = "counts.hdf5";
    Vocabulary * vocab;

    vocab = determineVocabulary(corpus_fname, mincount);
    vocab->dump(vocab_fname);

    Par2VecModel p2vm = Par2VecModel(corpus_fname, vocab, dim, contexthalfwidth, negsamples, randomseed);
    p2vm.learn(numthreads);
    p2vm.save(output_basefname);
    p2vm.learnparreps(test_fname, output_fname, numthreads);
}

The only place multithreading is used is in the Par2VecModel::learn function:

void Par2VecModel::learn(uint32_t numthreads) {
    thread* workers;
    workers = new thread[numthreads];
    uint64_t numwords = 0;      // shared word counter, passed by reference to every thread
    bool killflag = false;      // tells the monitor thread to stop
    uint32_t randseed;

    // Open at the end of the file so tellg() reports its size.
    ifstream filein(corpus_fname.c_str(), ifstream::ate | ifstream::binary);
    uint64_t filesize = filein.tellg();

    fprintf(stderr, "Total number of in vocab words to train over: %u\n", vocab->gettotalinvocabwords());

    // Start one training thread per requested worker, each with its own seed.
    for(uint32_t idx = 0; idx < numthreads; idx++) {
        randseed = eng();
        workers[idx] = thread(skipgram_training_thread, this, numthreads, idx, filesize, randseed, std::ref(numwords));
    }

    thread monitor(monitor_training_thread, this, numthreads, std::ref(numwords), std::ref(killflag));

    for(uint32_t idx = 0; idx < numthreads; idx++)
        workers[idx].join();

    killflag = true;
    monitor.join();

    delete[] workers;           // the original leaked this array
}

This section does not involve TCLAP at all, so what's going on? (I'm also using C++11 features, so I compile with the -std=c++11 flag, if that makes a difference.)
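A side note on the threading listing, separate from the TCLAP question: `numwords` and `killflag` are plain variables shared by reference across threads. The worker and monitor functions are not shown, so whether they synchronize access is unknown; if they do not, that is a data race under the C++11 memory model. Below is a minimal sketch of the same start/join/monitor shape using `std::atomic`. The names `worker` and `run_workers` are hypothetical stand-ins for the real training code, not part of the original program.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for skipgram_training_thread: "processes" a fixed
// number of words and bumps the shared counter once per word.
void worker(std::atomic<uint64_t>& numwords, uint64_t words_to_process) {
    for (uint64_t i = 0; i < words_to_process; ++i)
        numwords.fetch_add(1, std::memory_order_relaxed);
}

// Starts the workers and a monitor, joins them in the same order as
// Par2VecModel::learn, and returns the final word count.
uint64_t run_workers(uint32_t numthreads, uint64_t words_per_thread) {
    std::atomic<uint64_t> numwords(0);
    std::atomic<bool> killflag(false);

    std::vector<std::thread> workers;
    for (uint32_t idx = 0; idx < numthreads; ++idx)
        workers.emplace_back(worker, std::ref(numwords), words_per_thread);

    // The monitor polls the counter until it is told to stop; a real one
    // would print words per second here.
    std::thread monitor([&numwords, &killflag] {
        while (!killflag.load()) {
            (void)numwords.load();
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    for (std::thread& w : workers)
        w.join();

    killflag = true;
    monitor.join();
    return numwords.load();
}
```

Using `std::vector<std::thread>` also removes the need for the manual `new[]`/`delete[]` pair. With gcc this typically needs `-pthread` in addition to `-std=c++11`.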

This has been open for a long time, so the suggestion may no longer be useful, but I'd first check what happens if you replace TCLAP with a "simple" parser (i.e. pass the arguments on the command line in a specific fixed order and convert each one to the right type). It's highly unlikely that the issue is due to TCLAP itself; I can't imagine any mechanism for such behavior. It is conceivable that with hard-coded values the compiler can do some compile-time optimizations that aren't possible when those values are runtime variables. Even so, a 6x difference seems pathological, so I still suspect that something else is going on.
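The fixed-order parser suggested above could look roughly like the following. It is a sketch, not the asker's code: `Params`, `to_u32`, `parse_positional` and `describe` are hypothetical names, and the argument order is an arbitrary choice. Printing the effective values is worth doing in either build, because the listing in the question uses different numbers on the two paths: the TCLAP defaults are dimension 300 and 12 threads, while the commented-out hard-coded block uses dimension 50 and 10 threads. Unless `-d 50 -t 10` was passed explicitly, the two runs were not training the same model.

```cpp
#include <cstdint>
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical parameter bundle; field names mirror the variables in main().
struct Params {
    uint32_t mincount;
    uint32_t dim;
    uint32_t contexthalfwidth;
    uint32_t negsamples;
    uint32_t numthreads;
    uint32_t randomseed;
    std::string corpus_fname;
    std::string output_basefname;
    std::string vocab_fname;
};

// Convert one argument to uint32_t, rejecting negative numbers, trailing
// junk, and values that do not fit in 32 bits.
static uint32_t to_u32(const char* s) {
    if (s[0] == '-')
        throw std::invalid_argument(std::string("negative value: ") + s);
    std::size_t pos = 0;
    unsigned long v = std::stoul(s, &pos);  // throws on non-numeric input
    if (s[pos] != '\0' || v > std::numeric_limits<uint32_t>::max())
        throw std::invalid_argument(std::string("bad integer: ") + s);
    return static_cast<uint32_t>(v);
}

// Assumed fixed order:
//   mincount dim contexthalfwidth negsamples numthreads randomseed
//   corpus_fname output_basefname vocab_fname
Params parse_positional(int argc, char** argv) {
    if (argc != 10)
        throw std::invalid_argument("expected exactly 9 arguments");
    Params p;
    p.mincount = to_u32(argv[1]);
    p.dim = to_u32(argv[2]);
    p.contexthalfwidth = to_u32(argv[3]);
    p.negsamples = to_u32(argv[4]);
    p.numthreads = to_u32(argv[5]);
    p.randomseed = to_u32(argv[6]);
    p.corpus_fname = argv[7];
    p.output_basefname = argv[8];
    p.vocab_fname = argv[9];
    return p;
}

// One-line summary of the values that will actually reach the model,
// so the parsed run and the hard-coded run can be compared directly.
std::string describe(const Params& p) {
    std::ostringstream out;
    out << "mincount=" << p.mincount
        << " dim=" << p.dim
        << " contexthalfwidth=" << p.contexthalfwidth
        << " negsamples=" << p.negsamples
        << " numthreads=" << p.numthreads
        << " randomseed=" << p.randomseed
        << " corpus=" << p.corpus_fname
        << " output=" << p.output_basefname
        << " vocab=" << p.vocab_fname;
    return out.str();
}
```

In main this would be called as `Params p = parse_positional(argc, argv);` followed by writing `describe(p)` to stderr before training starts. If the throughput gap disappears once both builds report the same `dim` and `numthreads`, the parser was never the cause.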
