C++ heap sort of vector<string,int>

I can't figure out where the problem in my heap sort is.
The program takes a filename from the command line and reads the words into a vector. That vector is then turned into a vector of pairs, vector<pair<string,int> >, where the string is the word and the int is the number of times that word occurs in the file.

The vector<PAIR> is then sorted by either the string (value or v) or by int (key or k). My sorting by key works fine, but sorting by value is off. I suspect I'm missing an if statement in max_heapify when sorting by value. Here's my code:

main.cpp

#include <fstream>
#include <iostream>
#include <stdlib.h>
#include <vector>
#include <string>
#include <string.h>
#include <stdio.h>
#include <map>
#include <time.h>
#include "readwords.h"

using namespace std;

readwords wordsinfile;
vector<string> allwords;
bool times;
char *filename;
timespec timestart,timeend;
vector< pair<string,int> > allwords_vp;

timespec diffclock(timespec start, timespec end);

int main ( int argc, char *argv[] ) {

    filename = argv[1];

    //Lets open the file 
    ifstream ourfile2(filename);

    //Lets get all the words using our requirements
    allwords = wordsinfile.getwords(ourfile2);
    //Convert all the words from file and count how many times they 
    //appear.  We will store them in a vector<string,int> string 
    //being the word and int being how many time the word appeared
    allwords_vp = wordsinfile.count_vector(allwords);

    cout << "HeapSort by Values" << endl;
    if (times) {
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &timestart);
        wordsinfile.heapsort(const_cast<char *>("v"));
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &timeend);
        // tv_nsec / 1000 is microseconds, not milliseconds
        cout << "HeapSort by Values ran in "
             << diffclock(timestart,timeend).tv_nsec << " nanosecond or "
             << diffclock(timestart,timeend).tv_nsec/1000 << " microsecond"
             << endl;
    } else {
        wordsinfile.heapsort(const_cast<char *>("v"));
    }

    cout << "HeapSort by Keys" << endl;
    if (times) {
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &timestart);
        wordsinfile.heapsort(const_cast<char *>("k"));
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &timeend);
        cout << "HeapSort by Keys ran in "
             << diffclock(timestart,timeend).tv_nsec << " nanosecond or "
             << diffclock(timestart,timeend).tv_nsec/1000 << " microsecond"
             << endl;
    } else {
        wordsinfile.heapsort(const_cast<char *>("k"));
    }
}

timespec diffclock(timespec start, timespec end) {
    timespec temp;
    if ((end.tv_nsec-start.tv_nsec)<0) {
        temp.tv_sec = end.tv_sec-start.tv_sec-1;
        temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec-start.tv_sec;
        temp.tv_nsec = end.tv_nsec-start.tv_nsec;
    }
    return temp;
}

readwords.h

#ifndef READWORDS_H
#define READWORDS_H

#include <string>
#include <vector>
#include <map>
#include <utility>
#include <time.h>

typedef std::pair<std::string, int> PAIR;

bool isasciifile(std::istream& file);

class readwords {
    private:
         std::vector<PAIR> vp;
    public:
         std::vector<std::string> getwords(std::istream& file);
         std::vector<PAIR> count_vector(std::vector<std::string> sv);
         void print_vectorpair(std::vector<PAIR> vp);
         void print_vector(std::vector<std::string> sv);     
         void heapsort(char how[]);
         void buildmaxheap(std::vector<PAIR> &vp, int heapsize, char how[]);
         void max_heapify(std::vector<PAIR> &vp, int i, int heapsize, char how[]);
         void swap_pair(PAIR &p1, PAIR &p2);
};

#endif

readwords.cpp

#include <fstream>
#include <iostream>
#include <map>
#include "readwords.h"
#include <vector>
#include <string>
#include <utility>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

//using std::vector;
using namespace std;

typedef pair<string, int> PAIR;

// Do we have a ASCII file?
// Lets test the second 10 chars to make sure
// This method is flawed if the file is less than 10 chars
bool isasciifile(std::istream& file) {
    int c = 0;
    bool foundbin = false;
    for(c=0; c < 10;c++) {
        if(!isprint(file.get())){
            // Looks like we found a non ASCII file, or its empty.
            foundbin = true;
        }
    }
    return foundbin;
}

// This is our workhorse as it splits up the words based on our criteria and
// passes them back as a vector of strings.
vector<string> readwords::getwords(std::istream& file) {
    char c;
    string aword;
    vector<string> sv;

    //Let go through the file till the end
    while(file.good()) {
        c = file.get();
        if (isalnum(c)) {
            //convert any uppercase to lowercase
            if(isupper(c)) {
                c = (tolower(c));
            }
            //if its a space lets go onto the next char
            if(isspace(c)) { continue; }
            //everything looks good lets add the char to our word
            aword.insert(aword.end(),c);
        } else {
            //its not a alphnum or a space so lets skip it
            if(!isspace(c)) { continue; }
            //reset our string and increment
            if (aword != "") {sv.push_back(aword);}
            aword = "";
            continue;
        }
    }
    return sv;
}

vector<PAIR> readwords::count_vector(vector<string> sv) {
    unsigned int i = 0;
    int j = 0;
    int match = 0;

    // cout << "Working with these string: " << endl;
    // print_vector(sv);

    for (i=0; i < sv.size(); i++) {
        // cout << "count of i: " << i << " word is: " << sv.at(i) << endl;

        match = 0;
        if(readwords::vp.size() == 0) {
            readwords::vp.push_back(make_pair(sv.at(i),1)); continue;
        }

        for (j=readwords::vp.size() - 1; j >= 0; --j) {
            if (sv.at(i) == readwords::vp.at(j).first) {
                // cout << "Match found with: " << sv.at(i) << endl;;
                readwords::vp.at(j).second = readwords::vp.at(j).second + 1;
                match = 1;
            } 

            // cout << "Value of j and match: " << j << match << endl;
            if ( j == 0 && match == 0) {
                // cout << "Match found at end with: " << sv.at(i) << endl;;
                readwords::vp.push_back(make_pair(sv.at(i),1));
            }
        }
    }

    //Prob need to sort by first data type then second here, prior to sort functions.
    //Might not be the best place as the sort functions would alter it, if not here
    //then each sort requires to do secondary search

    return readwords::vp;
}

void readwords::print_vectorpair(vector<PAIR> vp) {
    unsigned int i = 0;

    for (i=0; i < vp.size(); ++i) {
        cout << vp.at(i).first << " " << vp.at(i).second << endl;
    }
}

void readwords::print_vector(vector<string> sv) {
    unsigned int i = 0;

    for (i=0; i < sv.size(); ++i) {
        cout << sv.at(i) << endl;
    }
}

void readwords::heapsort(char how[]) {
    int heapsize = (readwords::vp.size() - 1);

    buildmaxheap(readwords::vp, heapsize, how);

    for(int i=(readwords::vp.size() - 1); i >= 0; i--) {
        swap(readwords::vp[0],readwords::vp[i]);
        heapsize--;
        max_heapify(readwords::vp, 0, heapsize, how);
    }

    print_vectorpair(readwords::vp);
}

void readwords::buildmaxheap(vector<PAIR> &vp, int heapsize, char how[]) {

    for(int i=(heapsize/2); i >= 0 ; i--) {
        max_heapify(vp, i, heapsize, how);
    }
}

void readwords::max_heapify(vector<PAIR> &vp, int i, int heapsize, char how[]) {
    int left = ( 2 * i ) + 1;
    int right = left + 1;
    int largest;

    if(!strcmp(how,"v")) {  
        if(left <= heapsize && vp.at(left).second >= vp.at(i).second ) {
            if( vp.at(left).first >= vp.at(i).first ) {
                largest = left;
            } else {
                largest = i;
            }
        } else {
            largest = i;
        }

        if(right <= heapsize && vp.at(right).second >= vp.at(largest).second) {
            if( vp.at(right).first >= vp.at(largest).first) {
                largest = right;
            }
        }   
    }

    if(!strcmp(how,"k")) {  
        if(left <= heapsize && vp.at(left).first > vp.at(i).first) {
            largest = left;
        } else {
            largest = i;
        }

        if(right <= heapsize && vp.at(right).first > vp.at(largest).first) {
            largest = right;
        }   
    }

    if(largest != i) {
        swap(vp[i], vp[largest]);
        max_heapify(vp, largest, heapsize, how);
    }
}

To sort a vector<pair<string, int> > by values, consider adding a second vector<pair<int, string> > with the components swapped, and sorting that:

// original is the vector<pair<string, int> > to be sorted
vector<pair<int, string> > v(original.size());
for (size_t i = 0; i < v.size(); ++i) v[i] = make_pair(original[i].second, original[i].first);
sort(v.begin(), v.end());
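If copying into a second vector is undesirable, the same order can probably be obtained by sorting the original vector with a comparator. The following is only a sketch, assuming the intended order is ascending count with ties broken by word; the names by_count_then_word and sort_by_count are made up for illustration and do not appear in the question.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> PAIR;

// Hypothetical comparator: true when a orders strictly before b,
// comparing the count first and the word only on equal counts.
bool by_count_then_word(const PAIR &a, const PAIR &b) {
    if (a.second != b.second) return a.second < b.second;
    return a.first < b.first;
}

// Hypothetical helper: sorts in place, no vector<pair<int, string> > needed.
void sort_by_count(std::vector<PAIR> &v) {
    std::sort(v.begin(), v.end(), by_count_then_word);
}
```

Calling sort_by_count(allwords_vp) would then leave the least frequent words at the front.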

The vector is then sorted by either the string (value or v) or by int (key or k).

That description doesn't match the code. Sorting with a how parameter of "k" sorts by the first component only, which is the string, and sorting with "v" as the how parameter takes both components into account.

I think it's a rather bad idea to pass a char[] to select the sorting criterion. It should be a comparator function, so that you need only one implementation in max_heapify.
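One way the comparator idea could look is sketched below. This is not the asker's class: the free functions heap_sort and max_heapify, and the comparators less_by_word and less_by_count_then_word, are hypothetical names. Note also that heapsize here means the number of elements in the heap, whereas the question's code uses it as the last valid index.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> PAIR;

// Sift vp[i] down until the first heapsize elements satisfy the max-heap
// property. less(a, b) must return true when a orders strictly before b.
template <typename Less>
void max_heapify(std::vector<PAIR> &vp, std::size_t i, std::size_t heapsize, Less less) {
    for (;;) {
        std::size_t left = 2 * i + 1;
        std::size_t right = left + 1;
        std::size_t largest = i;
        if (left < heapsize && less(vp[largest], vp[left])) largest = left;
        if (right < heapsize && less(vp[largest], vp[right])) largest = right;
        if (largest == i) return;
        std::swap(vp[i], vp[largest]);
        i = largest;
    }
}

template <typename Less>
void heap_sort(std::vector<PAIR> &vp, Less less) {
    std::size_t n = vp.size();
    // Build the heap bottom-up, starting from the last parent node.
    for (std::size_t i = n / 2; i-- > 0; ) max_heapify(vp, i, n, less);
    // Move the current maximum behind the shrinking heap, then repair the root.
    for (std::size_t end = n; end > 1; --end) {
        std::swap(vp[0], vp[end - 1]);
        max_heapify(vp, 0, end - 1, less);
    }
}

// Two example orderings; both are total orders as long as words are unique.
bool less_by_word(const PAIR &a, const PAIR &b) { return a.first < b.first; }
bool less_by_count_then_word(const PAIR &a, const PAIR &b) {
    if (a.second != b.second) return a.second < b.second;
    return a.first < b.first;
}
```

With this shape, heap_sort(vp, less_by_word) and heap_sort(vp, less_by_count_then_word) share one max_heapify, and a new criterion only needs a new comparator.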

My sorting by Key works fine however sort by value is off. I suspect I'm missing an if statement in max_heapify when sorting by value.

The problem is that a heap sort needs a total ordering, or it won't sort properly.

Your conditions

if(left <= heapsize && vp.at(left).second >= vp.at(i).second ) {
    if( vp.at(left).first >= vp.at(i).first ) {
        largest = left;
    } else {
        largest = i;
    }
} else {
    largest = i;
}

check whether both components of vp.at(left) (resp. vp.at(right)) are at least as large as the corresponding components of vp.at(i). That yields the product partial ordering, under which two pairs are in general not comparable, and in that case your max_heapify doesn't do anything.

For example, with <"a",3>, <"b",2> and <"c",1> in the positions i, left, right, in whichever order, your max_heapify sets largest to i.
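The incomparability can be checked directly. The sketch below restates the "v" test from the question as a standalone predicate; product_ge and incomparable are hypothetical names introduced only for this illustration.

```cpp
#include <string>
#include <utility>

typedef std::pair<std::string, int> PAIR;

// The test used in the question's "v" branch: a counts as "at least b"
// only when BOTH of its components are >= those of b.
bool product_ge(const PAIR &a, const PAIR &b) {
    return a.second >= b.second && a.first >= b.first;
}

// True when neither pair dominates the other under that test.
bool incomparable(const PAIR &a, const PAIR &b) {
    return !product_ge(a, b) && !product_ge(b, a);
}
```

For the three pairs above, incomparable returns true for every combination, so neither child is ever selected as largest, whichever pair sits at position i.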

If sorting by "v" is meant to sort on the int component first and, in case of a tie, take the string component into account, you need to distinguish the case vp.at(left).second > vp.at(i).second from equality (and likewise for right, of course). For example:

if(left <= heapsize && vp.at(left).second >= vp.at(i).second ) {
    if(vp.at(left).second > vp.at(i).second || vp.at(left).first >= vp.at(i).first ) {
        largest = left;
    } else {
        largest = i;
    }
} else {
    largest = i;
}
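The right child needs the same treatment as the left one. A possible way to keep the two tests identical is to move the comparison into a helper, as in the sketch below. It assumes the intended order is count first, then word, and keeps the question's convention that heapsize is the last valid index; ge_count_then_word and largest_of are hypothetical names.

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> PAIR;

// True when a is at least as large as b: a larger count wins outright,
// and the word decides only when the counts are equal.
bool ge_count_then_word(const PAIR &a, const PAIR &b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first >= b.first;
}

// Index of the largest element among i and its two children.
// heapsize is the last valid index, as in the question's code.
int largest_of(const std::vector<PAIR> &vp, int i, int heapsize) {
    int left = 2 * i + 1;
    int right = left + 1;
    int largest = i;
    if (left <= heapsize && ge_count_then_word(vp.at(left), vp.at(largest))) largest = left;
    if (right <= heapsize && ge_count_then_word(vp.at(right), vp.at(largest))) largest = right;
    return largest;
}
```

max_heapify could then set largest = largest_of(vp, i, heapsize) in the "v" branch and keep its existing swap-and-recurse step unchanged.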
