
Mistake in a C++ function that returns the most common character in a string. Multibyte characters?

While applying for a job, I was asked to solve a problem on HackerRank.com: write a function that accepts a string, counts the characters in it, and returns the most common character found. I wrote my solution, fixed the typos, and it works with my test cases and theirs, except that it fails "Test 7". Because it's an interview exercise, HackerRank doesn't tell me the failure details, just that it failed.

I spent far too much time trying to figure out why. I've triple-checked for off-by-one errors. I wrote the code for 8-bit chars, then tried accepting 16-bit values, which did not change the result. Here's my code. I cannot give the error, just that there is one.

Could it be multi-byte characters?

How can I create a testcase with a 2 byte or 3 byte character?

I put in some display dump code, and what comes out is exactly what you'd expect. I have the Xcode IDE on my Mac desktop. Any suggestions are welcome!

/*
 * Complete the function below.
 */
char func(string theString) {

    //  I wonder what I'm doing wrong. 256 doesn't work any better here.
    const int CHARS_RECOGED = 65536; // ie 0...65535 - even this isn't good enough to fix test case 7.

    unsigned long alphaHisto[CHARS_RECOGED];
    for (int count = 0; count < CHARS_RECOGED; count++ ) {
        alphaHisto[ count ] = 0;
    } // for int count...

    cout << "size: " << theString.size() << endl;

    for (int count  = 0; count < theString.size(); count++) {
        // unsigned char uChar = theString.at(count);  // .at() better protected than [] - and this works no differently...
        unsigned int uChar = std::char_traits<char>::to_int_type(theString.at(count));  // .at() better protected than []
        alphaHisto[ uChar ]++;
    } // for count...

    unsigned char mostCommon = -1;
    unsigned long totalMostCommon = 0;

    for (int count = 0; count < CHARS_RECOGED; count++ ) {
        if (alphaHisto[ count ] > totalMostCommon){
            mostCommon = count;
            totalMostCommon = alphaHisto[ count ];
        } // if alphahisto
    } // for int count...

    for (int count = 0; count < CHARS_RECOGED; count++ ) {
        if (alphaHisto[ count ] > 0){
            cout << (char)count << "  " << count << " " << alphaHisto[ count ] << endl;
        } // if alphaHisto...
    } // for int count...

    return (char) mostCommon;
}
// Please provide additional test cases:
// Input         Return
// thequickbrownfoxjumpsoverthelazydog  o
// the quick brown fox jumps over the lazy dog " "
// theQuickbrownFoxjumpsoverthelazydog  o
// the Quick BroWn Fox JuMpS OVER  THe lazy dog " "
// the_Quick_BroWn_Fox.JuMpS.OVER..THe.LAZY.DOG "."

If the test is meant to be taken seriously, the charset should be specified. Without that, it's probably safe to assume that one byte is one char. As a side note, replacing 256 with 65536 is far from enough to support charsets with multibyte chars. Even without multibyte chars, you could replace 256 with 1 << CHAR_BIT (from <climits>), because a "byte" may have more than 8 bits.
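To try a multibyte test case locally, one option is to spell out the bytes of the encoded characters by hand. The sketch below assumes the input is UTF-8, which the problem statement does not confirm, and the names `kCharValues` and `makeUtf8Sample` are made up for illustration. It also shows a histogram size derived from `UCHAR_MAX` rather than a hard-coded 256.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Number of distinct values a single char can hold.
// This is 256 when CHAR_BIT == 8, and always equals 1 << CHAR_BIT.
constexpr std::size_t kCharValues = static_cast<std::size_t>(UCHAR_MAX) + 1;

// Hypothetical helper: builds a test input whose bytes are the UTF-8
// encodings (an assumption) of 'a' (1 byte), U+00E9 (2 bytes: C3 A9)
// and U+20AC (3 bytes: E2 82 AC).
std::string makeUtf8Sample() {
    return std::string("a") + "\xC3\xA9" + "\xE2\x82\xAC";
}
```

`makeUtf8Sample().size()` is 6 even though the text contains only three characters, because `std::string` counts chars (bytes), not code points. A byte-wise histogram therefore counts each byte of a multibyte character separately.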

I see a more important problem with
unsigned int uChar = std::char_traits<char>::to_int_type(theString.at(count));
First, it's unnecessarily complex:
unsigned int uChar = theString.at(count);
should be enough.

Now remember that std::string::at returns a char, and your variable is an unsigned int. Whether plain char behaves like signed char or like unsigned char, when you don't state it explicitly, depends on the compiler. Char values between 0 and 127 are stored in the target variable unchanged, but that's only half of the value range. If char is unsigned, 128-255 works fine too. If char is signed, the other half is -128 to -1, and those values won't map to unsigned 128-255 when the target variable is wider than a char. With a 4-byte integer you'll get huge values that aren't valid indices for your array, and that is a problem. Solution: use unsigned char, not unsigned int.

unsigned char uChar = theString.at(count);
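The difference between the two conversions can be shown with a small sketch. It uses `signed char` explicitly so that the result does not depend on how the compiler treats plain `char`. The function names are hypothetical and exist only for this illustration.

```cpp
#include <climits>

// Widening a negative signed char straight to unsigned int wraps modulo
// 2^N, where N is the bit width of unsigned int, giving a huge value.
unsigned int widenDirectly(signed char c) {
    return static_cast<unsigned int>(c);
}

// Going through unsigned char first wraps modulo 2^CHAR_BIT instead,
// so the result always fits in the range 0..UCHAR_MAX.
unsigned int widenViaUnsignedChar(signed char c) {
    return static_cast<unsigned char>(c);
}
```

For the value -23, `widenDirectly` yields `UINT_MAX - 22` (4294967273 with a 32-bit unsigned int), while `widenViaUnsignedChar` yields `UCHAR_MAX - 22` (233 with 8-bit chars), which is a valid index into a 256-entry histogram.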

Another thing:
for (int count = 0; count < theString.size(); count++)
theString.size() returns a size_t, which may differ from int in size and/or signedness, so very long strings could cause problems. Accordingly, the character counts could be size_t too, instead of unsigned long.
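Putting the points above together, a byte-wise version might look like the sketch below. It is not a confirmed fix for "Test 7". The name `mostCommonChar`, the tie-breaking rule (lowest byte value wins), and the result for an empty string (`'\0'`) are all assumptions, since the problem statement is not available.

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical rewrite: counts bytes, not characters of any particular charset.
char mostCommonChar(const std::string& text) {
    // One zero-initialized counter per possible unsigned char value.
    std::array<std::size_t, static_cast<std::size_t>(UCHAR_MAX) + 1> histogram{};

    // The range-based loop avoids comparing an int index with a size_t.
    for (char c : text) {
        ++histogram[static_cast<unsigned char>(c)];
    }

    // Strict '>' keeps the lowest byte value when counts are tied (assumption).
    std::size_t best = 0;
    for (std::size_t value = 1; value < histogram.size(); ++value) {
        if (histogram[value] > histogram[best]) {
            best = value;
        }
    }
    return static_cast<char>(static_cast<unsigned char>(best));
}
```

With this sketch, bytes above 127 land in slots 128-255 regardless of whether plain `char` is signed, and the counters have the same type as `std::string::size()`.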

And the least likely source of problems: if this runs on a machine that doesn't use two's complement, it'll probably fail spectacularly (although I haven't thought it through in detail).
