
How to improve this character count algorithm

I would like to write a function that performs a file analysis, returning in an array the count of each byte value from 0x00 to 0xFF, i.e. its frequency.

So, I wrote this prototype:

#include <stdio.h>
#include <stdlib.h>

// function prototype and other stuff

unsigned int counts[256] = {0}; // byte lookup table 
FILE * pFile;                   // file handle
long fsize;             // to store file size
unsigned char* buff;            // buffer
unsigned char* pbuf;            // later, mark buffer start
unsigned char* ebuf;            // later, mark buffer end

if ( ( pFile = fopen ( FNAME , "rb" ) ) == NULL )
{
    printf("Error");
    return -1;
}
else
{
    //get file size
    fseek (pFile , 0 , SEEK_END);
    fsize = ftell (pFile);
    rewind (pFile);

    // allocate space ( file size + 1 )
    // I want file contents as string for populating it
    // with pointers
    buff = (unsigned char*)malloc( fsize + 1 );
    if ( buff == NULL )
    {
        fclose(pFile);
        return -1;
    }

    // read whole file into memory
    fread(buff,1,fsize,pFile);

    // close file
    fclose(pFile);

    // mark end of buffer as string
    buff[fsize] = '\0';

    // set the pointers to beginning and end
    pbuf = &buff[0];
    ebuf = &buff[fsize];


    // Here the bottleneck:
    // iterate the entire file byte by byte,
    // counting bytes
    while ( pbuf != ebuf )
    {
        printf("%c\n",*pbuf);
        // update byte count
        counts[(*pbuf)]++;
        ++pbuf;
    }


    // free allocated memory
    free(buff);
    buff = NULL;

}
// printing stuff

But this way is slow. I have been looking for related algorithms, because I have seen HxD, for example, do it much faster.

I think reading several bytes at once could be a solution, but I don't know how to do it.

I need a hand, or advice.

Thanks.

Assuming your file isn't so large that it causes the system to start paging (because you are reading the whole thing into memory), your algorithm is as good as it gets for general-purpose data: O(n).

You'll need to remove the printf (as commented above); beyond that, if the performance still isn't good enough, the only way to improve it will be to look at the generated assembler: possibly the compiler isn't optimizing out all the dereferences (gcc should, though).
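For reference, a minimal sketch of the hot loop with the printf removed (same pbuf/ebuf/counts variables as in the question):

while ( pbuf != ebuf )
{
    // count only; no I/O inside the loop
    counts[*pbuf]++;
    ++pbuf;
}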

If you happen to know something about your dataset, then there are potential improvements: if it is a bitmap-type image that is likely to have blocks of identical bytes, it may be worth doing a little run-length encoding. There could also be some data sets where it is actually worth sorting the data first (although that reduces the general case down to O(n log n), so it's unlikely).

The RLE would look something like this (untested and probably sub-optimal, off-the-top-of-my-head disclaimer):

unsigned int  cur_count;
unsigned char cbuf;

while ( pbuf != ebuf )
{
    cbuf = *pbuf;       // byte value of the current run
    cur_count = 0;
    // scan to the end of the run of identical bytes
    while ( pbuf != ebuf && cbuf == *pbuf )
    {
        cur_count++;
        pbuf++;
    }
    counts[cbuf] += cur_count;
}

You can often trade an increase in program size for an improvement in speed and I think that could work nicely in your case. I would consider replacing your unsigned char* pointers with unsigned short* pointers and effectively processing two bytes at a time. That way, you have half the number of array index increments, half the number of calculations of offsets into your accumulator, half the number of accumulating additions and half the number of tests to see if your loop has finished.

Like I said, this will come at the expense of increased program size, so your accumulator array now needs 65536 elements instead of 256, but that is a small price to pay. I admit there is a tradeoff with legibility too.

At the end, you will have to run an index through all 65536 elements of the new, bigger accumulator, mask it with 0xff to get the first byte and shift right by 8 bits to get the second. Then you will have the two indexes corresponding to your original accumulator, and you can do the two accumulates into your original 256-element accumulator from there.

PS: Note that although you can handle nearly all of the file 2 bytes at a time, you will have to handle the last byte on its own if your file size is an odd number of bytes.
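A minimal sketch of that idea, assuming the buff/fsize/counts variables from the question (the wide_counts name and the fold loop are illustrative, not part of the original code):

// static avoids a 256 KB stack frame
static unsigned int wide_counts[65536];

unsigned short *wp   = (unsigned short *)buff;
unsigned short *wend = wp + fsize / 2;

// one table increment per 16-bit word covers two bytes at once
while ( wp != wend )
    wide_counts[*wp++]++;

// odd file size: handle the final byte on its own
if ( fsize & 1 )
    counts[buff[fsize - 1]]++;

// fold the 16-bit histogram back into the 256-entry byte histogram;
// both bytes of each word are folded, so endianness does not matter
for ( unsigned int i = 0; i < 65536; i++ )
{
    counts[i & 0xff] += wide_counts[i];   // low byte of the word
    counts[i >> 8]   += wide_counts[i];   // high byte of the word
}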

PPS: Note that this problem is readily parallelisable across, say, 4 threads, if you want your 3 spare CPU cores doing something more useful than twiddling their thumbs ;-)
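A sketch of that suggestion using POSIX threads; the chunk_t type, count_chunk function and NUM_THREADS are hypothetical names for illustration, and buff/fsize/counts are assumed from the question:

#include <pthread.h>

#define NUM_THREADS 4

// each thread counts its own slice into a private histogram,
// so no locking is needed while counting
typedef struct {
    const unsigned char *start;
    const unsigned char *end;
    unsigned int         local[256];
} chunk_t;

static void *count_chunk(void *arg)
{
    chunk_t *c = (chunk_t *)arg;
    for (const unsigned char *p = c->start; p != c->end; ++p)
        c->local[*p]++;
    return NULL;
}

/* driver: split buff into NUM_THREADS ranges, then merge */
pthread_t tid[NUM_THREADS];
chunk_t   chunks[NUM_THREADS] = {{0}};
long      step = fsize / NUM_THREADS;

for (int i = 0; i < NUM_THREADS; i++) {
    chunks[i].start = buff + i * step;
    chunks[i].end   = (i == NUM_THREADS - 1) ? buff + fsize
                                             : buff + (i + 1) * step;
    pthread_create(&tid[i], NULL, count_chunk, &chunks[i]);
}
for (int i = 0; i < NUM_THREADS; i++) {
    pthread_join(tid[i], NULL);
    for (int b = 0; b < 256; b++)          // 256 additions per thread
        counts[b] += chunks[i].local[b];
}

Because each thread only writes to its private histogram, the threads never contend, and the merge at the end costs just 256 additions per thread.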
