简体   繁体   中英

Is there an algorithm to convert massive hex string to bytes stream QUICKLY? asm/C/C++

Here is my current code:

//Input:hex string , 1234ABCDEEFF0505DDCC ....
//Output:BYTE stream
void HexString2Hex(/*IN*/ char* hexstring, /*OUT*/  BYTE* hexBuff)
{
    for (int i = 0; i < strlen(hexstring); i += 2)
    {
        BYTE val = 0;
        if (hexstring[i] < 'A')
            val += 0x10 * (hexstring[i] - '0');
        else
            val += 0xA0 + 0x10 * (hexstring[i] - 'A');

        if (hexstring[i+1] < 'A')
            val += hexstring[i + 1] - '0';
        else
            val += 0xA + hexstring[i + 1] - 'A';

        hexBuff[i / 2] = val;
    }
}

the problem is: when the input hex string is very big (such as 1000000 length), this function will take hundred seconds which is unacceptable for me. (CPU: i7-8700,3.2GHz. Memory:32G)

So, is there any alternative algorithms to do the work more quickly?

Thank you guys

Edit1: thank paddy's comment. I was too careless to notice that strlen( time:O(n)) was executed hundreds times. so my original function is O(n*n) which is so terrible.

updated code is below:

int len=strlen(hexstring);
for (int i = 0; i < len; i += 2)

And, for Emanuel P 's suggestion, I tried,it didn't seems good. the below is my code

map<string, BYTE> by_map;
//init table (map here)
char *xx1 = "0123456789ABCDEF";
    for (int i = 0; i < 16;i++)
    {
        for (int j = 0; j < 16; j++)
        {
            
            _tmp[0] = xx1[i];
            _tmp[1] = xx1[j];

            BYTE val = 0;
            if (xx1[i] < 'A')
                val += 0x10 * (xx1[i] - '0');
            else
                val += 0xA0 + 0x10 * (xx1[i] - 'A');

            if (xx1[j] < 'A')
                val += xx1[j] - '0';
            else
                val += 0xA + xx1[j] - 'A';

            by_map.insert(map<string, BYTE>::value_type(_tmp, val));
        }
    }
//search map
void HexString2Hex2(char* hexstring, BYTE* hexBuff)
{
    char _tmp[3] = { 0 };
    for (int i = 0; i < strlen(hexstring); i += 2)
    {
        _tmp[0] = hexstring[i];
        _tmp[1] = hexstring[i + 1];
        //DWORD dw = 0;
        //sscanf(_tmp, "%02X", &dw);
        hexBuff[i / 2] = by_map[_tmp];
    }
}

Edit2: In fact, my problem is solved when I fix the strlen bug. Below is my final code:

void HexString2Bytes(/*IN*/ char* hexstr, /*OUT*/  BYTE* dst)
{
    static uint_fast8_t LOOKUP[256];
    for (int i = 0; i < 10; i++)
    {
        LOOKUP['0' + i] = i;
    }
    for (int i = 0; i < 6; i++)
    {
        LOOKUP['A' + i] = 0xA + i;
    }

    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
            LOOKUP[hexstr[i + 1]];
        dst++;
    }
}

Btw, sincerely thank you guys. You are awesome! real researchers!

The standard way to create the most efficient code possible (at the cost of RAM/ROM) is to use look-up tables. Something like this:

static const uint_fast8_t LOOKUP [256] =
{
  ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
  ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
  ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
  ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
};

This sacrifices 256 bytes of read-only memory and in turn we don't have to do any form of arithmetic. The uint_fast8_t lets the compiler pick a larger type if it thinks that will help performance.

The full code would then be something like this:

void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint_fast8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };
  
  for(size_t i=0; hexstr[i]!='\0'; i+=2)
  {
    *dst = LOOKUP[ hexstr[i  ] ] << 4 |
           LOOKUP[ hexstr[i+1] ];
    dst++;
  }
}

This boils down to some ~10 instructions when tested on a x86_64 ( Godbolt ). Branch-free apart from the loop condition. Notably there's no error checking what so ever, so you'd have to ensure that the data is OK (and contains an even amount of nibbles) elsewhere.

Test code:

#include <stdio.h>
#include <stdint.h>

void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint_fast8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };
  
  for(size_t i=0; hexstr[i]!='\0'; i+=2)
  {
    *dst = LOOKUP[ hexstr[i  ] ] << 4 |
           LOOKUP[ hexstr[i+1] ];
    dst++;
  }
}

int main (void)
{
  const char hexstr[] = "DEADBEEFC0FFEE";
  uint8_t bytes [(sizeof hexstr - 1)/2];
  hexstr_to_bytes(hexstr, bytes);
  
  for(size_t i=0; i<sizeof bytes; i++)
  {
    printf("%.2X ", bytes[i]);
  }
}

when the input hex string is very big (such as 1000000 length)

Actually, 1 meg isn't all that long for today's computers.

If you need to be able to handle bigger strings (think 10s of gigabytes), or even just a LOT of 1 meg strings, you can play with the SSE functions. While it will work for more modest requirements, the added complexity may not be worth the performance gain.

I'm on Windows, so I'm building with MSVC 2019. x64, optimizations enabled, and arch:AVX2.

#define _CRT_SECURE_NO_WARNINGS
typedef unsigned char BYTE;

#include <stdio.h>
#include <memory.h>
#include <intrin.h>
#include <immintrin.h>
#include <stdint.h>

static const uint_fast8_t LOOKUP[256] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };

void HexString2Bytes(/*IN*/ const char* hexstr, /*OUT*/  BYTE* dst)
{
    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
            LOOKUP[hexstr[i + 1]];
        dst++;
    }
}

void HexString2BytesSSE(const char* ptrin, char *ptrout, size_t bytes)
{
    register const __m256i mmZeros = _mm256_set1_epi64x(0x3030303030303030ll);
    register const __m256i mmNines = _mm256_set1_epi64x(0x0909090909090909ll);
    register const __m256i mmSevens = _mm256_set1_epi64x(0x0707070707070707ll);
    register const __m256i mmShuffle = _mm256_set_epi64x(-1, 0x0f0d0b0907050301, -1, 0x0f0d0b0907050301);

    //============

    const __m256i* in = (const __m256i*)ptrin;
    __m128i* out = (__m128i *)ptrout;
    size_t lines = bytes / 32;

    for (size_t x = 0; x < lines; x++)
    {
        // Read 32 bytes
        __m256i AllBytes = _mm256_load_si256(in);

        // subtract '0' from every byte
        AllBytes = _mm256_sub_epi8(AllBytes, mmZeros);

        // Look for bytes that are 'A' or greater
        const __m256i mask = _mm256_cmpgt_epi8(AllBytes, mmNines);

        // Assign 7 to every byte greater than 'A'
        const __m256i maskedvalues = _mm256_and_si256(mask, mmSevens);

        // Subtract 7 from every byte greater than 'A'
        AllBytes = _mm256_sub_epi8(AllBytes, maskedvalues);

        // At this point, every byte in AllBytes represents a nibble, with
        // the even bytes being the upper nibble.

        // Make a copy and shift it left 4 bits to shift the nibble, plus
        // 8 bits to align the nibbles.
        __m256i UpperNibbles = _mm256_slli_epi64(AllBytes, 4 + 8);

        // Combine the nibbles
        AllBytes = _mm256_or_si256(AllBytes, UpperNibbles);

        // At this point, the odd numbered bytes in AllBytes is the output we want.

        // Move the bytes to be contiguous.  Note that you can only move
        // bytes within their 128bit lane.
        const __m256i ymm1 = _mm256_shuffle_epi8(AllBytes, mmShuffle);

        // Move the bytes from the upper lane down next to the lower.
        const __m256i ymm2 = _mm256_permute4x64_epi64(ymm1, 8);

        // Pull out the lowest 16 bytes
        *out = _mm256_extracti128_si256(ymm2, 0);

        in++;
        out++;
    }
}

int main()
{
    FILE* f = fopen("test.txt", "rb");

    fseek(f, 0, SEEK_END);
    size_t fsize = _ftelli64(f);
    rewind(f);

    // HexString2Bytes requires trailing null
    char* InBuff = (char* )_aligned_malloc(fsize + 1, 32);

    size_t t = fread(InBuff, 1, fsize, f);
    fclose(f);

    InBuff[fsize] = 0;

    char* OutBuff = (char*)malloc(fsize / 2);
    char* OutBuff2 = nullptr;

    putchar('A');

    for (int x = 0; x < 16; x++)
    {
        HexString2BytesSSE(InBuff, OutBuff, fsize);
#if 0
        if (OutBuff2 == nullptr)
        {
            OutBuff2 = (char*)malloc(fsize / 2);
        }
        HexString2Bytes(InBuff, (BYTE*)OutBuff2);
        if (memcmp(OutBuff, OutBuff2, fsize / 32) != 0)
            printf("oops\n");
        putchar('.');
#endif
    }

    putchar('B');

    if (OutBuff2 != nullptr)
        free(OutBuff2);
    free(OutBuff);
    _aligned_free(InBuff);
}

A couple of things to notice:

  • There is no error handling here. I don't check for out of memory, or file read errors. I don't even check for invalid characters in the input stream or lower case hex digits.
  • This code assumes that the size of the string is available without having to walk the string (ftelli64 in this case). If you need to walk the string byte-by-byte to get its length (a la strlen), you've probably lost the benefit here.
  • I've kept HexString2Bytes, so you can compare the outputs from my code vs yours to makes sure I'm converting correctly.
  • HexString2BytesSSE assumes the number of bytes in the string is evenly divisible by 32 (a questionable assumption). However, reworking it to call HexString2Bytes for the last (at most) 31 bytes is pretty trivial, and isn't going to impact performance much.
  • My test.txt is 2 gigs long, and this code runs it 16 times. That's about what I need for the differences to become readily visible.

For people who want to kibitz (because of course you do), here's the assembler output for the innermost loop along with some comments:

10F0  lea         rax,[rax+10h]   ; Output pointer
10F4  vmovdqu     ymm0,ymmword ptr [rcx] ; Input data
10F8  lea         rcx,[rcx+20h]

; Convert characters to nibbles
10FC  vpsubb      ymm2,ymm0,ymm4  ; Subtract 0x30 from all characters
1100  vpcmpgtb    ymm1,ymm2,ymm5  ; Find all characters 'A' and greater
1104  vpand       ymm0,ymm1,ymm6  ; Prepare to subtract 7 from all the 'A' 
1108  vpsubb      ymm2,ymm2,ymm0  ; Adjust all the 'A'

; Combine the nibbles to form bytes
110C  vpsllq      ymm1,ymm2,0Ch   ; Shift nibble up + align nibbles
1111  vpor        ymm0,ymm1,ymm2  ; Combine lower and upper nibbles

; Coalesce the odd numbered bytes
1115  vpshufb     ymm2,ymm0,ymm7

; Since vpshufb can't cross lanes, use vpermq to
; put all 16 bytes together
111A  vpermq      ymm3,ymm2,8

1120  vmovdqu     xmmword ptr [rax-10h],xmm3
1125  sub         rdx,1
1129  jne         main+0F0h (10F0h)

While your final code is almost certainly sufficient for your needs, I thought this might be interesting for you (or future SO users).

Maybe a switch is (marginally) faster

switch (hexchar) {
    default: /* error */; break;
    case '0': nibble = 0; break;
    case '1': nibble = 1; break;
    //...
    case 'F': case 'f': nibble = 15; break;
}

Boost already has a unhex algorithm implementation , you may compare the benchmark result as a baseline:

unhex ( "616263646566", out )  --> "abcdef"
unhex ( "3332", out )          --> "32"

If your string is very huge, then you may consider some parallel approach (using threads based framework like OpenMP, parallel STL)

Direct answer: I do not know the perfect Algorithm. x86 asm: Per Intel performance guide- Unwind the loop. Try the XLAT instruction(2 different tables needed)[eliminates conditional branches]. Modify the call interface to include explicit block length as the caller should know the string length[eliminate strlen() ]. Test the output array space for large enough: minor bug- remember that an odd length divided by two is rounded down. Therefore if odd length of source, initialize last byte of output (only). Change return to type int from void so you can pass error or success codes and length processed. Handle null length input. Advantage of doing in blocks is the practical limit becomes the OS file size limit. Try setting thread affinity. I suspect the limitation on performance ultimately is RAM to CPU bus, depending. If so, try to do data fetch and stores on largest bit width supported by RAM. Bench test with no optimize and higher levels if coding in c or c++. Test validity by doing reverse process followed by byte for byte compare (non-zero chance CRC-32 misses). Possible problem with PBYTE- use native c unsigned char type. There is a to be tested trade off between code size and L1 - number of cache misses vs how much loop unwound. In asm use cx/ecx/rcx to count down (rather than the usual count up and compare). SIMD is also possible assuming CPU support.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM