Here is my current code:
// Input:  hex string, e.g. "1234ABCDEEFF0505DDCC" ...
// Output: byte stream
void HexString2Hex(/*IN*/ char* hexstring, /*OUT*/ BYTE* hexBuff)
{
    for (int i = 0; i < strlen(hexstring); i += 2)
    {
        BYTE val = 0;
        if (hexstring[i] < 'A')
            val += 0x10 * (hexstring[i] - '0');
        else
            val += 0xA0 + 0x10 * (hexstring[i] - 'A');
        if (hexstring[i + 1] < 'A')
            val += hexstring[i + 1] - '0';
        else
            val += 0xA + hexstring[i + 1] - 'A';
        hexBuff[i / 2] = val;
    }
}
The problem is: when the input hex string is very large (e.g. 1,000,000 characters), this function takes hundreds of seconds, which is unacceptable for me (CPU: i7-8700 @ 3.2 GHz, memory: 32 GB).
So, is there an alternative algorithm that does the work more quickly?
Thank you.
Edit 1: Thanks to paddy's comment. I was too careless to notice that strlen (time: O(n)) was executed hundreds of thousands of times, so my original function was O(n²), which is terrible.
The updated code is below:
int len = strlen(hexstring);
for (int i = 0; i < len; i += 2)
As for Emanuel P's suggestion, I tried it, but it didn't seem to help. My code is below:
map<string, BYTE> by_map;

// init table (a map here)
char _tmp[3] = { 0 };
const char* xx1 = "0123456789ABCDEF";
for (int i = 0; i < 16; i++)
{
    for (int j = 0; j < 16; j++)
    {
        _tmp[0] = xx1[i];
        _tmp[1] = xx1[j];
        BYTE val = 0;
        if (xx1[i] < 'A')
            val += 0x10 * (xx1[i] - '0');
        else
            val += 0xA0 + 0x10 * (xx1[i] - 'A');
        if (xx1[j] < 'A')
            val += xx1[j] - '0';
        else
            val += 0xA + xx1[j] - 'A';
        by_map.insert(map<string, BYTE>::value_type(_tmp, val));
    }
}
// search the map
void HexString2Hex2(char* hexstring, BYTE* hexBuff)
{
    char _tmp[3] = { 0 };
    int len = strlen(hexstring);
    for (int i = 0; i < len; i += 2)
    {
        _tmp[0] = hexstring[i];
        _tmp[1] = hexstring[i + 1];
        //DWORD dw = 0;
        //sscanf(_tmp, "%02X", &dw);
        hexBuff[i / 2] = by_map[_tmp];
    }
}
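For reference, a two-character table lookup does not need std::map at all. Every map access above constructs a std::string and walks an ordered tree; a flat array indexed directly by the two characters avoids both costs. A sketch (the names `PAIR`, `InitPairTable`, and `HexString2Hex2` as written here are mine, not from the thread; like the original, it does no error checking):

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* 64 KB table: PAIR[hi][lo] gives the byte value for two hex digits.
   Built once; pairs that are not valid hex map to 0. */
static BYTE PAIR[256][256];

static void InitPairTable(void)
{
    const char* digits = "0123456789ABCDEF";
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            PAIR[(unsigned char)digits[i]][(unsigned char)digits[j]] =
                (BYTE)(i << 4 | j);
}

void HexString2Hex2(const char* hexstring, BYTE* hexBuff)
{
    for (size_t i = 0; hexstring[i] != '\0'; i += 2)
        hexBuff[i / 2] = PAIR[(unsigned char)hexstring[i]]
                             [(unsigned char)hexstring[i + 1]];
}
```

Each output byte is then a single array load with no allocation, which is why the per-character table in the final code below behaves so differently from the map.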
Edit 2: In fact, my problem was solved once I fixed the strlen bug. Below is my final code:
void HexString2Bytes(/*IN*/ char* hexstr, /*OUT*/ BYTE* dst)
{
    static uint_fast8_t LOOKUP[256];
    for (int i = 0; i < 10; i++)
    {
        LOOKUP['0' + i] = i;
    }
    for (int i = 0; i < 6; i++)
    {
        LOOKUP['A' + i] = 0xA + i;
    }
    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
               LOOKUP[hexstr[i + 1]];
        dst++;
    }
}
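One small refinement of the code above: the initialization loops rerun on every call even though LOOKUP is static. Hoisting them into a one-time init function avoids that redundant work. A sketch (`InitLookup` is an invented name; the conversion loop itself is unchanged):

```c
#include <stdint.h>
#include <stddef.h>

typedef unsigned char BYTE;

static uint_fast8_t LOOKUP[256];

/* Fill the table once, at program start, instead of on every call. */
static void InitLookup(void)
{
    for (int i = 0; i < 10; i++)
        LOOKUP['0' + i] = (uint_fast8_t)i;
    for (int i = 0; i < 6; i++)
        LOOKUP['A' + i] = (uint_fast8_t)(0xA + i);
}

void HexString2Bytes(const char* hexstr, BYTE* dst)
{
    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst++ = (BYTE)(LOOKUP[(unsigned char)hexstr[i]] << 4 |
                        LOOKUP[(unsigned char)hexstr[i + 1]]);
    }
}
```

For a single huge conversion the difference is negligible (16 stores), but it matters if the function is called many times on short strings.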
By the way, sincere thanks to you all. You are awesome, real researchers!
The standard way to create the most efficient code possible (at the cost of RAM/ROM) is to use look-up tables. Something like this:
static const uint_fast8_t LOOKUP [256] =
{
['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
};
This sacrifices 256 bytes of read-only memory, and in turn we don't have to do any form of arithmetic. The uint_fast8_t lets the compiler pick a larger type if it thinks that will help performance.
The full code would then be something like this:
void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
static const uint_fast8_t LOOKUP [256] =
{
['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
};
for(size_t i=0; hexstr[i]!='\0'; i+=2)
{
*dst = LOOKUP[ hexstr[i ] ] << 4 |
LOOKUP[ hexstr[i+1] ];
dst++;
}
}
This boils down to some ~10 instructions when tested on x86_64 (Godbolt), branch-free apart from the loop condition. Notably, there's no error checking whatsoever, so you'd have to ensure elsewhere that the data is valid (and contains an even number of nibbles).
Test code:
#include <stdio.h>
#include <stdint.h>
void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
static const uint_fast8_t LOOKUP [256] =
{
['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
};
for(size_t i=0; hexstr[i]!='\0'; i+=2)
{
*dst = LOOKUP[ hexstr[i ] ] << 4 |
LOOKUP[ hexstr[i+1] ];
dst++;
}
}
int main (void)
{
const char hexstr[] = "DEADBEEFC0FFEE";
uint8_t bytes [(sizeof hexstr - 1)/2];
hexstr_to_bytes(hexstr, bytes);
for(size_t i=0; i<sizeof bytes; i++)
{
printf("%.2X ", bytes[i]);
}
}
when the input hex string is very big (such as 1000000 length)
Actually, 1 meg isn't all that long for today's computers.
If you need to be able to handle bigger strings (think 10s of gigabytes), or even just a LOT of 1 meg strings, you can play with the SSE functions. While it will work for more modest requirements, the added complexity may not be worth the performance gain.
I'm on Windows, so I'm building with MSVC 2019. x64, optimizations enabled, and arch:AVX2.
#define _CRT_SECURE_NO_WARNINGS
typedef unsigned char BYTE;
#include <stdio.h>
#include <memory.h>
#include <intrin.h>
#include <immintrin.h>
#include <stdint.h>
static const uint_fast8_t LOOKUP[256] = {
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
void HexString2Bytes(/*IN*/ const char* hexstr, /*OUT*/ BYTE* dst)
{
for (size_t i = 0; hexstr[i] != '\0'; i += 2)
{
*dst = LOOKUP[hexstr[i]] << 4 |
LOOKUP[hexstr[i + 1]];
dst++;
}
}
void HexString2BytesSSE(const char* ptrin, char *ptrout, size_t bytes)
{
register const __m256i mmZeros = _mm256_set1_epi64x(0x3030303030303030ll);
register const __m256i mmNines = _mm256_set1_epi64x(0x0909090909090909ll);
register const __m256i mmSevens = _mm256_set1_epi64x(0x0707070707070707ll);
register const __m256i mmShuffle = _mm256_set_epi64x(-1, 0x0f0d0b0907050301, -1, 0x0f0d0b0907050301);
//============
const __m256i* in = (const __m256i*)ptrin;
__m128i* out = (__m128i *)ptrout;
size_t lines = bytes / 32;
for (size_t x = 0; x < lines; x++)
{
// Read 32 bytes
__m256i AllBytes = _mm256_load_si256(in);
// subtract '0' from every byte
AllBytes = _mm256_sub_epi8(AllBytes, mmZeros);
// Look for bytes that are 'A' or greater
const __m256i mask = _mm256_cmpgt_epi8(AllBytes, mmNines);
// Assign 7 to every byte greater than 'A'
const __m256i maskedvalues = _mm256_and_si256(mask, mmSevens);
// Subtract 7 from every byte greater than 'A'
AllBytes = _mm256_sub_epi8(AllBytes, maskedvalues);
// At this point, every byte in AllBytes represents a nibble, with
// the even bytes being the upper nibble.
// Make a copy and shift it left 4 bits to shift the nibble, plus
// 8 bits to align the nibbles.
__m256i UpperNibbles = _mm256_slli_epi64(AllBytes, 4 + 8);
// Combine the nibbles
AllBytes = _mm256_or_si256(AllBytes, UpperNibbles);
// At this point, the odd numbered bytes in AllBytes is the output we want.
// Move the bytes to be contiguous. Note that you can only move
// bytes within their 128bit lane.
const __m256i ymm1 = _mm256_shuffle_epi8(AllBytes, mmShuffle);
// Move the bytes from the upper lane down next to the lower.
const __m256i ymm2 = _mm256_permute4x64_epi64(ymm1, 8);
// Pull out the lowest 16 bytes
*out = _mm256_extracti128_si256(ymm2, 0);
in++;
out++;
}
}
int main()
{
FILE* f = fopen("test.txt", "rb");
fseek(f, 0, SEEK_END);
size_t fsize = _ftelli64(f);
rewind(f);
// HexString2Bytes requires trailing null
char* InBuff = (char* )_aligned_malloc(fsize + 1, 32);
size_t t = fread(InBuff, 1, fsize, f);
fclose(f);
InBuff[fsize] = 0;
char* OutBuff = (char*)malloc(fsize / 2);
char* OutBuff2 = nullptr;
putchar('A');
for (int x = 0; x < 16; x++)
{
HexString2BytesSSE(InBuff, OutBuff, fsize);
#if 0
if (OutBuff2 == nullptr)
{
OutBuff2 = (char*)malloc(fsize / 2);
}
HexString2Bytes(InBuff, (BYTE*)OutBuff2);
if (memcmp(OutBuff, OutBuff2, fsize / 2) != 0)
    printf("oops\n");
putchar('.');
#endif
}
putchar('B');
if (OutBuff2 != nullptr)
free(OutBuff2);
free(OutBuff);
_aligned_free(InBuff);
}
A couple of things to notice: for people who want to kibitz (because of course you do), here's the assembler output for the innermost loop, along with some comments:
10F0 lea rax,[rax+10h] ; Output pointer
10F4 vmovdqu ymm0,ymmword ptr [rcx] ; Input data
10F8 lea rcx,[rcx+20h]
; Convert characters to nibbles
10FC vpsubb ymm2,ymm0,ymm4 ; Subtract 0x30 from all characters
1100 vpcmpgtb ymm1,ymm2,ymm5 ; Find all characters 'A' and greater
1104 vpand ymm0,ymm1,ymm6 ; Prepare to subtract 7 from all the 'A'
1108 vpsubb ymm2,ymm2,ymm0 ; Adjust all the 'A'
; Combine the nibbles to form bytes
110C vpsllq ymm1,ymm2,0Ch ; Shift nibble up + align nibbles
1111 vpor ymm0,ymm1,ymm2 ; Combine lower and upper nibbles
; Coalesce the odd numbered bytes
1115 vpshufb ymm2,ymm0,ymm7
; Since vpshufb can't cross lanes, use vpermq to
; put all 16 bytes together
111A vpermq ymm3,ymm2,8
1120 vmovdqu xmmword ptr [rax-10h],xmm3
1125 sub rdx,1
1129 jne main+0F0h (10F0h)
While your final code is almost certainly sufficient for your needs, I thought this might be interesting for you (or future SO users).
Maybe a switch is (marginally) faster:
switch (hexchar) {
default: /* error */; break;
case '0': nibble = 0; break;
case '1': nibble = 1; break;
//...
case 'F': case 'f': nibble = 15; break;
}
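The answer shows only the inner switch; one possible way to wrap it into a complete routine is sketched below (the function names and the int return convention are mine, not the answer's):

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Returns the nibble value, or -1 if the character is not a hex digit. */
static int HexNibble(char c)
{
    switch (c) {
    default:  return -1;
    case '0': return 0;   case '1': return 1;
    case '2': return 2;   case '3': return 3;
    case '4': return 4;   case '5': return 5;
    case '6': return 6;   case '7': return 7;
    case '8': return 8;   case '9': return 9;
    case 'A': case 'a': return 10;
    case 'B': case 'b': return 11;
    case 'C': case 'c': return 12;
    case 'D': case 'd': return 13;
    case 'E': case 'e': return 14;
    case 'F': case 'f': return 15;
    }
}

/* Returns 0 on success, -1 on the first invalid digit. */
int HexString2BytesSwitch(const char* hexstr, BYTE* dst)
{
    for (size_t i = 0; hexstr[i] != '\0'; i += 2) {
        int hi = HexNibble(hexstr[i]);
        int lo = HexNibble(hexstr[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        *dst++ = (BYTE)(hi << 4 | lo);
    }
    return 0;
}
```

A dense switch like this typically compiles to a jump table or a small lookup, so in practice it lands close to the explicit table approaches above, with the bonus of built-in error detection.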
Boost already has an unhex algorithm implementation; you may compare its benchmark result as a baseline:
unhex ( "616263646566", out ) --> "abcdef"
unhex ( "3332", out ) --> "32"
If your string is very large, then you might consider a parallel approach (using a thread-based framework like OpenMP, or the parallel STL).
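Since each output byte depends only on two input characters, the lookup-table loop parallelizes trivially. A sketch with OpenMP (the function name and the explicit-length interface are mine; build with -fopenmp or /openmp, otherwise the pragma is ignored and the loop runs sequentially):

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned char BYTE;

/* 'len' is the string length, supplied by the caller: calling strlen in
   the loop condition would reintroduce the O(n^2) problem and also
   prevent parallelization. A signed loop variable keeps MSVC's OpenMP
   implementation happy. */
void HexString2BytesOMP(const char* hexstr, size_t len, BYTE* dst,
                        const uint_fast8_t LOOKUP[256])
{
    #pragma omp parallel for
    for (long long i = 0; i < (long long)len; i += 2)
        dst[i / 2] = (BYTE)(LOOKUP[(unsigned char)hexstr[i]] << 4 |
                            LOOKUP[(unsigned char)hexstr[i + 1]]);
}
```

Whether this wins depends on the input size: for megabyte-scale strings the work is likely memory-bound, so expect diminishing returns beyond a few threads.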
Direct answer: I do not know the perfect algorithm, but here are some suggestions:
- x86 asm: per the Intel performance guide, unroll the loop. Try the XLAT instruction (two different tables are needed); it eliminates conditional branches.
- Modify the call interface to include an explicit block length, since the caller should know the string length; this eliminates strlen().
- Test that the output array is large enough. Minor bug: remember that an odd length divided by two is rounded down, so if the source length is odd, initialize the last byte of the output (only).
- Change the return type from void to int so you can pass error or success codes and the length processed. Handle zero-length input.
- The advantage of working in blocks is that the practical limit becomes the OS file-size limit.
- Try setting thread affinity. I suspect the ultimate limit on performance is the RAM-to-CPU bus; if so, try to do data fetches and stores at the largest bit width the RAM supports.
- Benchmark with no optimization and at higher levels if coding in C or C++. Test validity by doing the reverse process followed by a byte-for-byte compare (there is a non-zero chance that CRC-32 misses an error).
- Possible problem with PBYTE: use the native C unsigned char type.
- There is a trade-off, to be tested, between code size and L1 cache misses versus how far the loop is unrolled.
- In asm, use cx/ecx/rcx to count down (rather than the usual count up and compare).
- SIMD is also possible, assuming CPU support.
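Several of those points (explicit length, int return for error codes, odd-length handling) fit into one small interface change. A sketch with invented names (`HexString2BytesN`, `Nibble`) illustrating the convention, not taken verbatim from the answer:

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Returns the nibble value, or -1 for a non-hex character. */
static int Nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Returns the number of bytes written, or -1 on error.
   An odd trailing digit becomes the high nibble of the last byte. */
long HexString2BytesN(const char* src, size_t srclen, BYTE* dst)
{
    if (src == NULL || dst == NULL)
        return -1;
    size_t out = 0;
    size_t even = srclen & ~(size_t)1;   /* round length down to even */
    for (size_t i = 0; i < even; i += 2) {
        int hi = Nibble(src[i]), lo = Nibble(src[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        dst[out++] = (BYTE)(hi << 4 | lo);
    }
    if (srclen & 1) {                    /* odd length: lone last digit */
        int hi = Nibble(src[even]);
        if (hi < 0)
            return -1;
        dst[out++] = (BYTE)(hi << 4);    /* initialize last byte only */
    }
    return (long)out;
}
```

The caller passes the length it already knows, a zero-length input harmlessly returns 0, and the odd-length case is handled explicitly rather than silently dropped.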