Is there an algorithm to quickly convert a massive hex string to a byte stream? asm/C/C++

Question

Here is my current code:

// Input:  hex string, e.g. 1234ABCDEEFF0505DDCC...
// Output: BYTE stream
void HexString2Hex(/*IN*/ char* hexstring, /*OUT*/  BYTE* hexBuff)
{
    for (int i = 0; i < strlen(hexstring); i += 2)
    {
        BYTE val = 0;
        if (hexstring[i] < 'A')
            val += 0x10 * (hexstring[i] - '0');
        else
            val += 0xA0 + 0x10 * (hexstring[i] - 'A');

        if (hexstring[i+1] < 'A')
            val += hexstring[i + 1] - '0';
        else
            val += 0xA + hexstring[i + 1] - 'A';

        hexBuff[i / 2] = val;
    }
}

The problem is that when the input hex string is very large (for example, 1,000,000 characters), this function takes about a hundred seconds, which is unacceptable for me. (CPU: i7-8700, 3.2 GHz; memory: 32 GB)

So, are there any alternative algorithms that do the work more quickly?

Thank you, guys.

Edit1: Thanks to paddy's comment. I was too careless to notice that strlen (which takes O(n) time) was executed on every loop iteration, so my original function is O(n*n), which is terrible.

The updated code is below:

int len=strlen(hexstring);
for (int i = 0; i < len; i += 2)
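
To make the fix concrete, here is a minimal sketch of the original routine with the length computed once. The names hex_digit and hex_to_bytes_len and the size_t return value are assumptions added for illustration, not part of the original code. Like the original, it assumes uppercase digits and does no validation; a trailing unpaired character is ignored.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char BYTE;

/* Convert one uppercase hex digit to its value (no validation). */
static BYTE hex_digit(char c)
{
    return (BYTE)(c < 'A' ? c - '0' : c - 'A' + 0xA);
}

/* Hypothetical rewrite of HexString2Hex: strlen runs once, so the
   loop is O(n) instead of O(n*n). Returns the number of bytes written. */
size_t hex_to_bytes_len(const char *hexstring, BYTE *hexBuff)
{
    const size_t len = strlen(hexstring);   /* hoisted out of the loop */
    for (size_t i = 0; i + 1 < len; i += 2)
    {
        hexBuff[i / 2] = (BYTE)(hex_digit(hexstring[i]) << 4 |
                                hex_digit(hexstring[i + 1]));
    }
    return len / 2;
}
```

With the length hoisted, each character is visited a constant number of times, so the cost grows linearly with the input.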

As for Emanuel P's suggestion, I tried it, and it didn't seem to help. Below is my code:

map<string, BYTE> by_map;

//init table (map here); call once before HexString2Hex2
void InitMap()
{
    const char *xx1 = "0123456789ABCDEF";
    char _tmp[3] = { 0 };   // two hex digits plus the terminator
    for (int i = 0; i < 16; i++)
    {
        for (int j = 0; j < 16; j++)
        {
            _tmp[0] = xx1[i];
            _tmp[1] = xx1[j];

            BYTE val = 0;
            if (xx1[i] < 'A')
                val += 0x10 * (xx1[i] - '0');
            else
                val += 0xA0 + 0x10 * (xx1[i] - 'A');

            if (xx1[j] < 'A')
                val += xx1[j] - '0';
            else
                val += 0xA + xx1[j] - 'A';

            by_map.insert(map<string, BYTE>::value_type(_tmp, val));
        }
    }
}

//search map
void HexString2Hex2(char* hexstring, BYTE* hexBuff)
{
    char _tmp[3] = { 0 };
    for (int i = 0; i < strlen(hexstring); i += 2)
    {
        _tmp[0] = hexstring[i];
        _tmp[1] = hexstring[i + 1];
        //DWORD dw = 0;
        //sscanf(_tmp, "%02X", &dw);
        hexBuff[i / 2] = by_map[_tmp];
    }
}

Edit2: In fact, my problem was solved once I fixed the strlen bug. Below is my final code:

void HexString2Bytes(/*IN*/ char* hexstr, /*OUT*/  BYTE* dst)
{
    static uint_fast8_t LOOKUP[256];
    for (int i = 0; i < 10; i++)
    {
        LOOKUP['0' + i] = i;
    }
    for (int i = 0; i < 6; i++)
    {
        LOOKUP['A' + i] = 0xA + i;
    }

    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
            LOOKUP[hexstr[i + 1]];
        dst++;
    }
}

Btw, sincerely thank you guys. You are awesome! Real researchers!

Answer 1

The standard way to create the most efficient code possible (at the cost of RAM/ROM) is to use look-up tables. Something like this:

static const uint_fast8_t LOOKUP [256] =
{
  ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
  ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
  ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
  ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
};

This sacrifices 256 bytes of read-only memory (more if uint_fast8_t is wider than one byte), and in return we don't have to do any arithmetic on the characters. The uint_fast8_t lets the compiler pick a larger type if it thinks that will help performance.

The full code would then be something like this:

void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint_fast8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };
  
  for(size_t i=0; hexstr[i]!='\0'; i+=2)
  {
    *dst = LOOKUP[ hexstr[i  ] ] << 4 |
           LOOKUP[ hexstr[i+1] ];
    dst++;
  }
}

This boils down to some ~10 instructions when tested on x86_64 (Godbolt). It is branch-free apart from the loop condition. Notably, there's no error checking whatsoever, so you'd have to ensure elsewhere that the data is OK (and contains an even number of nibbles).

Test code:

#include <stdio.h>
#include <stdint.h>

void hexstr_to_bytes (const char* restrict hexstr, uint8_t* restrict dst)
{
  static const uint_fast8_t LOOKUP [256] =
  {
    ['0'] = 0x0, ['1'] = 0x1, ['2'] = 0x2, ['3'] = 0x3,
    ['4'] = 0x4, ['5'] = 0x5, ['6'] = 0x6, ['7'] = 0x7,
    ['8'] = 0x8, ['9'] = 0x9, ['A'] = 0xA, ['B'] = 0xB,
    ['C'] = 0xC, ['D'] = 0xD, ['E'] = 0xE, ['F'] = 0xF,
  };
  
  for(size_t i=0; hexstr[i]!='\0'; i+=2)
  {
    *dst = LOOKUP[ hexstr[i  ] ] << 4 |
           LOOKUP[ hexstr[i+1] ];
    dst++;
  }
}

int main (void)
{
  const char hexstr[] = "DEADBEEFC0FFEE";
  uint8_t bytes [(sizeof hexstr - 1)/2];
  hexstr_to_bytes(hexstr, bytes);
  
  for(size_t i=0; i<sizeof bytes; i++)
  {
    printf("%.2X ", bytes[i]);
  }
}
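
If the input cannot be trusted, the same table idea can carry the error check. The following is only a sketch under stated assumptions: the names hex_table, HEX_INVALID and hexstr_to_bytes_checked are hypothetical, the table is filled at run time (so it also compiles as C++, which lacks designated array initializers), lowercase digits are accepted, and the lazy initialization is not thread-safe. It adds one branch per output byte, so it gives up some of the branch-free property described above.

```c
#include <stddef.h>
#include <stdint.h>

#define HEX_INVALID 0xFF   /* sentinel for "not a hex digit" (assumed) */

static uint8_t hex_table[256];
static int hex_table_ready = 0;

/* Fill the table once: digits and both letter cases map to 0..15,
   everything else maps to HEX_INVALID. Not thread-safe. */
static void hex_table_init(void)
{
    for (int i = 0; i < 256; i++) hex_table[i] = HEX_INVALID;
    for (int i = 0; i < 10; i++)  hex_table['0' + i] = (uint8_t)i;
    for (int i = 0; i < 6; i++)
    {
        hex_table['A' + i] = (uint8_t)(0xA + i);
        hex_table['a' + i] = (uint8_t)(0xA + i);
    }
    hex_table_ready = 1;
}

/* Returns the number of bytes written, or -1 on an invalid character
   or an odd number of digits. */
long hexstr_to_bytes_checked(const char *hexstr, uint8_t *dst)
{
    if (!hex_table_ready) hex_table_init();

    size_t n = 0;
    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        /* Cast to unsigned char so bytes >= 0x80 cannot index negatively. */
        const uint8_t hi = hex_table[(unsigned char)hexstr[i]];
        /* For an odd length this reads the terminator, which is invalid. */
        const uint8_t lo = hex_table[(unsigned char)hexstr[i + 1]];
        if ((hi | lo) > 0x0F) return -1;
        dst[n++] = (uint8_t)(hi << 4 | lo);
    }
    return (long)n;
}
```

Because the terminator maps to the sentinel, an odd-length string is rejected without reading past the end of the buffer.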

Answer 2

when the input hex string is very big (such as 1000000 length)

Actually, 1 meg isn't all that long for today's computers.

If you need to be able to handle bigger strings (think tens of gigabytes), or even just a LOT of 1 meg strings, you can play with the SSE functions. While this will also work for more modest requirements, the added complexity may not be worth the performance gain.

I'm on Windows, so I'm building with MSVC 2019: x64, optimizations enabled, and /arch:AVX2.

#define _CRT_SECURE_NO_WARNINGS
typedef unsigned char BYTE;

#include <stdio.h>
#include <memory.h>
#include <intrin.h>
#include <immintrin.h>
#include <stdint.h>

static const uint_fast8_t LOOKUP[256] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };

void HexString2Bytes(/*IN*/ const char* hexstr, /*OUT*/  BYTE* dst)
{
    for (size_t i = 0; hexstr[i] != '\0'; i += 2)
    {
        *dst = LOOKUP[hexstr[i]] << 4 |
            LOOKUP[hexstr[i + 1]];
        dst++;
    }
}

void HexString2BytesSSE(const char* ptrin, char *ptrout, size_t bytes)
{
    register const __m256i mmZeros = _mm256_set1_epi64x(0x3030303030303030ll);
    register const __m256i mmNines = _mm256_set1_epi64x(0x0909090909090909ll);
    register const __m256i mmSevens = _mm256_set1_epi64x(0x0707070707070707ll);
    register const __m256i mmShuffle = _mm256_set_epi64x(-1, 0x0f0d0b0907050301, -1, 0x0f0d0b0907050301);

    //============

    const __m256i* in = (const __m256i*)ptrin;
    __m128i* out = (__m128i *)ptrout;
    size_t lines = bytes / 32;

    for (size_t x = 0; x < lines; x++)
    {
        // Read 32 bytes
        __m256i AllBytes = _mm256_load_si256(in);

        // subtract '0' from every byte
        AllBytes = _mm256_sub_epi8(AllBytes, mmZeros);

        // Look for bytes that are 'A' or greater
        const __m256i mask = _mm256_cmpgt_epi8(AllBytes, mmNines);

        // Assign 7 to every byte that is 'A' or greater
        const __m256i maskedvalues = _mm256_and_si256(mask, mmSevens);

        // Subtract 7 from every byte that is 'A' or greater
        AllBytes = _mm256_sub_epi8(AllBytes, maskedvalues);

        // At this point, every byte in AllBytes represents a nibble, with
        // the even bytes being the upper nibble.

        // Make a copy and shift it left 4 bits to shift the nibble, plus
        // 8 bits to align the nibbles.
        __m256i UpperNibbles = _mm256_slli_epi64(AllBytes, 4 + 8);

        // Combine the nibbles
        AllBytes = _mm256_or_si256(AllBytes, UpperNibbles);

        // At this point, the odd numbered bytes in AllBytes is the output we want.

        // Move the bytes to be contiguous.  Note that you can only move
        // bytes within their 128bit lane.
        const __m256i ymm1 = _mm256_shuffle_epi8(AllBytes, mmShuffle);

        // Move the bytes from the upper lane down next to the lower.
        const __m256i ymm2 = _mm256_permute4x64_epi64(ymm1, 8);

        // Pull out the lowest 16 bytes
        *out = _mm256_extracti128_si256(ymm2, 0);

        in++;
        out++;
    }
}

int main()
{
    FILE* f = fopen("test.txt", "rb");

    fseek(f, 0, SEEK_END);
    size_t fsize = _ftelli64(f);
    rewind(f);

    // HexString2Bytes requires trailing null
    char* InBuff = (char* )_aligned_malloc(fsize + 1, 32);

    size_t t = fread(InBuff, 1, fsize, f);
    fclose(f);

    InBuff[fsize] = 0;

    char* OutBuff = (char*)malloc(fsize / 2);
    char* OutBuff2 = nullptr;

    putchar('A');

    for (int x = 0; x < 16; x++)
    {
        HexString2BytesSSE(InBuff, OutBuff, fsize);
#if 0
        if (OutBuff2 == nullptr)
        {
            OutBuff2 = (char*)malloc(fsize / 2);
        }
        HexString2Bytes(InBuff, (BYTE*)OutBuff2);
        // Compare every byte the SSE routine wrote: 16 output bytes per 32-byte line
        if (memcmp(OutBuff, OutBuff2, (fsize / 32) * 16) != 0)
            printf("oops\n");
        putchar('.');
#endif
    }

    putchar('B');

    if (OutBuff2 != nullptr)
        free(OutBuff2);
    free(OutBuff);
    _aligned_free(InBuff);
}

A couple of things to notice:

There is no error handling here. I don't check for out of memory or file read errors. I don't even check for invalid characters in the input stream or lower case hex digits.
This code assumes that the size of the string is available without having to walk the string (_ftelli64 in this case). If you need to walk the string byte-by-byte to get its length (a la strlen), you've probably lost the benefit here.
I've kept HexString2Bytes, so you can compare the outputs from my code vs yours to make sure I'm converting correctly.
HexString2BytesSSE assumes the number of bytes in the string is evenly divisible by 32 (a questionable assumption). However, reworking it to call HexString2Bytes for the last (at most) 31 bytes is pretty trivial, and isn't going to impact performance much.
My test.txt is 2 gigs long, and this code runs it 16 times. That's about what I need for the differences to become readily visible.

For people who want to kibitz (because of course you do), here's the assembler output for the innermost loop along with some comments:

10F0  lea         rax,[rax+10h]   ; Output pointer
10F4  vmovdqu     ymm0,ymmword ptr [rcx] ; Input data
10F8  lea         rcx,[rcx+20h]

; Convert characters to nibbles
10FC  vpsubb      ymm2,ymm0,ymm4  ; Subtract 0x30 from all characters
1100  vpcmpgtb    ymm1,ymm2,ymm5  ; Find all characters 'A' and greater
1104  vpand       ymm0,ymm1,ymm6  ; Prepare to subtract 7 from all the 'A' 
1108  vpsubb      ymm2,ymm2,ymm0  ; Adjust all the 'A'

; Combine the nibbles to form bytes
110C  vpsllq      ymm1,ymm2,0Ch   ; Shift nibble up + align nibbles
1111  vpor        ymm0,ymm1,ymm2  ; Combine lower and upper nibbles

; Coalesce the odd numbered bytes
1115  vpshufb     ymm2,ymm0,ymm7

; Since vpshufb can't cross lanes, use vpermq to
; put all 16 bytes together
111A  vpermq      ymm3,ymm2,8

1120  vmovdqu     xmmword ptr [rax-10h],xmm3
1125  sub         rdx,1
1129  jne         main+0F0h (10F0h)

While your final code is almost certainly sufficient for your needs, I thought this might be interesting for you (or future SO users).
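
For readers without AVX2, the same idea (convert every character to a nibble in parallel, then pair the nibbles up) can be sketched in portable C on 64-bit words. This is a different technique from the answer's intrinsics (often called SWAR), not a translation of it. The names hex8_to_bytes4 and hex_to_bytes_blocks are hypothetical; the sketch assumes a little-endian host and valid hex digits, and it accepts either letter case. The wrapper also shows the leftover handling mentioned above: whole blocks first, then a scalar loop for the tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Convert 8 hex characters to 4 bytes using plain 64-bit arithmetic.
   Assumes a little-endian host and valid hex digits (either case). */
static void hex8_to_bytes4(const char *in, uint8_t *out)
{
    uint64_t v;
    memcpy(&v, in, 8);                       /* alignment-safe load */

    /* Per byte: low four bits, plus 9 if bit 6 is set (a letter). */
    const uint64_t low    = v & 0x0F0F0F0F0F0F0F0FULL;
    const uint64_t letter = (v >> 6) & 0x0101010101010101ULL;
    const uint64_t nib    = low + letter * 9;   /* each byte is 0..15 */

    /* Pair up nibbles: even bytes of t hold the finished output bytes. */
    const uint64_t t = (nib << 4) | (nib >> 8);

    out[0] = (uint8_t)(t);
    out[1] = (uint8_t)(t >> 16);
    out[2] = (uint8_t)(t >> 32);
    out[3] = (uint8_t)(t >> 48);
}

/* Hypothetical wrapper: bulk-convert in 8-character blocks, then finish
   the remaining (fewer than 8) characters one pair at a time. */
size_t hex_to_bytes_blocks(const char *in, size_t len, uint8_t *out)
{
    size_t i = 0;
    for (; i + 8 <= len; i += 8)
        hex8_to_bytes4(in + i, out + i / 2);

    for (; i + 2 <= len; i += 2)             /* scalar tail */
    {
        const unsigned hi = (unsigned char)in[i];
        const unsigned lo = (unsigned char)in[i + 1];
        out[i / 2] = (uint8_t)((((hi & 0xF) + 9 * (hi >> 6)) << 4) |
                               ((lo & 0xF) + 9 * (lo >> 6)));
    }
    return len / 2;
}
```

The per-byte formula works because ASCII digits have bit 6 clear and their value in the low four bits, while ASCII letters of either case have bit 6 set and the values 1 to 6 in the low four bits.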

Answer 3

Maybe a switch is (marginally) faster

switch (hexchar) {
    default: /* error */; break;
    case '0': nibble = 0; break;
    case '1': nibble = 1; break;
    //...
    case 'F': case 'f': nibble = 15; break;
}
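
Filled in, the switch might look like the sketch below. The function name hex_nibble and the -1 error value are assumptions for illustration. Whether this beats a table depends on what the compiler generates for the switch, so it is worth measuring rather than assuming.

```c
/* Returns 0..15 for a hex digit of either case, or -1 otherwise. */
int hex_nibble(char hexchar)
{
    switch (hexchar) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        return hexchar - '0';   /* C guarantees '0'..'9' are contiguous */
    case 'A': case 'a': return 10;
    case 'B': case 'b': return 11;
    case 'C': case 'c': return 12;
    case 'D': case 'd': return 13;
    case 'E': case 'e': return 14;
    case 'F': case 'f': return 15;
    default:            return -1;   /* not a hex digit */
    }
}
```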

Answer 4

Boost already has an unhex algorithm implementation; you can use its benchmark result as a baseline for comparison:

unhex ( "616263646566", out )  --> "abcdef"
unhex ( "3332", out )          --> "32"

If your string is very large, you may consider a parallel approach (using a thread-based framework such as OpenMP or the parallel STL).
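
As a rough illustration of the OpenMP route, the sketch below splits the conversion across threads. The names nibble and hex_to_bytes_parallel are hypothetical, and the input is assumed to be valid hex with a known length. It has to be built with OpenMP enabled (for example -fopenmp on GCC/Clang or /openmp on MSVC); without that the pragma is ignored and the loop simply runs serially. Whether threads help at all depends on memory bandwidth, so treat it as something to benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Same nibble formula for either letter case; assumes valid input. */
static inline uint8_t nibble(unsigned char c)
{
    return (uint8_t)((c & 0xF) + 9 * (c >> 6));
}

/* Each output byte depends on exactly two input characters, so the
   iterations are independent and can be split across threads. */
void hex_to_bytes_parallel(const char *in, size_t len, uint8_t *out)
{
    const long long pairs = (long long)(len / 2);
    long long k;   /* signed index, for older OpenMP implementations */

    #pragma omp parallel for schedule(static)
    for (k = 0; k < pairs; k++)
    {
        out[k] = (uint8_t)(nibble((unsigned char)in[2 * k]) << 4 |
                           nibble((unsigned char)in[2 * k + 1]));
    }
}
```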

Answer 5

Direct answer: I do not know the perfect algorithm.

x86 asm: per the Intel performance guide, unroll the loop.
Try the XLAT instruction (two different tables needed); this eliminates conditional branches.
Modify the call interface to include an explicit block length, since the caller should know the string length; this eliminates strlen().
Check that the output array is large enough. Minor bug: remember that an odd length divided by two is rounded down, so if the source length is odd, initialize the last byte of the output (only).
Change the return type from void to int so you can pass back error or success codes and the length processed.
Handle null-length input.
An advantage of working in blocks is that the practical limit becomes the OS file size limit.
Try setting thread affinity.
I suspect the limit on performance is ultimately the RAM-to-CPU bus. If so, try to do data fetches and stores at the largest bit width supported by the RAM.
If coding in C or C++, benchmark with no optimization and at higher optimization levels.
Test validity by running the reverse process followed by a byte-for-byte compare (a CRC-32 has a non-zero chance of missing an error).
Possible problem with PBYTE: use the native C unsigned char type.
There is a trade-off, to be tested, between code size and L1 cache misses as the loop is unrolled further.
In asm, use cx/ecx/rcx to count down (rather than the usual count up and compare).
SIMD is also possible, assuming CPU support.
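
A few of these interface suggestions (explicit length, an int status return, an output-size check, and a reverse-process validity test) can be sketched as follows. All names here (hex_decode, hex_encode, hex_round_trip_ok) and the error codes are hypothetical. Note one deliberate difference from the advice above: this sketch rejects odd-length input instead of padding the last byte. Returning int also caps the output at INT_MAX bytes per call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical interface: explicit lengths in, status code out.
   Returns bytes written, -1 for bad arguments or odd length,
   -2 if the output buffer is too small. Digits are not validated. */
int hex_decode(const char *src, size_t src_len,
               unsigned char *dst, size_t dst_cap)
{
    if (src == NULL || dst == NULL || (src_len & 1)) return -1;
    if (dst_cap < src_len / 2) return -2;

    for (size_t i = 0; i < src_len; i += 2)
    {
        const unsigned hi = (unsigned char)src[i];
        const unsigned lo = (unsigned char)src[i + 1];
        dst[i / 2] = (unsigned char)((((hi & 0xF) + 9 * (hi >> 6)) << 4) |
                                     ((lo & 0xF) + 9 * (lo >> 6)));
    }
    return (int)(src_len / 2);
}

/* Reverse process: write 2*len uppercase hex characters (no terminator). */
void hex_encode(const unsigned char *src, size_t len, char *dst)
{
    static const char digits[] = "0123456789ABCDEF";
    for (size_t i = 0; i < len; i++)
    {
        dst[2 * i]     = digits[src[i] >> 4];
        dst[2 * i + 1] = digits[src[i] & 0xF];
    }
}

/* Round-trip check: decode, re-encode, compare byte for byte.
   Only meaningful for uppercase input. Returns 1 on a match. */
int hex_round_trip_ok(const char *hex, size_t len,
                      unsigned char *bytes, char *scratch)
{
    if (hex_decode(hex, len, bytes, len / 2) < 0) return 0;
    hex_encode(bytes, len / 2, scratch);
    return memcmp(hex, scratch, len) == 0;
}
```

The caller supplies both buffers, so the routines allocate nothing and can be applied block by block to a large file.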

5 answers

Answer 1: 5, 2021-04-12 08:06:31
Answer 2: 2, 2021-04-19 20:53:19
Answer 3: 1, 2021-04-12 07:34:08
Answer 4: 1, 2021-04-12 08:02:16
Answer 5: 0, 2021-04-16 02:18:52
