
How to concat byte arrays in C

My current concat function:

char* concat(char* a, int a_size,
             char* b, int b_size) {
    char* c = malloc(a_size + b_size);
    memcpy(c, a,            a_size);
    memcpy(c + a_size, b,   b_size);
    free(a);
    free(b);
    return c;
}

But this uses extra memory. Is it possible to append two byte arrays using realloc without allocating extra memory?

Like:

void append(char* a, int a_size, char* b, int b_size)
...

char* a = malloc(2);
char* b = malloc(2);

append(a, 2, b, 2);
//The size of a will be 4.

While Jean-François Fabre answered the stated question, I'd like to point out that you can manage such byte arrays better by using a structure:

typedef struct {
    size_t         max;  /* Number of chars allocated for */
    size_t         len;  /* Number of chars in use */
    unsigned char *data;
} bytearray;
#define  BYTEARRAY_INIT  { 0, 0, NULL }

void bytearray_init(bytearray *barray)
{
    barray->max  = 0;
    barray->len  = 0;
    barray->data = NULL;
}

void bytearray_free(bytearray *barray)
{
    free(barray->data);
    barray->max  = 0;
    barray->len  = 0;
    barray->data = NULL;
}

To declare an empty byte array, you can use either bytearray myba = BYTEARRAY_INIT; or bytearray myba; bytearray_init(&myba); . The two are equivalent.

When you no longer need the array, call bytearray_free(&myba); . Note that free(NULL) is safe and does nothing, so it is perfectly safe to free a bytearray that you have initialized, but not used.

To append to a bytearray :

int bytearray_append(bytearray *barray, const void *from, const size_t size)
{
    if (barray->len + size > barray->max) {
        const size_t  len = barray->len + size;
        size_t        max;
        void         *data;

        /* Example policy: */
        if (len < 8)
            max = 8; /* At least 8 chars, */
        else
        if (len < 4194304)
            max = (3*len) / 2;  /* grow by 50% up to 4,194,304 bytes, */
        else
            max = (len | 2097151) + 2097153 - 24; /* then pad to next multiple of 2,097,152 sans 24 bytes. */

        data = realloc(barray->data, max);
        if (!data) {
            /* Not enough memory available. Old data is still valid. */
            return -1;
        }

        barray->max  = max;
        barray->data = data;
    }

    /* Copy appended data; we know there is room now. */
    memmove(barray->data + barray->len, from, size);
    barray->len += size;

    return 0;
}

Since this function can, at least theoretically, fail to reallocate memory, it returns 0 if successful, and nonzero if it cannot reallocate enough memory.
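As an illustration of checking that return value, here is a minimal sketch that joins two buffers into one bytearray. To keep it short, bytearray_append() is a condensed stand-in that grows to the exact size needed (an assumption of this sketch, not the policy shown above), and join_bytes() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t         max;  /* Number of chars allocated for */
    size_t         len;  /* Number of chars in use */
    unsigned char *data;
} bytearray;
#define  BYTEARRAY_INIT  { 0, 0, NULL }

/* Condensed stand-in for bytearray_append(); exact-fit growth is an
   assumption made only to keep this sketch short. */
static int bytearray_append(bytearray *barray, const void *from, const size_t size)
{
    if (barray->len + size > barray->max) {
        const size_t  max  = barray->len + size;
        void         *data = realloc(barray->data, max);
        if (!data)
            return -1;  /* Old data is still valid. */
        barray->max  = max;
        barray->data = data;
    }
    if (size)
        memmove(barray->data + barray->len, from, size);
    barray->len += size;
    return 0;
}

/* Hypothetical helper: appends a, then b, to an empty bytearray.
   Returns 0 on success; on failure returns -1 and empties the result. */
static int join_bytes(bytearray *result,
                      const void *a, const size_t a_size,
                      const void *b, const size_t b_size)
{
    if (bytearray_append(result, a, a_size) ||
        bytearray_append(result, b, b_size)) {
        free(result->data);
        result->max  = 0;
        result->len  = 0;
        result->data = NULL;
        return -1;
    }
    return 0;
}
```

A caller would start from bytearray ba = BYTEARRAY_INIT; , call join_bytes(&ba, a, a_size, b, b_size) , test the result against zero, and eventually release the memory with free(ba.data) .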

There is no need for a malloc() call, because realloc(NULL, size) is exactly equivalent to malloc(size) .

The "growth policy" is a very debatable issue. You can just make max = barray->len + size , and be done with it. However, dynamic memory management functions are relatively slow, so in practice, we don't want to call realloc() for every small addition.

The above policy tries to do something better, but not too aggressive: it always allocates at least 8 characters, even if less is needed. Up to 4,194,304 characters, it allocates 50% extra. Above that, it rounds the allocation size up to the next multiple of 2,097,152, adds another 2,097,152, and subtracts 24. The reasoning behind this is complex, but it is more for illustration and understanding than anything else; it is definitely NOT "this is best, and this is what you should do too".

This policy ensures that each byte array allocates at most 4,194,304 = 2^22 unused characters. However, 2,097,152 = 2^21 is the size of a huge page on AMD64 (x86-64), and is a power-of-two multiple of a native page size on basically all architectures. It is also large enough to switch from so-called sbrk() allocation to memory mapping on basically all architectures that do that. It means that such huge allocations use a separate part of the heap for each, and the unused part is usually just virtual memory, not necessarily backed by any RAM, until accessed. As a result, this policy tends to work quite well for both very short byte arrays and very long byte arrays, on most architectures.

Of course, if you know (or measure!) the typical size of the byte arrays in typical workloads, you can optimize the growth policy for that, and get even better results.
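To make the example policy easier to experiment with, it can be pulled out into a pure function. The sketch below copies the three branches from bytearray_append() unchanged; the name grow_to() is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The example growth policy as a pure function: given the number of
   chars needed, return the number of chars to allocate.
   grow_to() is a hypothetical name used only in this sketch. */
static size_t grow_to(const size_t len)
{
    if (len < 8)
        return 8;                           /* At least 8 chars, */
    if (len < 4194304)
        return (3 * len) / 2;               /* 50% extra up to 4,194,304, */
    return (len | 2097151) + 2097153 - 24;  /* then 2,097,152-sized steps, less 24. */
}
```

With this, grow_to(100) yields 150 and grow_to(4194304) yields 8388584, that is, 4 × 2,097,152 minus 24. Swapping in a different policy then only means changing this one function.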

Finally, it uses memmove() instead of memcpy() , just in case someone wishes to repeat a part of the same byte array: memcpy() only works if the source and target areas do not overlap; memmove() works even in that case. (Be careful, though: if from points into the same array and realloc() had to move the data, from no longer points to valid memory.)
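When the bytes to repeat live in the same buffer, one way to stay safe even if realloc() moves the block is to pass an offset instead of a pointer, and compute the source address only after reallocating. The following is a minimal sketch on a plain malloc()ed buffer; repeat_part() and its parameters are hypothetical names, not part of the code above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: appends a copy of data[off .. off+size-1] to the
   end of a malloc()ed buffer holding len bytes.
   Returns the (possibly moved) buffer, or NULL if the range is invalid,
   nothing would be copied, or memory ran out; in those cases the
   original buffer is left untouched and still owned by the caller. */
static unsigned char *repeat_part(unsigned char *data, const size_t len,
                                  const size_t off, const size_t size)
{
    unsigned char *temp;

    if (!size || off > len || size > len - off)
        return NULL;

    temp = realloc(data, len + size);
    if (!temp)
        return NULL;

    /* The source address is derived from temp, never from the old pointer. */
    memmove(temp + len, temp + off, size);
    return temp;
}
```

For a 4-byte buffer holding "abcd", repeat_part(buf, 4, 1, 2) produces the 6 bytes "abcdbc".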


When using more advanced data structures, like hash tables, a variant of the above structure is often useful. (That is, this is much better in cases where you have lots of empty byte arrays.)

Instead of having a pointer to the data, the data is part of the structure itself, as a C99 flexible array member:

typedef struct {
    size_t         max;
    size_t         len;
    unsigned char  data[];
} bytearray;

You cannot declare a byte array itself (i.e. bytearray myba; will not work); you always declare a pointer to such a byte array: bytearray *myba = NULL; . A NULL pointer is simply treated the same as an empty byte array.

In particular, to see how many data items such an array has, you use an accessor function (also defined in the same header file as the data structure), rather than myba->len :

static inline size_t  bytearray_len(bytearray *const barray)
{
    return (barray) ? barray->len : 0;
}

static inline size_t  bytearray_max(bytearray *const barray)
{
    return (barray) ? barray->max : 0;
}

The (expression) ? (if-true) : (if-false) is a ternary operator. In this case, the first function is exactly equivalent to

static inline size_t  bytearray_len(bytearray *const barray)
{
    if (barray)
        return barray->len;
    else
        return 0;
}

If you wonder about the bytearray *const barray , remember that pointer declarations are read from right to left, with * as "a pointer to". So, it just means that barray is constant, a pointer to a byte array. That is, we may change the data it points to, but we won't change the pointer itself. Compilers can usually detect such stuff themselves, but it may help; the main point is however to remind us human programmers that the pointer itself is not to be changed. (Such changes would only be visible within the function itself.)

Since such arrays often need to be resized, the resizing is often put into a separate helper function:

bytearray *bytearray_resize(bytearray *const barray, const size_t len)
{
    bytearray *temp;

    if (!len) {
        free(barray);
        errno = 0;
        return NULL;
    }

    if (!barray) {
        temp = malloc(sizeof (bytearray) + len * sizeof barray->data[0]);
        if (!temp) {
            errno = ENOMEM;
            return NULL;
        }

        temp->max = len;
        temp->len = 0;
        return temp;
    }

    if (barray->len > len)
        barray->len = len;

    if (barray->max == len)
        return barray;

    temp = realloc(barray, sizeof (bytearray) + len * sizeof barray->data[0]);
    if (!temp) {
        free(barray);
        errno = ENOMEM;
        return NULL;
    }

    temp->max = len;
    return temp;
}

What does that errno = 0 do in there? The idea is that because resizing/reallocating a byte array may change the pointer, we return the new one. If the allocation fails, we return NULL with errno == ENOMEM , just like malloc() / realloc() do. However, if the desired new length is zero, the function saves memory by freeing the old byte array, if any, and returns NULL . Since that is not an error, we set errno to zero, so that it is easier for callers to check whether an error occurred. (If the function returns NULL , check errno . If errno is nonzero, an error occurred; you can use strerror(errno) to get a descriptive error message.)

You probably also noted the sizeof barray->data[0] , used even when barray is NULL. This is okay, because sizeof is not a function, but an operator: it does not access the right side at all, it only evaluates to the size of the thing the right side refers to. (You only need to use parentheses when the right side is a type.) This form is nice, because it lets a programmer change the type of the data member, without changing any other code.
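A small sketch of that point: the helper below computes the allocation size through a pointer that is deliberately NULL, which is fine because the operand of sizeof is not evaluated here. The names bytearray_bytes() and none are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t         max;
    size_t         len;
    unsigned char  data[];
} bytearray;

/* Hypothetical helper: number of bytes to allocate for a bytearray
   with room for n data items. */
static size_t bytearray_bytes(const size_t n)
{
    bytearray *none = NULL;

    /* sizeof only looks at the type of none->data[0];
       the NULL pointer is never dereferenced. */
    return sizeof (bytearray) + n * sizeof none->data[0];
}
```

If the type of the data member is later changed, say to a wider integer type, this expression keeps producing the right size without any edits.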

To append data to such a byte array, we probably want to be able to specify whether we anticipate further appends to the same array, or whether this is probably the final append, so that only the exact amount of memory needed is allocated. The function below takes a flag, exact , to choose between the two. Note that this function returns a pointer to the (modified) byte array:

bytearray *bytearray_append(bytearray *barray,
                            const void *from, const size_t size,
                            int exact)
{
    size_t  len = bytearray_len(barray) + size;

    if (exact) {
        barray = bytearray_resize(barray, len);
        if (!barray)
            return NULL; /* errno already set by bytearray_resize(). */

    } else
    if (bytearray_max(barray) < len) {

        /* Apply growth policy */
        if (len < 8)
            len = 8;
        else
        if (len < 4194304)
            len = (3 * len) / 2;
        else
            len = (len | 2097151) + 2097153 - 24;

        barray = bytearray_resize(barray, len);
        if (!barray)
            return NULL; /* errno already set by the bytearray_resize() call */
    }

    if (size) {
        memmove(barray->data + barray->len, from, size);
        barray->len += size;
    }

    return barray;
}

This time, we declared bytearray *barray , because we change where barray points to in the function. If the fourth parameter, exact , is nonzero, then the resulting byte array is exactly the size needed; otherwise the growth policy is applied.
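The calling convention for this variant is easy to get wrong, so here is a compact, self-contained sketch of it. Both functions are condensed stand-ins that always allocate the exact size (an assumption made to keep the example short), and build_abcd() is a hypothetical caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t         max;
    size_t         len;
    unsigned char  data[];
} bytearray;

/* Condensed stand-in for bytearray_resize(). */
static bytearray *bytearray_resize(bytearray *barray, const size_t len)
{
    const int  fresh = !barray;  /* Remember this before realloc(). */
    bytearray *temp;

    if (!len) {
        free(barray);
        errno = 0;
        return NULL;
    }

    temp = realloc(barray, sizeof (bytearray) + len * sizeof temp->data[0]);
    if (!temp) {
        free(barray);
        errno = ENOMEM;
        return NULL;
    }

    if (fresh)
        temp->len = 0;
    else
    if (temp->len > len)
        temp->len = len;

    temp->max = len;
    return temp;
}

/* Condensed stand-in for bytearray_append(): exact size only. */
static bytearray *bytearray_append(bytearray *barray,
                                   const void *from, const size_t size)
{
    const size_t  old = (barray) ? barray->len : 0;

    barray = bytearray_resize(barray, old + size);
    if (!barray)
        return NULL;  /* errno already set by bytearray_resize(). */

    if (size) {
        memmove(barray->data + old, from, size);
        barray->len = old + size;
    }
    return barray;
}

/* Hypothetical caller: the returned pointer always replaces the old one. */
static bytearray *build_abcd(void)
{
    bytearray *myba = NULL;  /* NULL is the empty byte array. */

    myba = bytearray_append(myba, "ab", 2);
    if (!myba)
        return NULL;

    return bytearray_append(myba, "cd", 2);
}
```

The important habit is myba = bytearray_append(myba, ...); : after the call the old pointer value must not be used, and a single free(myba) releases both the header and the data.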

Yes, since realloc preserves the existing contents of your buffer when the new size is bigger:

char* concat(char* a, size_t a_size,
             char* b, size_t b_size) {
    char* c = realloc(a, a_size + b_size);
    if (!c)
        return NULL;  // allocation failed: a and b are untouched and still owned by the caller
    memcpy(c + a_size, b,  b_size);  // dest is after "a" data, source is b with b_size
    free(b);
    return c;
}

c may be different from a (if the system cannot resize the original memory block in place to the new size), but if that's the case, the location pointed to by a will be freed (you must not free it yourself), and the original data will be "moved".

My advice is to warn the users of your function that the input buffers must be allocated using malloc (or calloc / realloc ); otherwise the behavior is undefined and it will likely crash badly.
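Because the block may move, an append() shaped like the one in the question cannot work with a plain char* a parameter: the caller would never see the new address. One possible shape, sketched below under the assumption that *a came from malloc() and that b does not point into *a, passes the pointer by address instead. The signature is hypothetical, and unlike concat() it does not free b.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical append(): grows *a (a_size bytes, allocated with malloc())
   by b_size bytes and copies b to the end.
   Returns 0 on success. On failure returns -1 and leaves *a unchanged. */
static int append(char **a, const size_t a_size,
                  const char *b, const size_t b_size)
{
    char *temp;

    if (!b_size)
        return 0;  /* Nothing to add. */

    temp = realloc(*a, a_size + b_size);
    if (!temp)
        return -1;

    memcpy(temp + a_size, b, b_size);
    *a = temp;  /* The block may have moved; update the caller's pointer. */
    return 0;
}
```

The call then becomes append(&a, 2, b, 2); , after which a refers to 4 bytes and b is still the caller's to free.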
