
`u8string_view` into a `char` array without violating strict-aliasing?

Premise

  • I have a blob of binary data in memory, represented as a char* (maybe read from a file, or transmitted over the network).
  • I know that it contains a UTF-8-encoded text field of a certain length at a certain offset.

Question

How can I (safely and portably) get a u8string_view to represent the contents of this text field?

Motivation

The motivation for passing the field to downstream code as a u8string_view is:

  • It very clearly communicates that the text field is UTF-8-encoded, unlike string_view.
  • It avoids the cost (likely free-store allocation + copying) of returning it as u8string.

What I tried

The naive way to do this would be:

char* data = ...;
size_t field_offset = ...;
size_t field_length = ...;

char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);

However, if I understand the C++ strict-aliasing rules correctly, this is undefined behavior because it accesses the contents of the char* buffer via the char8_t* pointer returned by reinterpret_cast, and char8_t is not an aliasing type.

Is that true?

Is there a way to do this safely?

The strict aliasing rule is violated when you access an object through a glvalue that does not have an acceptable type. The types that may be used to access any object are char, unsigned char and std::byte; char8_t is not one of them.

First consider a well-defined case:

char* data = reinterpret_cast<char*>(new char8_t[10]{});  // the storage really holds char8_t objects
size_t field_offset = 0;
size_t field_length = 10;
char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);
field[0] + field[1];  // reads char8_t objects through char8_t glvalues

There is no UB here. You create an array of char8_t and then access the elements of that array.

Now what happens if the object in the memory referenced by data was created by another program? According to the standard this is UB, because the object was not created in one of the ways the standard specifies for creating objects.

But the fact that the standard does not yet support your code is not a problem here. This code is supported by all compilers. If it were not, nothing would work: you could not even make the simplest system call, because most communication between a program and any kernel goes through arrays of char. So as long as, inside your program, you access the memory between data+field_offset and data+field_offset+field_length only through glvalues of type char8_t, your code will work as expected.

This same problem occasionally occurs in other contexts too, for example when using shared memory.

A trick for creating objects from the bits in "raw" memory without allocating memory is to create a local object by memcpy, and then create a dynamic copy of that local object over the "raw" memory. Example:

char* begin_raw = data + field_offset;
char8_t* last {};
// Assumes field_length > 0.
for(std::size_t i = 0; i < field_length; i++) {
    char* current = begin_raw + i;
    char8_t local {};
    std::memcpy(&local, current, sizeof local);  // copy the byte's value out
    last = new (current) char8_t(local);         // create a char8_t with that value in place
}
char8_t* begin = last - (field_length - 1);
std::u8string_view field(begin, field_length);

Before you object that you don't want to copy, notice that the end result causes no change to the representation of the "raw" memory. The compiler can notice this too, and can compile the entire loop down to zero instructions (in my tests, GCC and Clang achieve this with -O2). All we have done is satisfy the object lifetime rules of the language by creating dynamic objects in the memory.
