Dealing with undefined behavior when using reinterpret_cast in a memory mapping

To avoid copying large amounts of data, it is desirable to mmap a binary file and process the raw data directly. This approach has several advantages, including delegating the paging to the operating system. Unfortunately, it is my understanding that the obvious implementation leads to Undefined Behavior (UB).

My use case is as follows: create a binary file that contains a header identifying the format and providing metadata (in this case simply the number of double values). The remainder of the file contains raw binary values which I wish to process without first copying the file into a local buffer (that's why I'm memory-mapping the file in the first place). The program below is a full (if simple) example (I believe that all places marked as UB[X] lead to UB):

// C++ Standard Library
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <numeric>
#include <stdexcept>

// POSIX Library (for mmap)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

constexpr char MAGIC[8] = {"1234567"};

struct Header {
  char          magic[sizeof(MAGIC)] = {'\0'};
  std::uint64_t size                 = {0};
};
static_assert(sizeof(Header) == 16, "Header size should be 16 bytes");
static_assert(alignof(Header) == 8, "Header alignment should be 8 bytes");

void write_binary_data(const char* filename) {
  Header header;
  std::copy_n(MAGIC, sizeof(MAGIC), header.magic);
  header.size = 100u;

  std::ofstream fp(filename, std::ios::out | std::ios::binary);
  fp.write(reinterpret_cast<const char*>(&header), sizeof(Header));
  for (auto k = 0u; k < header.size; ++k) {
    double value = static_cast<double>(k);
    fp.write(reinterpret_cast<const char*>(&value), sizeof(double));
  }
}

double read_binary_data(const char* filename) {
  // POSIX mmap API
  auto        fp = ::open(filename, O_RDONLY);
  struct stat sb;
  ::fstat(fp, &sb);
  auto data = static_cast<char*>(
      ::mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fp, 0));
  ::close(fp);
  // end of POSIX mmap API (all error handling omitted)

  // UB1
  const auto header = reinterpret_cast<const Header*>(data);

  // UB2
  if (!std::equal(MAGIC, MAGIC + sizeof(MAGIC), header->magic)) {
    throw std::runtime_error("Magic word mismatch");
  }

  // UB3
  auto beg = reinterpret_cast<const double*>(data + sizeof(Header));

  // UB4
  auto end = std::next(beg, header->size);

  // UB5
  auto sum = std::accumulate(beg, end, double{0});

  ::munmap(data, sb.st_size);

  return sum;
}

int main() {
  const double expected = 4950.0;
  write_binary_data("test-data.bin");

  if (auto sum = read_binary_data("test-data.bin"); sum == expected) {
    std::cout << "as expected, sum is: " << sum << "\n";
  } else {
    std::cout << "error\n";
  }
}

Compile and run as:

$ clang++ example.cpp -std=c++17 -Wall -Wextra -O3 -march=native
$ ./a.out
as expected, sum is: 4950
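
The mmap calls in read_binary_data above omit all error handling. As a minimal sketch of what the checks could look like, assuming a POSIX system, the hypothetical RAII wrapper below (MappedFile is an illustrative name, not part of the original program) maps a whole file read-only and unmaps it on destruction. It does nothing about the aliasing and lifetime questions raised here; it only covers the system-call failures.

```cpp
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <system_error>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not in the original program): read-only mapping of a
// whole file, with every POSIX call checked.
class MappedFile {
 public:
  explicit MappedFile(const char* filename) {
    const int fd = ::open(filename, O_RDONLY);
    if (fd == -1) {
      throw std::system_error(errno, std::generic_category(), "open");
    }
    struct stat sb;
    if (::fstat(fd, &sb) == -1) {
      const int err = errno;
      ::close(fd);
      throw std::system_error(err, std::generic_category(), "fstat");
    }
    size_ = static_cast<std::size_t>(sb.st_size);
    if (size_ == 0) {
      // mmap rejects a zero length, so report an empty file explicitly.
      ::close(fd);
      throw std::runtime_error("file is empty");
    }
    void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
    const int err = errno;
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) {
      throw std::system_error(err, std::generic_category(), "mmap");
    }
    data_ = static_cast<const char*>(p);
  }

  ~MappedFile() {
    if (data_ != nullptr) {
      ::munmap(const_cast<char*>(data_), size_);
    }
  }

  MappedFile(const MappedFile&) = delete;
  MappedFile& operator=(const MappedFile&) = delete;

  const char* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  const char* data_ = nullptr;
  std::size_t size_ = 0;
};
```

With such a wrapper, read_binary_data could start from `MappedFile file(filename);` and compare `file.size()` against `sizeof(Header)` before touching the header.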

In real life, the actual binary format is much more complicated but retains the same properties: fundamental types stored in a binary file with proper alignment.

My question is: how do you deal with this use case?

I have found many answers that I perceive as conflicting.

Some answers state unequivocally that one should build the objects locally. This may very well be the case, but it severely complicates any array-oriented operations.

Comments elsewhere seem to agree on the UB nature of this construct, but there are some disagreements.

The wording in cppreference is, at least to me, confusing. I would have interpreted it as "what I'm doing is perfectly legal". Specifically this paragraph:

Whenever an attempt is made to read or modify the stored value of an object of type DynamicType through a glvalue of type AliasedType, the behavior is undefined unless one of the following is true:

  • AliasedType and DynamicType are similar.
  • AliasedType is the (possibly cv-qualified) signed or unsigned variant of DynamicType.
  • AliasedType is std::byte (since C++17), char, or unsigned char: this permits examination of the object representation of any object as an array of bytes.

It may be that C++17 offers some hope with std::launder, or that I'll have to wait until C++20 for something along the lines of std::bit_cast.

In the meantime, how do you deal with this issue?

Link to on-line demo: https://onlinegdb.com/rk_xnlRUV

Simplified example in C

Is my understanding correct that the following C program does not exhibit Undefined Behavior? I understand that pointer casting through a char buffer does not participate in the strict aliasing rules.

#include <stdint.h>
#include <stdio.h>

struct Header {
  char     magic[8];
  uint64_t size;
};

static void process(const char* buffer) {
  const struct Header* h = (const struct Header*)(buffer);
  printf("reading %llu values from buffer\n", (unsigned long long)h->size);
}

int main(int argc, char* argv[]) {
  if (argc != 2) {
    return 1;
  }
  // In practice, I'd pass the buffer through mmap
  FILE* fp = fopen(argv[1], "rb");
  char  buffer[sizeof(struct Header)];
  fread(buffer, sizeof(struct Header), 1, fp);
  fclose(fp);
  process(buffer);
}

I can compile this C code and run it on the file created by the original C++ program, and it works as expected:

$ clang struct.c -std=c11 -Wall -Wextra -O3 -march=native
$ ./a.out test-data.bin 
reading 100 values from buffer

std::launder solves the problem with strict aliasing, but not with object lifetime.

std::bit_cast makes a copy (it's basically a wrapper for std::memcpy) and doesn't work for copying from a range of bytes.
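
For illustration, copying out of a byte range with std::memcpy could look like the sketch below. It assumes the 16-byte header layout from the question; `load` and `sum_values` are hypothetical names introduced here, and the bounds checks are an addition that the original program does not have. Each value is copied into a local object before use, so no object of type Header or double is ever accessed inside the mapped bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>

// Same layout as the Header in the question.
struct Header {
  char          magic[8];
  std::uint64_t size;
};

// Hypothetical helper: copy sizeof(T) bytes starting at bytes + offset into a
// local T. The caller must guarantee that the range lies inside the buffer.
template <typename T>
T load(const char* bytes, std::size_t offset) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  T value;
  std::memcpy(&value, bytes + offset, sizeof(T));
  return value;
}

// Sum the doubles that follow the header. `length` is the size of the mapped
// region in bytes (for example sb.st_size in the question's program).
double sum_values(const char* bytes, std::size_t length) {
  if (length < sizeof(Header)) {
    throw std::runtime_error("buffer too small for header");
  }
  const Header header = load<Header>(bytes, 0);
  if ((length - sizeof(Header)) / sizeof(double) < header.size) {
    throw std::runtime_error("buffer too small for declared value count");
  }
  const std::size_t count = static_cast<std::size_t>(header.size);
  double sum = 0.0;
  for (std::size_t k = 0; k < count; ++k) {
    sum += load<double>(bytes, sizeof(Header) + k * sizeof(double));
  }
  return sum;
}
```

The per-element copy is what makes array-oriented code awkward: there is no `const double*` range to hand to std::accumulate, only a loop over offsets.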

There is no tool in standard C++ to reinterpret mapped memory without copying. Such a tool has been proposed: std::bless. Until and unless such changes are adopted into the standard, you'll have to either hope that the UB doesn't break anything, take the potential†† performance hit and copy, or write the program in C.

While not ideal, this is not necessarily as bad as it sounds. You're already restricting portability by using mmap, and if your target system / compiler promises that it is OK to reinterpret mmapped memory (perhaps with laundering), then there should be no problem. That said, I don't know whether, say, GCC on Linux gives such a guarantee.

†† The compiler may optimise std::memcpy away, so there might not be any performance hit involved. There's a handy function in this SO answer which was observed to be optimised away, but does initiate object lifetime following the language rules. It does have a limitation: the mapped memory must be writable (as it creates objects in the memory, and in a non-optimised build it might do an actual copy).
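
A function of the kind described in the footnote might be sketched as follows. This is an illustration of the idea only, not the code from the referenced answer: the name `start_lifetime_in_place` is hypothetical, the pointer is assumed to be suitably aligned for T, and the memory must be writable (for example a mapping created with PROT_READ | PROT_WRITE and MAP_PRIVATE).

```cpp
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Hypothetical sketch: start the lifetime of a T in existing, writable,
// suitably aligned storage while keeping the bytes already stored there.
template <typename T>
T* start_lifetime_in_place(void* p) {
  static_assert(std::is_trivially_copyable<T>::value,
                "T must be trivially copyable");
  unsigned char saved[sizeof(T)];
  std::memcpy(saved, p, sizeof(T));    // save the current bytes
  T* obj = ::new (p) T;                // placement new starts the lifetime
  std::memcpy(obj, saved, sizeof(T));  // restore the bytes as the value
  return obj;
}
```

An optimising compiler may well reduce both copies to nothing, but that is an observation about specific builds, not a guarantee. For the question's file, the doubles after the header would need one call per element (or an array variant of the same idea), which is why the read-only mapping in the original program cannot use this trick as written.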
