What is the correct ARM64 (AArch64) data memory barrier usage when reading a 64-bit timer value from two 32-bit timer counters?

Question

This is about the sequence for reading a 64-bit timer value from two 32-bit timer counters described in https://developer.arm.com/documentation/100400/0001/multiprocessing/global-timer/global-timer-registers

What is the correct way to insert ARM64 memory barriers between the reads?

Is something like the code below proper? Can someone please explain which data memory barriers to use in this case, and how?

do {
  high1 = read(base+4);
  asm volatile("dmb sy");
  low = read(base);
  asm volatile("dmb sy");
  high2 = read(base+4);
  asm volatile("dmb sy");
} while (high2 != high1);

I know a question on how to read a 64-bit timer already exists, but it has no detail on memory barrier usage, and I need this for ARM machines: How to read two 32bit counters as a 64bit integer without race condition

Answer 1

There are different types of memory mapping. Each type defines how memory accesses are made and what reordering of reads and writes is possible.

Reordering in this case means, for example, that the instruction sequence high1 = read(base+4); low = read(base); is performed by the CPU as low = read(base); high1 = read(base+4); and that is perfectly reasonable from a performance point of view. By the time the CPU executes while (high2 != high1); it generally does not matter which register was assigned first, low or high1. The CPU is simply not aware of the interdependence between the two words.

For this 64-bit value, we have to take extra steps so that the CPU respects the dependency between the two registers and does not reorder the reads.

The first and 'most right' way is to map the timer as 'Device' memory. Usually all memory-mapped hardware is mapped as 'Device' memory. A 'Device' mapping guarantees strict memory ordering (on AArch64 this holds for the non-reordering Device types, such as Device-nGnRE), so the CPU will not reorder the reads (or writes, or both), and the order will always be high1, low, high2. Device memory is also uncacheable. That does not matter in this case, but for something that uses DMA, for instance, it saves you from having to maintain cache-memory consistency. In conclusion, any sync barriers are redundant in this case when the timer is mapped as 'Device' memory.
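As a rough illustration of the 'Device' case, here is a minimal sketch in C. It assumes that the mapping really is a non-reordering Device type, that base points at the low word, and that the high word sits at byte offset 4. The names reg_read32 and timer_read64_device are hypothetical and not taken from any particular SDK.

```c
#include <stdint.h>

/* Hypothetical helper: one 32-bit register read through a volatile pointer,
 * so the compiler neither drops the access nor reorders it against other
 * volatile accesses. */
static inline uint32_t reg_read32(const volatile uint32_t *reg)
{
    return *reg;
}

/* Assumption: base[0] is the low word, base[1] (byte offset 4) is the high
 * word, and the region is mapped as non-reordering Device memory, so the
 * hardware sees the reads in program order without explicit barriers. */
uint64_t timer_read64_device(const volatile uint32_t *base)
{
    uint32_t high1, high2, low;

    do {
        high1 = reg_read32(&base[1]);
        low   = reg_read32(&base[0]);
        high2 = reg_read32(&base[1]);
    } while (high2 != high1);   /* retry if the high word changed meanwhile */

    return ((uint64_t)high1 << 32) | low;
}
```

The volatile qualifier only constrains the compiler. The ordering seen by the hardware comes from the memory type of the mapping.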

If one wants to go looking for trouble, the hardware might be mapped as 'generic'/'common' memory. For 'generic' memory, reordering is allowed, so you might end up in the following situation. Say we have a counter value like 0000-9999 (decimal, 4 digits for high and 4 digits for low).

high1 = read(base+4); low = read(base); is reordered and executed as low = read(base); high1 = read(base+4);
low is read as 9999; after the read finishes, the timer is incremented.
Now the timer is 0001-0000.
high1 is read as 0001.
And we have 0001-9999. Reading high2 would give 0001 again, so the loop exits with the wrong value, and life gets very interesting from this point.
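The scenario above can be replayed with a toy model. This is only a simulation sketch: the decimal counter, the toy_ names and the explicit tick between reads are all hypothetical and stand in for the hardware timer advancing at the worst possible moment.

```c
/* Toy model of the example: a decimal counter whose high half is
 * value / 10000 and whose low half is value % 10000. */
struct toy_timer { unsigned value; };   /* 9999 stands for 0000-9999 */

static unsigned toy_high(const struct toy_timer *t) { return t->value / 10000; }
static unsigned toy_low(const struct toy_timer *t)  { return t->value % 10000; }

/* Program order: high1, (timer ticks), low, high2.
 * Returns 1 and stores the combined value if high1 == high2,
 * or 0 if the mismatch was detected and the caller should retry. */
int toy_read_in_order(struct toy_timer *t, unsigned *out)
{
    unsigned high1 = toy_high(t);
    t->value++;                         /* timer increments between reads */
    unsigned low   = toy_low(t);
    unsigned high2 = toy_high(t);

    if (high1 != high2)
        return 0;
    *out = high1 * 10000 + low;
    return 1;
}

/* Reordered by the CPU: low, (timer ticks), high1, high2. */
int toy_read_reordered(struct toy_timer *t, unsigned *out)
{
    unsigned low   = toy_low(t);
    t->value++;                         /* timer increments between reads */
    unsigned high1 = toy_high(t);
    unsigned high2 = toy_high(t);

    if (high1 != high2)
        return 0;
    *out = high1 * 10000 + low;         /* torn value slips through */
    return 1;
}
```

Starting from 0000-9999, the in-order variant sees high1 = 0000 and high2 = 0001 and asks for a retry, while the reordered variant accepts 0001-9999 because both high reads happen after the increment.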

So, as I see it, it is necessary to prevent reordering of the reads of high1 and low, as well as of low and high2, because we could get a torn value like 0001-9999 in both cases (well, for the second case it would be high1=0000, high2=0000 and low=0000, with the 0001 that belongs in high missing).

So I'd say:

do {
  high1 = read(base+4);
  asm volatile("dmb sy");
  low = read(base);
  asm volatile("dmb sy");
  high2 = read(base+4);
  // asm volatile("dmb sy");  <- this trailing barrier looks excessive
} while (high2 != high1);

PS: it does not look like you need ordering as strict as sy; the weakest barrier that still guarantees the ordering on the specific CPU should be sufficient.
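To make the PS concrete, here is one possible sketch that swaps dmb sy for the load-only variant dmb ld. This is an assumption on my side rather than something verified on a specific CPU: whether a load barrier, or a narrower shareability domain, is enough depends on the system. The "memory" clobber is also an addition, so that the compiler does not move memory accesses across the asm statement. The names read_barrier and timer_read64_barrier are hypothetical, and the non-ARM fallback exists only so that the sketch compiles on a host machine.

```c
#include <stdint.h>

#if defined(__aarch64__)
/* Orders earlier loads before later loads and stores; the "memory" clobber
 * also acts as a compiler barrier. */
#define read_barrier() __asm__ volatile("dmb ld" ::: "memory")
#else
/* Fallback for non-ARM hosts so the sketch still compiles (GCC/Clang). */
#define read_barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* Assumption: base[0] is the low word and base[1] (byte offset 4) is the
 * high word of the timer. */
uint64_t timer_read64_barrier(const volatile uint32_t *base)
{
    uint32_t high1, high2, low;

    do {
        high1 = base[1];
        read_barrier();         /* keep high1 before low */
        low = base[0];
        read_barrier();         /* keep low before high2 */
        high2 = base[1];
    } while (high2 != high1);   /* retry if the high word changed meanwhile */

    return ((uint64_t)high1 << 32) | low;
}
```

No barrier follows the last read, in line with the commented-out barrier above: the loop condition only compares values that have already been loaded.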
