FillChar and StringOfChar under Delphi 10.2 for Win64 Release Target

Question

I have a question about a specific programming problem in Delphi 10.2.

StringOfChar and FillChar don't work properly in Win64 Release builds on CPUs released before 2012.

The expected result of FillChar is a plain sequence of one repeated 8-bit character in the given memory buffer.
The expected result of StringOfChar is the same, but stored in a string.

However, when we compile applications that worked with earlier Delphi versions using Delphi 10.2, the Win64 builds stop working properly on CPUs released before 2012.

StringOfChar and FillChar return a string of different characters, in a repeating pattern, instead of a sequence of the same character.

Here is minimal code that demonstrates the issue. Note that the sequence must be at least 16 characters long and the character must not be NUL (#0):

procedure TestStringOfChar;
var
  a: AnsiString;
  ac: AnsiChar;
begin
  ac := #1;
  a := StringOfChar(ac, 43);
  if a <> #1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1#1 then
  begin
    raise Exception.Create('ANSI StringOfChar Failed!!');
  end;
end;
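The question says FillChar is affected too, but only shows a StringOfChar test. A companion check for FillChar on a raw buffer could look like the sketch below. The program name, buffer name and messages are assumptions made for illustration; the length 43 and the value 1 mirror the test above.

```pascal
program FillCharCheckSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Buf: array[0..42] of Byte; // 43 bytes, same length as the StringOfChar test
  I: Integer;
begin
  // Fill the whole buffer with the non-zero value 1.
  FillChar(Buf, SizeOf(Buf), 1);

  // Every byte must hold the fill value; report the first one that does not.
  for I := Low(Buf) to High(Buf) do
    if Buf[I] <> 1 then
    begin
      Writeln(Format('FillChar failed at offset %d: got %d', [I, Buf[I]]));
      Halt(1);
    end;

  Writeln('FillChar check passed');
end.
```

Reporting the offset of the first wrong byte makes it easier to see the repeating pattern described above.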

I know there are lots of Delphi programmers on StackOverflow. Are you experiencing the same problem? If so, how did you resolve it? By the way, I have contacted the developers of Delphi, but so far they have neither confirmed nor denied the issue. I'm using Embarcadero Delphi 10.2 Version 25.0.26309.314.

Update:

If your CPU was manufactured in 2012 or later, add the following lines before calling StringOfChar to reproduce the issue:

const
  ERMSBBit    = 1 shl 9; //$0200
begin
  CPUIDTable[7].EBX := CPUIDTable[7].EBX and not ERMSBBit;

As for the April 2017 RAD Studio 10.2 Hotfix for Toolchain Issues: I have tried both with and without it, and it didn't help. The issue exists regardless of the Hotfix.
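For reference, bit 9 of EBX in CPUID leaf 7 is the ERMSB (Enhanced REP MOVSB/STOSB) feature flag, which is why clearing it makes a newer CPU take the same path as an older one. The sketch below puts the masking step and the check into one console program. It assumes CPUIDTable is visible and writable from application code on the Win64 target, as the snippet above implies; the program name and messages are illustrative.

```pascal
program ErmsbReproSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  ERMSBBit = 1 shl 9; // CPUID leaf 7, EBX bit 9: Enhanced REP MOVSB/STOSB

var
  ac: AnsiChar;
  a: AnsiString;
  I: Integer;
begin
  // Show what the RTL recorded for this CPU before touching it.
  if (CPUIDTable[7].EBX and ERMSBBit) <> 0 then
    Writeln('ERMSB reported by CPUID; masking it out')
  else
    Writeln('ERMSB not reported; the non-ERMSB path is used already');

  // Same statement as in the question: hide ERMSB from the RTL.
  CPUIDTable[7].EBX := CPUIDTable[7].EBX and not ERMSBBit;

  ac := #1;
  a := StringOfChar(ac, 43);

  // Report the first character that differs from the fill character.
  for I := 1 to Length(a) do
    if a[I] <> ac then
    begin
      Writeln(Format('Mismatch at index %d: #%d', [I, Ord(a[I])]));
      Halt(1);
    end;

  Writeln('No mismatch found');
end.
```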

Update #2

Embarcadero confirmed and resolved this issue on 08/Aug/17 6:03 PM. So the bug is fixed in Delphi 10.2 Tokyo Release 1 (released on August 8, 2017).

Answer 1

StringOfChar(A: AnsiChar, count) uses FillChar under the hood.
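To make that relationship concrete, here is a minimal sketch of how an AnsiChar version of StringOfChar can be expressed in terms of FillChar. This is not the RTL source; the function name AnsiStringOfCharSketch is made up for illustration. It shows why a broken FillChar surfaces as a broken StringOfChar.

```pascal
program StringOfCharSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical stand-in for the AnsiChar overload of StringOfChar.
function AnsiStringOfCharSketch(Ch: AnsiChar; Count: Integer): AnsiString;
begin
  Result := '';
  if Count > 0 then
  begin
    SetLength(Result, Count);              // allocate Count characters
    FillChar(Pointer(Result)^, Count, Ch); // all the real work happens here
  end;
end;

var
  S: AnsiString;
  I: Integer;
begin
  S := AnsiStringOfCharSketch(#1, 43);
  for I := 1 to Length(S) do
    if S[I] <> #1 then
    begin
      Writeln(Format('Mismatch at index %d', [I]));
      Halt(1);
    end;
  Writeln('Length: ', Length(S));
end.
```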

You can use the following code to fix the issue:

(*******************************************************
 System.FastSystem
 A fast drop-in addition to speed up function in system.pas
 It should compile and run in XE2 and beyond.
 Alpha version 0.5, fully tested in Win64
 (c) Copyright 2016 J. Bontes
   This Source Code Form is subject to the terms of the
   Mozilla Public License, v. 2.0.
   If a copy of the MPL was not distributed with this file,
   You can obtain one at http://mozilla.org/MPL/2.0/.
********************************************************
FillChar code is an altered version FillCharsse2 SynCommons.pas
which is part of Synopse framework by Arnaud Bouchez
********************************************************
Changelog
0.5 Initial version:
********************************************************)

unit FastSystem;

interface

procedure FillChar(var Dest; Count: NativeInt; Value: ansichar); inline; overload;
procedure FillChar(var Dest; Count: NativeInt; Value: Byte); overload;
procedure FillMemory(Destination: Pointer; Length: NativeUInt; Fill: Byte); inline;
{$EXTERNALSYM FillMemory}
procedure ZeroMemory(Destination: Pointer; Length: NativeUInt); inline;
{$EXTERNALSYM ZeroMemory}

implementation

procedure FillChar(var Dest; Count: NativeInt; Value: ansichar); inline; overload;
begin
  FillChar(Dest, Count, byte(Value));
end;

procedure FillMemory(Destination: Pointer; Length: NativeUInt; Fill: Byte);
begin
  FillChar(Destination^, Length, Fill);
end;

procedure ZeroMemory(Destination: Pointer; Length: NativeUInt); inline;
begin
  FillChar(Destination^, Length, 0);
end;

//This code is 3x faster than System.FillChar on x64.

{$ifdef CPUX64}
procedure FillChar(var Dest; Count: NativeInt; Value: Byte);
//rcx = dest
//rdx=count
//r8b=value
asm
              .noframe
              .align 16
              movzx r8,r8b           //There's no need to optimize for count <= 3
              mov rax,$0101010101010101
              mov r9d,edx
              imul rax,r8            //fill rax with value.
              cmp rdx,59             //Use simple code for small blocks.
              jl  @Below32
@Above32:     mov r11,rcx
              mov r8b,7              //code shrink to help alignment.
              lea r9,[rcx+rdx]       //r9=end of array
              sub rdx,8
              rep mov [rcx],rax
              add rcx,8
              and r11,r8             //and 7 See if dest is aligned
              jz @tail
@NotAligned:  xor rcx,r11            //align dest
              lea rdx,[rdx+r11]
@tail:        test r9,r8             //and 7 is tail aligned?
              jz @alignOK
@tailwrite:   mov [r9-8],rax         //no, we need to do a tail write
              and r9,r8              //and 7
              sub rdx,r9             //dec(count, tailcount)
@alignOK:     mov r10,rdx
              and edx,(32+16+8)      //count the partial iterations of the loop
              mov r8b,64             //code shrink to help alignment.
              mov r9,rdx
              jz @Initloop64
@partialloop: shr r9,1              //every instruction is 4 bytes
              lea r11,[rip + @partial +(4*7)] //start at the end of the loop
              sub r11,r9            //step back as needed
              add rcx,rdx            //add the partial loop count to dest
              cmp r10,r8             //do we need to do more loops?
              jmp r11                //do a partial loop
@Initloop64:  shr r10,6              //any work left?
              jz @done               //no, return
              mov rdx,r10
              shr r10,(19-6)         //use non-temporal move for > 512kb
              jnz @InitFillHuge
@Doloop64:    add rcx,r8
              dec edx
              mov [rcx-64+00H],rax
              mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jnz @DoLoop64
@done:        rep ret
              //db $66,$66,$0f,$1f,$44,$00,$00 //nop7
@partial:     mov [rcx-64+08H],rax
              mov [rcx-64+10H],rax
              mov [rcx-64+18H],rax
              mov [rcx-64+20H],rax
              mov [rcx-64+28H],rax
              mov [rcx-64+30H],rax
              mov [rcx-64+38H],rax
              jge @Initloop64        //are we done with all loops?
              rep ret
              db $0F,$1F,$40,$00
@InitFillHuge:
@FillHuge:    add rcx,r8
              dec rdx
              db $48,$0F,$C3,$41,$C0 // movnti  [rcx-64+00H],rax
              db $48,$0F,$C3,$41,$C8 // movnti  [rcx-64+08H],rax
              db $48,$0F,$C3,$41,$D0 // movnti  [rcx-64+10H],rax
              db $48,$0F,$C3,$41,$D8 // movnti  [rcx-64+18H],rax
              db $48,$0F,$C3,$41,$E0 // movnti  [rcx-64+20H],rax
              db $48,$0F,$C3,$41,$E8 // movnti  [rcx-64+28H],rax
              db $48,$0F,$C3,$41,$F0 // movnti  [rcx-64+30H],rax
              db $48,$0F,$C3,$41,$F8 // movnti  [rcx-64+38H],rax
              jnz @FillHuge
@donefillhuge:mfence
              rep ret
              db $0F,$1F,$44,$00,$00  //db $0F,$1F,$40,$00
@Below32:     and  r9d,not(3)
              jz @SizeIs3
@FillTail:    sub   edx,4
              lea   r10,[rip + @SmallFill + (15*4)]
              sub   r10,r9
              jmp   r10
@SmallFill:   rep mov [rcx+56], eax
              rep mov [rcx+52], eax
              rep mov [rcx+48], eax
              rep mov [rcx+44], eax
              rep mov [rcx+40], eax
              rep mov [rcx+36], eax
              rep mov [rcx+32], eax
              rep mov [rcx+28], eax
              rep mov [rcx+24], eax
              rep mov [rcx+20], eax
              rep mov [rcx+16], eax
              rep mov [rcx+12], eax
              rep mov [rcx+08], eax
              rep mov [rcx+04], eax
              mov [rcx],eax
@Fallthough:  mov [rcx+rdx],eax  //unaligned write to fix up tail
              rep ret

@SizeIs3:     shl edx,2           //r9 <= 3  r9*4
              lea r10,[rip + @do3 + (4*3)]
              sub r10,rdx
              jmp r10
@do3:         rep mov [rcx+2],al
@do2:         mov [rcx],ax
              ret
@do1:         mov [rcx],al
              rep ret
@do0:         rep ret
end;
{$endif}

end.
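The "fill rax with value" step near the top of the assembly relies on a multiplication trick: multiplying a byte by $0101010101010101 copies it into all eight byte lanes of a 64-bit register. The sketch below restates that step in plain Pascal; the function name BroadcastByte is an assumption made for illustration and is not part of the unit above.

```pascal
program BroadcastSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Replicates Value into every byte of a 64-bit word,
// mirroring "mov rax,$0101010101010101" followed by "imul rax,r8".
function BroadcastByte(Value: Byte): UInt64;
const
  Ones: UInt64 = $0101010101010101;
begin
  // The largest product, $FF * Ones, is $FFFFFFFFFFFFFFFF, so it cannot overflow.
  Result := UInt64(Value) * Ones;
end;

begin
  if BroadcastByte($AB) <> UInt64($ABABABABABABABAB) then
  begin
    Writeln('Broadcast check failed');
    Halt(1);
  end;
  if BroadcastByte(0) <> 0 then
  begin
    Writeln('Broadcast of zero failed');
    Halt(1);
  end;
  Writeln('Broadcast check passed');
end.
```

With the pattern in a full-width register, the routine can store 8 bytes per instruction instead of 1.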

The easiest way to fix your issue is to download Mormot and include SynCommons.pas in your project. This will patch System.FillChar to the above code and include a couple of other performance improvements as well.

Note that you don't need all of Mormot, just SynCommons by itself.

Answer 2

I took the test case from the FastCode Challenge - http://fastcode.sourceforge.net/

I compiled the FillChar testing tool for Win64 and removed all 32-bit versions of FillChar present in the test.

I left 2 versions of 64-bit FillChar:

FC_TokyoBugfixAVXEx - the one present in Delphi Tokyo 64-bit, with bugs fixed and AVX registers added. It branches to detect ERMSB, AVX1 and AVX2 CPU capabilities. This branching happens on each FillChar call. There is no entry point patching or function address mapping.
FillChar_J_Bontes - another version of FillChar, the function from System.FastSystem that you have posted here.
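The difference between branching on each call and function address mapping can be sketched as follows. All names here (FillPlain, FillWide, HasFastPath, FillImpl) are hypothetical, and both fill routines are simple stand-ins rather than the tested implementations. The sketch only shows where the capability decision is made.

```pascal
program DispatchSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  TFillProc = procedure(var Dest; Count: NativeInt; Value: Byte);

// Stand-in for a baseline implementation.
procedure FillPlain(var Dest; Count: NativeInt; Value: Byte);
var
  P: PByte;
  I: NativeInt;
begin
  P := @Dest;
  for I := 1 to Count do
  begin
    P^ := Value;
    Inc(P);
  end;
end;

// Stand-in for a CPU-specific implementation (same behaviour here).
procedure FillWide(var Dest; Count: NativeInt; Value: Byte);
begin
  FillPlain(Dest, Count, Value);
end;

var
  HasFastPath: Boolean; // hypothetical capability flag, detected once
  FillImpl: TFillProc;  // hypothetical procedure variable, mapped once

// Style 1: the capability test runs on every call.
procedure FillBranching(var Dest; Count: NativeInt; Value: Byte);
begin
  if HasFastPath then
    FillWide(Dest, Count, Value)
  else
    FillPlain(Dest, Count, Value);
end;

var
  A, B: array[0..42] of Byte;
begin
  HasFastPath := False; // a real program would query CPUID here

  // Style 2: choose the implementation once and call through the variable.
  if HasFastPath then
    FillImpl := FillWide
  else
    FillImpl := FillPlain;

  FillBranching(A, SizeOf(A), 1);
  FillImpl(B, SizeOf(B), 1);

  if CompareMem(@A, @B, SizeOf(A)) then
    Writeln('Both dispatch styles produced the same buffer')
  else
    Writeln('Buffers differ');
end.
```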

I didn't test vanilla FillChar from Delphi Tokyo, because it contains the bug described in my initial post and handles ERMSB improperly.

Kaby Lake - i7-7700K

The first column is the alignment of the function. The next 4 columns are the results of the individual tests; lower is better. There are 4 tests in total. The first test operates on smaller blocks, the second on larger ones, and so on. The last column is a weighted summary of all tests.

The CPU in the first test is a Kaby Lake i7-7700K (January 2017). Frequency 4.2 GHz (turbo frequency up to 4.5 GHz), L2 cache 4 × 256 KB, L3 cache 8 MB.

Ivy Bridge - E5-2603 v2

Here are the results of a second test, on a previous microarchitecture: Xeon E5-2603 v2 "Ivy Bridge" (September 2013), frequency 1.8 GHz, L2 Cache 4 × 256 KB, L3 Cache 10 MB, RAM 4 × DDR3-1333.

Ivy Bridge - E5-2643 v2

Here are the test results on a third set of hardware: Intel Xeon E5-2643 v2 (September 2013), frequency 3.5 GHz, L2 Cache 6 × 256 KB, L3 Cache 25 MB, RAM 4 × DDR3-1600.

Intel Core i9 7900X

Here are the test results on a fourth set of hardware: Intel Core i9 7900X (June 2017), frequency 3.3 GHz (turbo frequency up to 4.5 GHz), L2 Cache 10 × 1024 KB, L3 Cache 13.75 MB, RAM 4 × DDR4-2134.

Answer 1
11 2017-05-15 10:20:21

Answer 2
2 2017-06-19 06:59:21
