
Inconsistent RAM write speeds across different array sizes and write methods. Fastest way to fill an array?

I've encountered some inconsistencies in write speed across different array sizes and methods, so I wrote a separate program to investigate. It is a simple C# console application that creates a byte, an int, and a float array of a specified size, fills each with a defined value using different methods, and measures the write speed. Between method calls it also clears the CPU cache, so the results are unadulterated when the same method is run n times.

Here is the Pastebin dump of program.cs, the only file in the program. Just copy and paste it into a new, empty console application in Visual Studio.

The byte and int arrays are meant to hold bitmap data and the float array a depth buffer, so the byte frame is width * height * 4 bytes (BGRA), while the int and float frames are just width * height elements.

These are the results I get on my computer:

500*500
ClearByteFrameAsSpanFill:              MS:0,05 GB/s:18,79 MB/s:18794,50
ClearByteFrameArrayFill:               MS:0,05 GB/s:18,54 MB/s:18541,56
ClearByteFrameAsByteAVX:               MS:0,05 GB/s:19,57 MB/s:19575,94
ClearByteFrameAsByteAVXUnRolled:       MS:0,05 GB/s:18,14 MB/s:18148,42
ClearByteFrameAsByteAVXThreaded(2):    MS:0,08 GB/s:11,96 MB/s:11968,89
ClearByteFrameAsByteAVXThreaded(4):    MS:0,08 GB/s:12,19 MB/s:12192,34
ClearByteFrameAsByteAVXThreaded(6):    MS:0,08 GB/s:12,53 MB/s:12537,63
ClearByteFrameAsByteAVXThreaded(8):    MS:0,08 GB/s:12,37 MB/s:12377,25
ClearByteFrameAsByteAVXThreaded(12):   MS:0,08 GB/s:12,13 MB/s:12137,08
ClearByteFrameMarshalCopy:             MS:0,09 GB/s:10,35 MB/s:10357,00
ClearByteFrameMarshalCopy4:            MS:0,11 GB/s:9,22 MB/s:9220,86
ClearByteFrameMarshalThreaded(2):      MS:0,14 GB/s:7,17 MB/s:7171,80
ClearByteFrameMarshalThreaded(4):      MS:0,12 GB/s:7,89 MB/s:7892,86
ClearByteFrameMarshalThreaded(6):      MS:0,12 GB/s:7,99 MB/s:7993,24
ClearByteFrameMarshalThreaded(8):      MS:0,13 GB/s:7,28 MB/s:7287,14
ClearByteFrameMarshalThreaded(12):     MS:0,13 GB/s:7,48 MB/s:7485,39
ClearByteFrameNaive:                   MS:0,30 GB/s:3,30 MB/s:3308,06
ClearByteFrameBytePointer:             MS:0,25 GB/s:3,97 MB/s:3970,49
ClearByteFrameInt32Pointer:            MS:0,07 GB/s:13,98 MB/s:13981,94
ClearByteFrameInt32PointerThreads(2):  MS:0,07 GB/s:12,58 MB/s:12582,46
ClearByteFrameInt32PointerThreads(4):  MS:0,08 GB/s:11,93 MB/s:11938,29
ClearByteFrameInt32PointerThreads(6):  MS:0,08 GB/s:12,14 MB/s:12144,36
ClearByteFrameInt32PointerThreads(8):  MS:0,08 GB/s:11,62 MB/s:11629,97
ClearByteFrameInt32PointerThreads(12): MS:0,09 GB/s:11,38 MB/s:11388,57
ClearByteFrameInt64Pointer:            MS:0,05 GB/s:18,69 MB/s:18692,43
ClearByteFrameInt64PointerThreads(2):  MS:0,07 GB/s:12,91 MB/s:12917,26
ClearByteFrameInt64PointerThreads(4):  MS:0,08 GB/s:12,47 MB/s:12477,97
ClearByteFrameInt64PointerThreads(6):  MS:0,10 GB/s:11,43 MB/s:11439,57
ClearByteFrameInt64PointerThreads(8):  MS:0,08 GB/s:11,50 MB/s:11504,56
ClearByteFrameInt64PointerThreads(12): MS:0,09 GB/s:11,05 MB/s:11052,96

ClearInt32FrameAsSpanFill:             MS:0,05 GB/s:19,25 MB/s:19256,44
ClearInt32FrameArrayFill:              MS:0,05 GB/s:18,80 MB/s:18803,56

ClearFloatFrameAsSpanFill:             MS:0,05 GB/s:19,24 MB/s:19245,53
ClearFloatFrameArrayFill:              MS:0,05 GB/s:19,29 MB/s:19296,62

-- --

1500*1500
ClearByteFrameAsSpanFill:              MS:0,47 GB/s:18,96 MB/s:18963,17
ClearByteFrameArrayFill:               MS:0,48 GB/s:18,63 MB/s:18638,61
ClearByteFrameAsByteAVX:               MS:0,46 GB/s:19,54 MB/s:19547,95
ClearByteFrameAsByteAVXUnRolled:       MS:0,47 GB/s:18,98 MB/s:18987,69
ClearByteFrameAsByteAVXThreaded(2):    MS:0,55 GB/s:16,71 MB/s:16715,60
ClearByteFrameAsByteAVXThreaded(4):    MS:0,53 GB/s:17,09 MB/s:17097,25
ClearByteFrameAsByteAVXThreaded(6):    MS:0,53 GB/s:17,10 MB/s:17106,20
ClearByteFrameAsByteAVXThreaded(8):    MS:0,48 GB/s:19,06 MB/s:19066,60
ClearByteFrameAsByteAVXThreaded(12):   MS:0,51 GB/s:17,75 MB/s:17759,48
ClearByteFrameMarshalCopy:             MS:0,81 GB/s:11,05 MB/s:11059,42
ClearByteFrameMarshalCopy4:            MS:0,80 GB/s:11,20 MB/s:11207,69
ClearByteFrameMarshalThreaded(2):      MS:0,83 GB/s:11,14 MB/s:11142,98
ClearByteFrameMarshalThreaded(4):      MS:0,79 GB/s:11,49 MB/s:11494,85
ClearByteFrameMarshalThreaded(6):      MS:0,87 GB/s:10,37 MB/s:10372,56
ClearByteFrameMarshalThreaded(8):      MS:0,85 GB/s:10,69 MB/s:10692,81
ClearByteFrameMarshalThreaded(12):     MS:0,86 GB/s:10,49 MB/s:10497,28
ClearByteFrameNaive:                   MS:2,67 GB/s:3,35 MB/s:3359,32
ClearByteFrameBytePointer:             MS:2,24 GB/s:4,01 MB/s:4015,63
ClearByteFrameInt32Pointer:            MS:0,62 GB/s:14,34 MB/s:14340,04
ClearByteFrameInt32PointerThreads(2):  MS:0,50 GB/s:18,05 MB/s:18052,87
ClearByteFrameInt32PointerThreads(4):  MS:0,49 GB/s:18,30 MB/s:18306,56
ClearByteFrameInt32PointerThreads(6):  MS:0,46 GB/s:19,37 MB/s:19378,58
ClearByteFrameInt32PointerThreads(8):  MS:0,48 GB/s:18,48 MB/s:18483,02
ClearByteFrameInt32PointerThreads(12): MS:0,49 GB/s:18,26 MB/s:18266,97
ClearByteFrameInt64Pointer:            MS:0,46 GB/s:19,26 MB/s:19264,51
ClearByteFrameInt64PointerThreads(2):  MS:0,51 GB/s:17,59 MB/s:17599,42
ClearByteFrameInt64PointerThreads(4):  MS:0,50 GB/s:18,07 MB/s:18075,17
ClearByteFrameInt64PointerThreads(6):  MS:0,47 GB/s:19,19 MB/s:19196,42
ClearByteFrameInt64PointerThreads(8):  MS:0,49 GB/s:18,27 MB/s:18273,37
ClearByteFrameInt64PointerThreads(12): MS:0,48 GB/s:18,49 MB/s:18499,04

ClearInt32FrameAsSpanFill:             MS:0,47 GB/s:18,98 MB/s:18987,70
ClearInt32FrameArrayFill:              MS:0,48 GB/s:18,70 MB/s:18702,73

ClearFloatFrameAsSpanFill:             MS:0,48 GB/s:18,60 MB/s:18609,78
ClearFloatFrameArrayFill:              MS:0,47 GB/s:19,07 MB/s:19077,63

-- --

4500*4500
ClearByteFrameAsSpanFill:              MS:3,49 GB/s:23,25 MB/s:23251,93
ClearByteFrameArrayFill:               MS:3,45 GB/s:23,47 MB/s:23473,07
ClearByteFrameAsByteAVX:               MS:7,24 GB/s:11,20 MB/s:11200,27
ClearByteFrameAsByteAVXUnRolled:       MS:7,32 GB/s:11,08 MB/s:11081,40
ClearByteFrameAsByteAVXThreaded(2):    MS:6,93 GB/s:11,70 MB/s:11702,41
ClearByteFrameAsByteAVXThreaded(4):    MS:6,44 GB/s:12,58 MB/s:12588,30
ClearByteFrameAsByteAVXThreaded(6):    MS:6,48 GB/s:12,53 MB/s:12536,29
ClearByteFrameAsByteAVXThreaded(8):    MS:6,49 GB/s:12,47 MB/s:12479,53
ClearByteFrameAsByteAVXThreaded(12):   MS:6,59 GB/s:12,28 MB/s:12286,35
ClearByteFrameMarshalCopy:             MS:7,19 GB/s:11,27 MB/s:11270,57
ClearByteFrameMarshalCopy4:            MS:7,25 GB/s:11,17 MB/s:11179,60
ClearByteFrameMarshalThreaded(2):      MS:7,15 GB/s:11,33 MB/s:11337,41
ClearByteFrameMarshalThreaded(4):      MS:7,38 GB/s:10,97 MB/s:10970,25
ClearByteFrameMarshalThreaded(6):      MS:7,58 GB/s:10,68 MB/s:10682,29
ClearByteFrameMarshalThreaded(8):      MS:7,64 GB/s:10,60 MB/s:10601,81
ClearByteFrameMarshalThreaded(12):     MS:7,78 GB/s:10,40 MB/s:10400,19
ClearByteFrameNaive:                   MS:24,18 GB/s:3,34 MB/s:3348,89
ClearByteFrameBytePointer:             MS:20,35 GB/s:3,97 MB/s:3979,91
ClearByteFrameInt32Pointer:            MS:8,10 GB/s:10,00 MB/s:10009,30
ClearByteFrameInt32PointerThreads(2):  MS:7,00 GB/s:11,57 MB/s:11575,67
ClearByteFrameInt32PointerThreads(4):  MS:6,29 GB/s:12,86 MB/s:12868,42
ClearByteFrameInt32PointerThreads(6):  MS:6,50 GB/s:12,48 MB/s:12485,34
ClearByteFrameInt32PointerThreads(8):  MS:6,49 GB/s:12,49 MB/s:12496,20
ClearByteFrameInt32PointerThreads(12): MS:6,64 GB/s:12,19 MB/s:12194,29
ClearByteFrameInt64Pointer:            MS:7,16 GB/s:11,33 MB/s:11331,90
ClearByteFrameInt64PointerThreads(2):  MS:6,80 GB/s:11,92 MB/s:11926,33
ClearByteFrameInt64PointerThreads(4):  MS:6,40 GB/s:12,67 MB/s:12670,50
ClearByteFrameInt64PointerThreads(6):  MS:6,51 GB/s:12,49 MB/s:12490,41
ClearByteFrameInt64PointerThreads(8):  MS:6,48 GB/s:12,51 MB/s:12511,71
ClearByteFrameInt64PointerThreads(12): MS:6,61 GB/s:12,25 MB/s:12256,23

ClearInt32FrameAsSpanFill:             MS:7,41 GB/s:10,94 MB/s:10948,01
ClearInt32FrameArrayFill:              MS:7,28 GB/s:11,13 MB/s:11133,96

ClearFloatFrameAsSpanFill:             MS:7,23 GB/s:11,20 MB/s:11202,94
ClearFloatFrameArrayFill:              MS:7,19 GB/s:11,26 MB/s:11265,38

For reference, my computer has an AMD Ryzen 3600, and independent benchmarking software puts my maximum write speed at about 20 GB/s.

My questions are as follows:

  1. Why does ClearByteFrameAsSpanFill perform well on all three sizes, while the int32 frame and float frame, using the same AsSpan fill method, cap at about 10 GB/s on the large size?
  2. Why do ClearByteFrameAsByteAVX, ClearByteFrameInt32Pointer and ClearByteFrameInt64Pointer do well on the small and medium sizes, but not on the large one? What could be causing this? These are literally the simplest, most straightforward functions you can imagine: write to a pointer, increment the pointer, and repeat. Array size should have no impact at all.
  3. Why do the different methods that come close to the 20 GB/s mark on the medium test all cap together at around the 10 GB/s mark when they are slow? Is there some behind-the-scenes mechanism that caps write speed at 50% under certain circumstances?

The only winners are ClearByteFrameAsSpanFill and ClearByteFrameArrayFill, but they are useless to me, since I must reset the buffer with 4 different values per pixel. I thought I could use an int32 array instead of bytes, bit-shift my RGBA values into a single int, and clear with that, but that is slow on the large size. ClearByteFrameAsByteAVX works well on small and medium, but is slow on large.

I need a single method that performs well across all sizes, and I am morbidly curious why it behaves like this. You would think something as basic as this would be well optimized in the .NET Framework.

Any help would be appreciated. Please create a new console application and run the program to see if you get the same problem.

You can use the AsSpan approach in combination with MemoryMarshal.Cast to use int or long as fill values:

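// Requires: using System; using System.Runtime.InteropServices;
// ByteFrame is the byte array from the question's program.

static double ClearByteFrameAsIntSpanFill(int clearValue)
{
    // Reinterpret the byte array as a span of ints (no copy), then fill 4 bytes per element.
    var asSpan = ByteFrame.AsSpan();
    var cast = MemoryMarshal.Cast<byte, int>(asSpan);
    cast.Fill(clearValue);
    // Frame size in MB (10^6 bytes).
    return (float)ByteFrame.Length / 1000000;
}

static double ClearByteFrameAsLongSpanFill(long clearValue)
{
    // Same idea, but with 8 bytes per element.
    var asSpan = ByteFrame.AsSpan();
    var cast = MemoryMarshal.Cast<byte, long>(asSpan);
    cast.Fill(clearValue);
    // Frame size in MB (10^6 bytes).
    return (float)ByteFrame.Length / 1000000;
}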

I've run a few tests using your code, and on my machine ClearByteFrameAsIntSpanFill and ClearByteFrameAsLongSpanFill seem to perform similarly to ClearByteFrameAsSpanFill / ClearByteFrameArrayFill.
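For the case in the question, where each pixel needs four different byte values, the int fill value can be built from the individual channels. The following is only a minimal sketch: the names PackedFill, PackBgra and FillBgra are made up for illustration, and the byte order assumes a little-endian machine (which x86/x64 is).

```csharp
using System;
using System.Runtime.InteropServices;

public static class PackedFill
{
    // Packs four channel bytes into one int. Assumption: little-endian CPU,
    // so the lowest byte (b) ends up first in memory, giving B, G, R, A order.
    public static int PackBgra(byte b, byte g, byte r, byte a) =>
        b | (g << 8) | (r << 16) | (a << 24);

    // Fills a BGRA byte buffer with a single colour by viewing it as ints.
    // If frame.Length is not a multiple of 4, the trailing bytes are left untouched,
    // because MemoryMarshal.Cast rounds the element count down.
    public static void FillBgra(byte[] frame, byte b, byte g, byte r, byte a)
    {
        Span<int> pixels = MemoryMarshal.Cast<byte, int>(frame.AsSpan());
        pixels.Fill(PackBgra(b, g, r, a));
    }
}
```

Calling PackedFill.FillBgra(frame, 10, 20, 30, 255) on a width * height * 4 byte array would then write the bytes 10, 20, 30, 255 repeatedly, once per pixel.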

In general, though, I recommend using BenchmarkDotNet for performance testing and investigation.
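As a rough idea of what that could look like for the fills discussed here, below is a minimal sketch. It assumes the BenchmarkDotNet NuGet package is referenced and the project is built in Release mode. The class name FillBenchmarks, the method names and the fill constants are invented for illustration and are not taken from the question's program. Unlike that program, this sketch does not flush the CPU cache between runs, so the numbers may not be directly comparable.

```csharp
using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class FillBenchmarks
{
    // Frame edge lengths, taken from the three sizes tested in the question.
    [Params(500, 1500, 4500)]
    public int Size;

    private byte[] _byteFrame;

    // Allocates the BGRA frame once per parameter value, outside the measured code.
    [GlobalSetup]
    public void Setup() => _byteFrame = new byte[Size * Size * 4];

    [Benchmark(Baseline = true)]
    public void ByteSpanFill() => _byteFrame.AsSpan().Fill(0x7F);

    [Benchmark]
    public void IntSpanFill() =>
        MemoryMarshal.Cast<byte, int>(_byteFrame.AsSpan()).Fill(0x7F7F7F7F);

    [Benchmark]
    public void LongSpanFill() =>
        MemoryMarshal.Cast<byte, long>(_byteFrame.AsSpan()).Fill(0x7F7F7F7F7F7F7F7FL);
}

public static class Program
{
    // Runs every [Benchmark] method for every [Params] value and prints a summary table.
    public static void Main() => BenchmarkRunner.Run<FillBenchmarks>();
}
```

BenchmarkDotNet handles warm-up, iteration counts and statistics itself, which removes a lot of the noise that hand-rolled Stopwatch loops tend to pick up.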
