
C# stackalloc slower than regular variables?

I have two functions implementing uint128 multiplication in two different ways: one uses variables, the other uses stackalloc "arrays".

Variable Version

public static UInt128 operator *(UInt128 i, UInt128 j) {

 ulong I0 = i._uint0; ulong I1 = i._uint1; ulong I2 = i._uint2; ulong I3 = i._uint3;
 ulong J0 = j._uint0; ulong J1 = j._uint1; ulong J2 = j._uint2; ulong J3 = j._uint3;
 ulong R0 = 0; ulong R1 = 0; ulong R2 = 0; ulong R3 = 0;

 if (I0 != 0) {
   R0 += I0 * J0;
   R1 += I0 * J1;
   R2 += I0 * J2;
   R3 += I0 * J3;
 }
 if (I1 != 0) {
   R1 += I1 * J0;
   R2 += I1 * J1;
   R3 += I1 * J2;
 }
 if (I2 != 0) {
   R2 += I2 * J0;
   R3 += I2 * J1;
 }
 R3 += I3 * J0;

 R1 += R0 >> 32; R0 &= uint.MaxValue;
 R2 += R1 >> 32; R1 &= uint.MaxValue;
 R3 += R2 >> 32; R2 &= uint.MaxValue;
 R3 &= uint.MaxValue;

 return new UInt128((uint)R3, (uint)R2, (uint)R1, (uint)R0);
}

Stackalloc Version

The [0 + 1], [1 + 1], etc. are left for clarity only. The C# compiler will fold them into constants anyway.

public unsafe static UInt128 operator *(UInt128 i, UInt128 j) {

  var I = stackalloc ulong[4];
  var J = stackalloc ulong[4];
  var R = stackalloc ulong[4];

  I[0] = i._uint0; I[1] = i._uint1; I[2] = i._uint2; I[3] = i._uint3;
  J[0] = j._uint0; J[1] = j._uint1; J[2] = j._uint2; J[3] = j._uint3;


  if (I[0] != 0) {
    R[0] += I[0] * J[0];
    R[0 + 1] += I[0] * J[1];
    R[0 + 2] += I[0] * J[2];
    R[0 + 3] += I[0] * J[3];
  }
  if (I[1] != 0) {
    R[1] += I[1] * J[0];
    R[1 + 1] += I[1] * J[1];
    R[1 + 2] += I[1] * J[2];
  }
  if (I[2] != 0) {
    R[2] += I[2] * J[0];
    R[2 + 1] += I[2] * J[1];
  }
  R[3] += I[3] * J[0];


  R[1] += R[0] >> 32; R[0] &= uint.MaxValue;
  R[2] += R[1] >> 32; R[1] &= uint.MaxValue;
  R[3] += R[2] >> 32; R[2] &= uint.MaxValue;
  R[3] &= uint.MaxValue;

  return new UInt128((uint)R[3], (uint)R[2], (uint)R[1], (uint)R[0]);
}
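The remark about [0 + 1] and [1 + 1] can be checked in isolation. The sketch below is only an illustration: the class and member names (ConstantIndexDemo, First, Second, Read) are hypothetical and do not appear in the original code. It relies on the C# rule that constant expressions are evaluated at compile time, which is why such an expression is accepted in a const declaration.

```csharp
public static class ConstantIndexDemo
{
    // These declarations only compile because `0 + 1` and `1 + 2` are
    // constant expressions, which the compiler evaluates at compile time.
    public const int First = 0 + 1;
    public const int Second = 1 + 2;

    public static ulong Read(ulong[] values)
    {
        // The index expressions are constants, so this reads
        // values[1] and values[3]; no addition is left for run time.
        return values[0 + 1] + values[1 + 2];
    }
}
```

The same folding should apply to the pointer indexing in the stackalloc version, so the extra additions in the index expressions are unlikely to explain the timing difference.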

For some reason the "variable" version seems to be ~20% faster than the "stackalloc" version on both x86 and x64 (with optimizations enabled), using the C# 7.2 compiler and running on .NET 4.6.1. I haven't checked the performance on newer or older frameworks but suspect it will be similar, so my question is not specific to 4.6.1; stackalloc seems to be slower in general.

Is there any reason the stackalloc version is slower, considering that both versions allocate exactly the same amount of memory (12 * sizeof(ulong)) and perform exactly the same operations in the same order? I would really prefer to work with arrays via stackalloc instead of variables.
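For anyone who wants to reproduce a comparison like this, here is a minimal timing sketch. It is not the harness used for the ~20% figure above (the post does not show one). MicroBenchmark and Measure are hypothetical names, and the warm-up count of 1000 is an arbitrary assumption. It uses System.Diagnostics.Stopwatch, warms up first so JIT compilation is not timed, and accumulates the results so the work cannot be dropped as unused.

```csharp
using System;
using System.Diagnostics;

public static class MicroBenchmark
{
    // Hypothetical helper: runs `body` `iterations` times after a warm-up
    // phase and returns the elapsed time of the timed runs only.
    public static TimeSpan Measure(Func<ulong> body, int iterations, out ulong checksum)
    {
        if (body == null) throw new ArgumentNullException(nameof(body));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        ulong sink = 0;

        // Warm-up (1000 calls, an arbitrary choice) so that JIT compilation
        // of `body` is not included in the measurement.
        for (int n = 0; n < 1000; n++) sink += body();

        sink = 0;
        var watch = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            // Accumulate the results so the calls cannot be discarded as unused.
            sink += body();
        }
        watch.Stop();

        checksum = sink;
        return watch.Elapsed;
    }
}
```

A single multiplication is far cheaper than a delegate call, so the body passed to Measure should itself loop over many multiplications (for example, a few million i * j operations on fixed inputs) and return a value derived from the results. Build in Release mode and run without a debugger attached. A dedicated tool such as BenchmarkDotNet is likely to give more reliable numbers than a hand-written loop.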

Variable version (simplified): https://sharplab.io/#v2:C4LghgzgtgPgAgZgAQWAJwK4GNhIKoCSAdsAIwBMAHAN4CwAUEk0hgDYD2RA5kgPoYBLEgAYA3A2YsO3PoJKlxjZm049+Q4OUWSVM9SQTbmifMTJUAFHNxgANCw1IARvetIsrxwBMAlHSWSTPrAwkgAvEhgRoGyGqThztGBweQJWEmSwcgRXkkAvgyF9CZwpABspvJUSOwADgCmaGDA7GhIAFQWhFWUSAL23ea9AFY+SP4SujwEoRECAHTBYkiT0jwAUrNIw4vWy6uqSABKW/v0EgIAZkgWM0gAhBEUCAAsY/7MJ0gA1BF37UhNooJAVzmCkHAAOxIIj1ADulSGFisGh8wnsKJIaIx1mxN1xJx8ilBkjyQA=

Array version (simplified): https://sharplab.io/#v2:C4LghgzgtgPgAgZgAQWAJwK4GNhIKoCSAdsAIwBMAHAN4CwAUEk0hgDYD2RA5kgPoYBLEgAYA3A2YsO3PoJKlxjZm049+Q4OUWSVM9SQTbmifMTJUAFHNxgANCw1IARvetIsrxwBMAlHSWSTPrAwkgAvEhgRoGyGqThztGBweQJWEmSwcgRXkkAvgyF9CYYRBBgAGYApkhwpABspvJUSOwADlVoYMDsaEgAVBaEzZRIAvbD5qMAVj5I/hJIAG5gfQQJqGBYANZgrBxYUqoA2qQAukYrfQBSG8Bbu/vsh7pcpxeLV0gASncPewcjtx3opFgRjsIzgkBAA6YJiJCLa4QqERaZw6xiIqLAQVJAWcGQpAAQgiFAQABY5v5JN8UUgANQRQlQ/pIZGQowFeiLOAAdiQRCqAHcmlMLFYND5hPZJSRpbLrAr8Uq6ZCfIpuZI8kA=

Judging from the array version's IL (L0009 - L004E), the variable version works purely in registers, whereas the array version goes through the stack. Although the data fits in the CPU cache, accessing it is still slower than accessing a register.
