.NET 4.6 RC x64 is twice as slow as x86 (release version)
Consider this code:
using System;
using System.Diagnostics;

class SpectralNorm
{
    public static void Main(String[] args)
    {
        int n = 5500;
        if (args.Length > 0) n = Int32.Parse(args[0]);
        var spec = new SpectralNorm();
        var watch = Stopwatch.StartNew();
        var res = spec.Approximate(n);
        Console.WriteLine("{0:f9} -- {1}", res, watch.Elapsed.TotalMilliseconds);
    }

    double Approximate(int n)
    {
        // create unit vector
        double[] u = new double[n];
        for (int i = 0; i < n; i++) u[i] = 1;

        // 20 steps of the power method (10 iterations, 2 multiplications each)
        double[] v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 0;
        for (int i = 0; i < 10; i++)
        {
            MultiplyAtAv(n, u, v);
            MultiplyAtAv(n, v, u);
        }

        // B=AtA         A multiplied by A transposed
        // v.Bv /(v.v)   eigenvalue of v
        double vBv = 0, vv = 0;
        for (int i = 0; i < n; i++)
        {
            vBv += u[i] * v[i];
            vv += v[i] * v[i];
        }
        return Math.Sqrt(vBv / vv);
    }

    /* return element i,j of infinite matrix A */
    double A(int i, int j)
    {
        return 1.0 / ((i + j) * (i + j + 1) / 2 + i + 1);
    }

    /* multiply vector v by matrix A */
    void MultiplyAv(int n, double[] v, double[] Av)
    {
        for (int i = 0; i < n; i++)
        {
            Av[i] = 0;
            for (int j = 0; j < n; j++) Av[i] += A(i, j) * v[j];
        }
    }

    /* multiply vector v by matrix A transposed */
    void MultiplyAtv(int n, double[] v, double[] Atv)
    {
        for (int i = 0; i < n; i++)
        {
            Atv[i] = 0;
            for (int j = 0; j < n; j++) Atv[i] += A(j, i) * v[j];
        }
    }

    /* multiply vector v by matrix A and then by matrix A transposed */
    void MultiplyAtAv(int n, double[] v, double[] AtAv)
    {
        double[] u = new double[n];
        MultiplyAv(n, v, u);
        MultiplyAtv(n, u, AtAv);
    }
}
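On my machine, the x86 release build takes 4.5 seconds to complete, while the x64 build takes 9.5 seconds. Are there any specific flags or settings needed for x64?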
UPDATE
It turns out that RyuJIT plays a part in this issue. If useLegacyJit is enabled in app.config, the result is different, and this time x64 is faster.
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6"/>
  </startup>
  <runtime>
    <useLegacyJit enabled="1" />
  </runtime>
</configuration>
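When comparing timings with and without useLegacyJit, it can help to confirm that the process really is 64-bit and to see which JIT library it loaded. The following is a minimal sketch, not part of the original question: the class name JitProbe and the approach of filtering loaded modules by the substring "jit" are assumptions made for illustration, and the exact module file names vary by runtime version, so the code only prints whatever it finds.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper (not from the original post): reports process bitness
// and any loaded native modules whose file name contains "jit".
static class JitProbe
{
    // Returns the file names of loaded modules whose name contains "jit".
    public static List<string> LoadedJitModules()
    {
        var names = new List<string>();
        foreach (ProcessModule module in Process.GetCurrentProcess().Modules)
        {
            string name = module.ModuleName ?? string.Empty;
            if (name.IndexOf("jit", StringComparison.OrdinalIgnoreCase) >= 0)
                names.Add(name);
        }
        return names;
    }

    // Prints the bitness of the current process and the modules found above.
    public static void Report()
    {
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
        foreach (string name in LoadedJitModules())
            Console.WriteLine("JIT module: {0}", name);
    }
}
```

Calling JitProbe.Report() at the end of Main, once with and once without the config setting, should show whether the setting changed the loaded module.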
UPDATE
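The cause of the perf regression has been answered on GitHub; in short, it appears to reproduce only on Intel machines and not on Amd64 ones. The inner loop operation
    Av[i] += v[j] * A(i, j);
results in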
IN002a: 000093 lea eax, [rax+r10+1]
IN002b: 000098 cvtsi2sd xmm1, rax
IN002c: 00009C movsd xmm2, qword ptr [@RWD00]
IN002d: 0000A4 divsd xmm2, xmm1
IN002e: 0000A8 movsxd eax, edi
IN002f: 0000AB movaps xmm1, xmm2
IN0030: 0000AE mulsd xmm1, qword ptr [r8+8*rax+16]
IN0031: 0000B5 addsd xmm0, xmm1
IN0032: 0000B9 movsd qword ptr [rbx], xmm0
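cvtsi2sd performs a partial write to the lower 8 bytes of its destination and leaves the upper bytes of the xmm register unmodified. In the repro case, xmm1 is partially written, but there are further uses of xmm1 later in the code. This creates a false dependency between cvtsi2sd and the other instructions that use xmm1, which hurts instruction-level parallelism. Indeed, modifying the codegen for the int-to-float cast to emit "xorps xmm1, xmm1" before cvtsi2sd fixes the perf regression, because zeroing the register first breaks the dependency on its previous value.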
Workaround: the perf regression can also be avoided by reversing the order of the operands in the multiplication in the MultiplyAv / MultiplyAtv methods:
void MultiplyAv(int n, double[] v, double[] Av)
{
    for (int i = 0; i < n; i++)
    {
        Av[i] = 0;
        for (int j = 0; j < n; j++)
            Av[i] += v[j] * A(i, j);  // order of operands reversed
    }
}
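The workaround names both methods but shows only MultiplyAv. As a sketch of the same change applied to the transposed multiply, the version below swaps the operands in MultiplyAtv as well. The wrapper class ReversedOperands and the MultiplyAtvOriginal method are hypothetical, added only so the snippet compiles on its own and the two operand orders can be compared. Floating-point multiplication is commutative, so both orders give the same numeric result; only the generated code differs.

```csharp
using System;

// Hypothetical wrapper class so the sketch is self-contained.
static class ReversedOperands
{
    // Element (i, j) of the infinite matrix A, copied from the question.
    static double A(int i, int j)
    {
        return 1.0 / ((i + j) * (i + j + 1) / 2 + i + 1);
    }

    // Multiply vector v by matrix A transposed, with v[j] as the left operand.
    public static void MultiplyAtv(int n, double[] v, double[] Atv)
    {
        for (int i = 0; i < n; i++)
        {
            Atv[i] = 0;
            for (int j = 0; j < n; j++)
                Atv[i] += v[j] * A(j, i);  // order of operands reversed
        }
    }

    // Operand order from the question, kept only for comparison.
    public static void MultiplyAtvOriginal(int n, double[] v, double[] Atv)
    {
        for (int i = 0; i < n; i++)
        {
            Atv[i] = 0;
            for (int j = 0; j < n; j++)
                Atv[i] += A(j, i) * v[j];
        }
    }
}
```

Whether the swap has any effect depends on the JIT version in use, so it is worth timing both variants on the target machine rather than assuming a speedup.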