
XLA on CPU -- where do the gains come from?

I understand that XLA performs automatic kernel fusion for a computational graph, which comes in handy in reducing memory bandwidth usage on a GPU. What gains can one derive using XLA for a CPU? Is it the same principle, fusing computations and not writing intermediate results to the L1 cache? I would appreciate a layman's explanation.

Yes, basically it's what you said.

In general, the more information (or "context") you, as a compiler, have about a set of computations, the better you can optimize them.

As pointed out in the XLA page, the single most important feature of XLA is fusion.
Instead of computing x + y*z as two separate operations, it can be computed as a single fused multiply-add (FMA) operation.
This is not only (generally) faster, but it also avoids intermediate results, which may have lower precision and need to be stored somewhere.
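A minimal sketch of the idea in plain Python (a conceptual illustration, not actual XLA output): the unfused version materializes y*z as a full intermediate buffer before the addition can run, while the fused version combines both operations into a single per-element step, so no intermediate buffer is ever written.

```python
# Unfused: y*z is stored as a full intermediate result,
# then read back to compute the addition -- two passes over the data.
def unfused(x, y, z):
    tmp = [yi * zi for yi, zi in zip(y, z)]   # intermediate buffer
    return [xi + ti for xi, ti in zip(x, tmp)]

# Fused: each element is multiplied and added in one step,
# analogous to a single fused multiply-add kernel.
def fused(x, y, z):
    return [xi + yi * zi for xi, yi, zi in zip(x, y, z)]

print(fused([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]))  # [16.0, 26.0]
```

Both produce the same values; the difference is that the fused form never writes `tmp`, which is exactly the memory traffic (and, on real hardware, the extra rounding step) that fusion eliminates.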

Presumably, the TensorFlow runtime works by taking a set of data from memory, running one of a defined set of kernels on it, and storing each partial result back in memory so the next kernel can consume it.
With XLA, linear-algebra patterns are recognized and further optimized by combining one or more kernels together, avoiding unnecessary round trips to and from memory.
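To make the memory-traffic argument concrete, here is a toy sketch (hypothetical, not TensorFlow's actual execution engine) that counts how many array elements each strategy writes back to memory when evaluating out = x + y*z:

```python
def op_by_op(x, y, z):
    """Each op runs as its own kernel and stores its full result."""
    writes = 0
    tmp = [yi * zi for yi, zi in zip(y, z)]    # kernel 1: multiply
    writes += len(tmp)                         # intermediate stored to memory
    out = [xi + ti for xi, ti in zip(x, tmp)]  # kernel 2: add reads tmp back
    writes += len(out)                         # final result stored
    return out, writes

def fused_kernel(x, y, z):
    """One fused kernel: only the final result ever hits memory."""
    out = [xi + yi * zi for xi, yi, zi in zip(x, y, z)]
    return out, len(out)

n = 4
x, y, z = [1.0] * n, [2.0] * n, [3.0] * n
print(op_by_op(x, y, z)[1])      # 8 element writes (two buffers of 4)
print(fused_kernel(x, y, z)[1])  # 4 element writes (one buffer of 4)
```

For a chain of k fused elementwise ops the savings scale the same way: op-by-op execution writes (and later re-reads) k buffers, while the fused kernel writes one.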

Modern mainstream CPUs have support for "vectors" (in jargon: SIMD), and some support linear-algebra (LA) operations much as GPUs do.
So yes, it's the same principle (though GPUs can do far more LA operations in parallel, so the gain is bigger there).
