
Flops in tensorflow : Matrix multiplication

Inspired by this question, I tried to measure the FLOPs required by TensorFlow for a matrix-matrix multiplication.

For two matrices A and B with sizes (m x p) and (p x n), respectively, the resulting matrix C = AB with size (m x n) has mn entries. For each entry, p multiplications and (p-1) additions are required. Hence, the total number of operations is mn(2p-1).

With the code from the linked question/answer, TensorFlow outputs m*n*2p instead; see the code below.

Why is this approximation returned rather than the theoretical value? In the worst case, p = 1, this approximation is a factor of 2 larger than the correct value.

import numpy as np
import tensorflow as tf
g = tf.Graph()
run_meta = tf.RunMetadata()
with g.as_default():
    A=tf.convert_to_tensor(np.random.rand(13,9))
    B=tf.convert_to_tensor(np.random.rand(9,7))
    C = tf.matmul(A,B) # shape=[13,7]

    opts = tf.profiler.ProfileOptionBuilder.float_operation()    
    flops = tf.profiler.profile(g, run_meta=run_meta, cmd='op', options=opts)
    if flops is not None:
        print('Flops should be ', 13*7*(2*9-1))
        print('Approximation 2*13*7*9=',2*13*7*9) 
        print('TF stats gives',flops.total_float_ops)

#Output: 
#Flops should be  1547
#Approximation 2*13*7*9= 1638
#TF stats gives 1638
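The two counts can be reproduced with plain arithmetic (a small self-contained sketch of the formulas above; the function names are mine, not part of any TensorFlow API):

```python
def theoretical_flops(m, n, p):
    # m*n entries, each costing p multiplications and (p - 1) additions
    return m * n * (2 * p - 1)

def tf_reported_flops(m, n, p):
    # the profiler charges 2 flops (one multiply, one add) per inner-loop step
    return 2 * m * n * p

print(theoretical_flops(13, 7, 9))  # 1547
print(tf_reported_flops(13, 7, 9))  # 1638
# worst case p == 1: the approximation is exactly twice the true count
print(tf_reported_flops(5, 5, 1), theoretical_flops(5, 5, 1))  # 50 25
```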

I think this is because, in practice, summations are often coded like this (pseudo-code below):

total = 0
for i in 0...p
  total += x[i] * y[i]

that is, the first element x[0] * y[0] is added to total (which is 0 at that point), which yields p additions rather than p-1.
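A runnable Python version of that pseudo-code (an illustrative sketch, not TensorFlow's actual kernel) makes the 2p count explicit:

```python
def dot_with_flop_count(x, y):
    """Naive dot product that tallies one flop per multiply and one per add."""
    total = 0.0
    flops = 0
    for xi, yi in zip(x, y):
        total += xi * yi  # one multiply + one add, even for the first term
        flops += 2
    return total, flops

value, flops = dot_with_flop_count([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(value, flops)  # 32.0 6 -> 2p flops for p == 3, not 2p - 1
```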

You could try to be smart and avoid this extra addition:

total = x[0] * y[0]
for i in 1...p
  total += x[i] * y[i]

... but then what happens if p==0? Ouch, we need to add an extra comparison:

if p > 0
  total = x[0] * y[0]
  for i in 1...p
    total += x[i] * y[i]
else
  total = 0

The thing is, this comparison is not a flop and will not appear in your flop count -- yet in practice it is as costly as, if not more costly than, a simple add.

Bottom line:

  • The flop calculation is probably correct if the implementation does not "optimize away" the initial sum
  • This "optimization" may actually not speed up your code
  • Take flop measures with a grain of salt, and don't worry too much about vanishing terms.

I'm not sure why, but I think this is the "coded" theoretical value:

...

@ops.RegisterStatistics("MatMul", "flops")
def _calc_mat_mul_flops(graph, node):
  """Calculates the compute resources needed for MatMul."""
  transpose_a = node.attr["transpose_a"].b
  a_shape = graph_util.tensor_shape_from_node_def_name(graph, node.input[0])
  a_shape.assert_is_fully_defined()
  if transpose_a:
    k = int(a_shape[0])
  else:
    k = int(a_shape[1])
  output_shape = graph_util.tensor_shape_from_node_def_name(graph, node.name)
  output_shape.assert_is_fully_defined()
  output_count = np.prod(output_shape.as_list())
  return ops.OpStats("flops", (k * output_count * 2))

...
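Plugging the example shapes into that registered formula reproduces the profiler's output (a quick sanity check; k is the inner dimension of the matmul):

```python
import numpy as np

k = 9                    # inner dimension p, shared by A (13x9) and B (9x7)
output_shape = [13, 7]   # shape of C = AB
output_count = np.prod(output_shape)
print(k * output_count * 2)  # 1638, matching flops.total_float_ops
```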
