Tensorflow Eager Execution GPU count_nonzero NotFoundError

Question

According to an answer in this thread (NotFoundError on OpKernel when using tf.nn.embedding_lookup in tensorflow eager mode), some ops are not implemented on GPU yet.

I have a problem with an op where I also get a NotFoundError, but the error message confuses me. Here is my sample code, using TensorFlow 1.10. I know that I can omit the device forcing and TensorFlow will run the operation on the CPU, but I would like to do as much as possible on the GPU.

import tensorflow as tf

tf.enable_eager_execution()
print("Eager execution: {}".format(tf.executing_eagerly()))

device = 'gpu:0'
with tf.device(device):

    x = tf.constant([195330., 195075., 173910., 167535., 167535., 170340., 206040., 175185., 206040.,
                     118575., 214710., 171870., 204765., 202215.,      0.,      0.,      0.,      0.,
                     0.,      0.], dtype=tf.float32)

    print(tf.count_nonzero(x))

I get the following error:

python3 test.py 
Eager execution: True
2018-09-28 14:41:51.186066: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2018-09-28 14:41:51.370081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.38GiB
2018-09-28 14:41:51.467475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 1 with properties: 
name: GeForce GT 730 major: 3 minor: 5 memoryClockRate(GHz): 0.9015
pciBusID: 0000:02:00.0
totalMemory: 1.96GiB freeMemory: 1.93GiB
2018-09-28 14:41:51.467534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1469] Ignoring visible gpu device (device: 1, name: GeForce GT 730, pci bus id: 0000:02:00.0, compute capability: 3.5) with Cuda multiprocessor count: 2. The minimum required count is 8. You can adjust this requirement with the env var TF_MIN_GPU_MULTIPROCESSOR_COUNT.
2018-09-28 14:41:51.467543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-28 14:41:51.848119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-28 14:41:51.848172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 
2018-09-28 14:41:51.848195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N N 
2018-09-28 14:41:51.848206: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N N 
2018-09-28 14:41:51.848446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5143 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    print(tf.count_nonzero(x))
  File "/home/joe/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/joe/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1384, in count_nonzero
    reduction_indices=reduction_indices),
  File "/home/joe/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "/home/joe/.local/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1307, in reduce_sum
    name=name))
  File "/home/joe/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 8283, in _sum
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: No registered 'Sum' OpKernel for GPU devices compatible with node Sum = Sum[T=DT_INT64, Tidx=DT_INT32, keep_dims=false](dummy_input, dummy_input)
     (OpKernel was found, but attributes didn't match)
    .  Registered:  device='CPU'; T in [DT_COMPLEX128]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_COMPLEX128]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_COMPLEX64]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_COMPLEX64]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_BFLOAT16]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_BFLOAT16]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_HALF]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_HALF]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_INT8]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_INT8]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_UINT8]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_UINT8]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_INT16]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_INT16]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_UINT16]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_UINT16]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_INT32]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_INT32]; Tidx in [DT_INT32]
  device='CPU'; T in [DT_INT64]; Tidx in [DT_INT64]
  device='CPU'; T in [DT_INT64]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_INT32]; Tidx in [DT_INT64]
  device='GPU'; T in [DT_INT32]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_COMPLEX128]; Tidx in [DT_INT64]
  device='GPU'; T in [DT_COMPLEX128]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_COMPLEX64]; Tidx in [DT_INT64]
  device='GPU'; T in [DT_COMPLEX64]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT64]
  device='GPU'; T in [DT_DOUBLE]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT64]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_HALF]; Tidx in [DT_INT64]
  device='GPU'; T in [DT_HALF]; Tidx in [DT_INT32]
 [Op:Sum]

As far as I understand the error

No registered 'Sum' OpKernel for GPU devices compatible with node Sum = Sum[T=DT_INT64, Tidx=DT_INT32, keep_dims=false](dummy_input, dummy_input)

it looks for an implementation with T=DT_INT64, Tidx=DT_INT32, but the tensor is of type float32. Am I missing something?

Answer 1

I've replaced count_nonzero with a combination of greater and reduce_sum (casting the boolean array from the greater op to float32). Now it works on the GPU:

print(tf.reduce_sum(tf.cast(tf.greater(x, 0), tf.float32)))
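
One caveat with the one-liner above: greater(x, 0) counts only positive entries, while count_nonzero compares with not_equal, so the two differ as soon as the input contains negative values. Below is a minimal sketch that keeps the float32 sum from this answer but swaps in not_equal. The name count_nonzero_float is a hypothetical helper, not a TensorFlow function, and the hasattr guard is only an assumption-friendly way to let the snippet run on TensorFlow 2.x as well, where eager mode is already the default.

```python
import tensorflow as tf

# TensorFlow 1.x (as in the question) needs eager mode enabled explicitly.
# TensorFlow 2.x is eager by default and has no top-level function for it.
if hasattr(tf, "enable_eager_execution"):
    tf.enable_eager_execution()


def count_nonzero_float(x):
    """Count the non-zero entries of x, summing in float32 instead of int64.

    Hypothetical helper for illustration; not part of TensorFlow.
    """
    zero = tf.zeros([], dtype=x.dtype)  # scalar zero, broadcast by not_equal
    mask = tf.not_equal(x, zero)        # True for positive and negative values
    return tf.reduce_sum(tf.cast(mask, tf.float32))


x = tf.constant([195330., -118575., 0., 0., 202215.], dtype=tf.float32)
result = count_nonzero_float(x)
print(result)
```

To force placement as in the question, wrap the call in `with tf.device('gpu:0'):`. Since float32 represents integers exactly only up to 2^24, very large counts would need a different sum dtype.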

Answer 2

Look at the implementation here:

def count_nonzero(input_tensor,
                  axis=None,
                  keepdims=None,
                  dtype=dtypes.int64,
                  name=None,
                  reduction_indices=None,
                  keep_dims=None):
  keepdims = deprecation.deprecated_argument_lookup("keepdims", keepdims,
                                                    "keep_dims", keep_dims)
  if keepdims is None:
    keepdims = False

  with ops.name_scope(name, "count_nonzero", [input_tensor]):
    input_tensor = ops.convert_to_tensor(input_tensor, name="input_tensor")
    # A scalar of 'zero' is enough as `not_equal` will broadcast.
    zero = array_ops.zeros([], dtype=input_tensor.dtype)
    return cast(
        reduce_sum(
            # int64 reduction happens on GPU
            to_int64(gen_math_ops.not_equal(input_tensor, zero)),
            axis=axis,
            keepdims=keepdims,
            reduction_indices=reduction_indices),
        dtype=dtype)

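Note how there is a cast to int64 before reduce_sum is called. That is why TF searches for a Sum kernel on int64 instead of your original float32. As the list in your error message shows, Sum has GPU kernels registered for int32, float, double, half and the complex types, but none for int64.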
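
If it helps to see the mismatch mechanically, the "Registered:" list from the error message can be parsed and queried. This is only an illustrative sketch in plain Python: the excerpt is shortened to four lines copied from the message, and parse_kernels and has_kernel are hypothetical helper names, not TensorFlow APIs.

```python
import re

# Shortened excerpt of the "Registered:" list from the error message.
REGISTERED = """
  device='CPU'; T in [DT_INT64]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_INT32]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_FLOAT]; Tidx in [DT_INT32]
  device='GPU'; T in [DT_HALF]; Tidx in [DT_INT32]
"""

# One registration per line: device, value dtype T, index dtype Tidx.
LINE = re.compile(r"device='(\w+)'; T in \[(\w+)\]; Tidx in \[(\w+)\]")


def parse_kernels(text):
    """Return the set of (device, T, Tidx) triples found in the listing."""
    return {match.groups() for match in LINE.finditer(text)}


def has_kernel(kernels, device, t, tidx):
    """Check whether a kernel with exactly these attributes is listed."""
    return (device, t, tidx) in kernels


kernels = parse_kernels(REGISTERED)

# What the failing node asked for: Sum[T=DT_INT64, Tidx=DT_INT32] on GPU.
print(has_kernel(kernels, "GPU", "DT_INT64", "DT_INT32"))
# The same attributes on CPU, and a float sum on GPU, are both listed.
print(has_kernel(kernels, "CPU", "DT_INT64", "DT_INT32"))
print(has_kernel(kernels, "GPU", "DT_FLOAT", "DT_INT32"))
```

The first lookup fails because of the to_int64 cast inside count_nonzero, which is also why summing in float32 instead avoids the error.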

2 answers

solution1
1 2018-09-30 07:42:49

solution2
0 ACCEPTED 2018-09-28 12:27:35
