
Tensorflow (GPU) vs. Numpy

So I've got two implementations of a linear regression using gradient descent: one in Tensorflow, one in Numpy. I'm finding the Numpy one is about 3x faster than the Tensorflow one. Here's my code -

Tensorflow:

import tensorflow as tf

class network_cluster(object):
    def __init__(self, data_frame, feature_cols, label_cols):
        self.init_data(data_frame, feature_cols, label_cols)
        self.init_tensors()

    def init_data(self, data_frame, feature_cols, label_cols):
        self.data_frame = data_frame
        self.feature_cols = feature_cols
        self.label_cols = label_cols

    def init_tensors(self):
        self.features = tf.placeholder(tf.float32)
        self.labels = tf.placeholder(tf.float32)

        self.weights = tf.Variable(tf.random_normal((len(self.feature_cols), len(self.label_cols))))
        self.const = tf.Variable(tf.random_normal((len(self.label_cols),)))

    def linear_combiner(self):
        return tf.add(tf.matmul(self.features, self.weights), self.const)

    def predict(self):
        return self.linear_combiner()

    def error(self):
        return tf.reduce_mean(tf.pow(self.labels - self.predict(), 2), axis = 0)

    def learn_model(self, epocs = 100):
        optimizer = tf.train.AdadeltaOptimizer(1).minimize(self.error())

        error_rcd = []
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for epoc in range(epocs):
                _, error = sess.run([optimizer, self.error()], feed_dict={
                    self.features: self.data_frame[self.feature_cols],
                    self.labels: self.data_frame[self.label_cols]
                })
                error_rcd.append(error[0])

        return error_rcd

    def get_coefs(self):
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())

            coefs = sess.run([self.weights, self.const])

        return coefs

test_cluster = network_cluster(dataset, ['ship_jumps', 'npc_kills', 'ship_kills', 'pod_kills'], ['hour_of_week'])
%timeit test_cluster.learn_model(epocs = 100)

And numpy:

import numpy as np

def grad_descent(dataset, features, predictor, max_iters = 10000):

    def initialize_model(dataset, features, predictor):
        constant_array = np.ones(shape = (len(dataset), 1))
        features_array = dataset.loc[:, features].values
        features_array = np.append(constant_array, features_array, axis = 1)
        predict_array = dataset.loc[:, predictor].values
        betas = np.zeros(shape = (len(features) + 1, len(predictor)))
        return (features_array, predict_array, betas)

    def calc_gradient(features_array, predict_array, betas):
        prediction = np.dot(features_array, betas)
        predict_error = predict_array - prediction
        gradient = -2 * np.dot(features_array.transpose(), predict_error)
        gradient_two = 2 * np.expand_dims(np.sum(features_array ** 2, axis = 0), axis = 1)
        return (gradient, gradient_two)

    def update_betas(gradient, gradient_two, betas):
        new_betas = betas - ((gradient / gradient_two) / len(betas))
        return new_betas

    def model_error(features_array, predict_array, betas):
        prediction = np.dot(features_array, betas)
        predict_error = predict_array - prediction
        model_error = np.sqrt(np.mean(predict_error ** 2))
        return model_error

    features_array, predict_array, betas = initialize_model(dataset, features, predictor)
    prior_error = np.inf
    for iter_count in range(max_iters):
        gradient, gradient_two = calc_gradient(features_array, predict_array, betas)
        betas = update_betas(gradient, gradient_two, betas)
        curr_error = model_error(features_array, predict_array, betas)
        if curr_error == prior_error:
            break
        prior_error = curr_error
    return (betas, iter_count, curr_error)

%timeit grad_descent(dataset, ['ship_jumps', 'npc_kills', 'ship_kills', 'pod_kills'], ['hour_of_week'], max_iters = 100)

I'm testing in the Spyder IDE, and I do have an Nvidia GPU (a 960). The Tensorflow code clocks in at ~20 seconds, and the Numpy code at about 7 seconds on the same dataset. The dataset is almost 1 million rows.

I would have expected Tensorflow to beat out Numpy handily here, but that's not the case. Granted I am new to using Tensorflow, and the Numpy implementation doesn't use a class, but still, 3x better with Numpy?!

Hoping for some thoughts/ideas on what I'm doing wrong here.

Without looking at your code in detail (not that much experience with TF):

This comparison is flawed!

  • Yaroslav's comment is of course true: GPU computing has some overhead (at least data preparation and transfer; I'm not sure how much of the graph construction is being clocked here)
  • You are comparing pure GD to Adadelta in full-batch mode, it seems (see the sketch after this list):
    • Adadelta of course implies some overhead (there are more operations per step than calculating the gradient and updating the current iterate), as it's one of the common adaptive (per-parameter learning-rate) methods, which come with a price!
      • The idea is: invest some additional operations per step to:
        • reduce the number of iterations needed for a given learning rate
        • (and, more importantly for most people: achieve good convergence with the default learning rates)
  • It seems you are just running 100 epochs of each and clocking that
    • That's not meaningful!
      • It's entirely possible that the objective values reached are very different:
        • if the number of iterations is not enough
        • or the initial learning rate is badly chosen
      • or the objectives are the same, but the missing early stopping means a possibly better algorithm with proven convergence (according to some criterion) wastes additional time running all 100 iterations anyway!
  • (Adadelta was probably designed for the SGD setting, not full-batch GD)
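
To isolate the optimizer choice from everything else, one option is to build the loss once, create both a plain-GD and an Adadelta training op on the same graph, and time only the training loop. The following is just a rough sketch, not a fix: the random stand-in data, the TF 1.x API and the (untuned) learning rates are my assumptions, not part of the question's code.

import time
import numpy as np
import tensorflow as tf

# Hypothetical random data standing in for the real dataset
# (4 feature columns, 1 label column); replace with the actual arrays.
X = np.random.rand(100000, 4).astype(np.float32)
y = np.random.rand(100000, 1).astype(np.float32)

features = tf.placeholder(tf.float32, shape=(None, 4))
labels = tf.placeholder(tf.float32, shape=(None, 1))
weights = tf.Variable(tf.random_normal((4, 1)))
const = tf.Variable(tf.random_normal((1,)))

# Build the loss tensor once and reuse it (calling a loss-building method
# again inside the loop, as in learn_model, adds new nodes to the graph).
loss = tf.reduce_mean(tf.pow(labels - (tf.matmul(features, weights) + const), 2))

# Untuned learning rates -- both values are placeholders that need tuning.
train_ops = [('plain GD', tf.train.GradientDescentOptimizer(1e-6).minimize(loss)),
             ('Adadelta', tf.train.AdadeltaOptimizer(1.0).minimize(loss))]

with tf.Session() as sess:
    for name, train_op in train_ops:
        sess.run(tf.global_variables_initializer())   # reset between runs
        start = time.time()                           # time only the epochs
        for _ in range(100):
            sess.run(train_op, feed_dict={features: X, labels: y})
        final_loss = sess.run(loss, feed_dict={features: X, labels: y})
        print(name, 'final loss:', final_loss, 'time:', time.time() - start)

The interesting number is which optimizer reaches the lower final loss for a given wall-clock budget, not the per-epoch speed alone.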

It's very hard to compare such different algorithms, especially when using just one task / dataset.

Even if you introduced early stopping, you would observe random-seed-dependent, non-deterministic performance which is hard to interpret.
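
As a sketch of what such a stopping rule could look like for the Numpy loop above (the tolerance value is an arbitrary assumption and directly changes the measured runtime):

import numpy as np

def converged(prior_error, curr_error, tol=1e-8):
    """Relative-tolerance stopping test; a possible replacement for the
    exact-equality check (curr_error == prior_error) in grad_descent."""
    if not np.isfinite(prior_error):
        return False                      # first iteration: never stop
    return abs(prior_error - curr_error) <= tol * max(abs(prior_error), 1.0)

# inside grad_descent's loop, instead of "if curr_error == prior_error: break":
#     if converged(prior_error, curr_error):
#         break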

You are basically measuring iteration time, but that is not a good measure. Compare first-order methods (gradient-based: SGD, GD, ...) with second-order methods (Hessian-based: Newton): the latter is very slow per iteration, but usually attains quadratic convergence, so far fewer iterations are needed! In NN applications the analogous comparison is LBFGS vs. SGD/... (although I don't know if LBFGS is available in TF; Torch supports it). LBFGS is known for very fast local convergence, which is again hard to interpret in real-world tasks (especially as the limited-memory approximation of the inverse Hessian is a parameter of LBFGS). The same comparison can be made in linear programming, where the simplex method has fast iterations while interior-point methods (basically Newton-based, though handling the constraints requires some additional ideas) are much slower per iteration, despite often reaching convergence in far fewer iterations.
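
To make the first-order vs. quasi-Newton point concrete on exactly this (convex) least-squares problem, here is a rough sketch using SciPy's L-BFGS-B; SciPy as the solver, the intercept handling and the 1-D target vector are my assumptions, not something the question's code uses.

import numpy as np
from scipy.optimize import minimize

def lbfgs_linreg(X, y):
    """Quasi-Newton (L-BFGS-B) fit of the same convex least-squares
    objective, as a point of comparison against the first-order loops."""
    n = X.shape[0]
    Xc = np.hstack([np.ones((n, 1)), X])          # prepend intercept column

    def loss_and_grad(beta):
        resid = Xc @ beta - y                     # y assumed 1-D here
        loss = np.mean(resid ** 2)
        grad = 2.0 * Xc.T @ resid / n
        return loss, grad

    res = minimize(loss_and_grad, x0=np.zeros(Xc.shape[1]),
                   jac=True, method='L-BFGS-B')
    return res.x, res.fun, res.nit                # coefficients, MSE, iterations

It typically needs far fewer iterations than plain GD, each iteration being more expensive, which is exactly why comparing per-iteration or fixed-iteration-count timings is misleading.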

What I ignored here: nearly all theoretical results regarding convergence and the like are limited to convex and smooth functions. NNs are typically non-convex, which means the task of evaluating these performance measures is even harder. But your problem here is convex, of course.
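
Since the problem is plain linear least squares, one way to ground the comparison is to compute the exact optimum directly and check how close each method's final objective actually got; a minimal sketch (again assuming a 1-D target and an explicit intercept column):

import numpy as np

def reference_mse(X, y):
    """Closed-form least-squares fit as a yardstick: because the problem is
    convex, this gives the best achievable MSE, against which the objective
    reached by each iterative run can be checked."""
    Xc = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    return np.mean((Xc @ beta - y) ** 2)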

I also have to admit that my answer only scratches the surface of this complex problem, even though unconstrained smooth convex optimization is one of the easier tasks in numerical optimization (compared to constrained, nonsmooth, nonconvex optimization).

For a general introduction to numerical optimization, which also talks a lot about first-order vs. second-order methods (and there are many methods in between), I recommend Numerical Optimization by Nocedal and Wright, which can be found on the web.
