Theano: matrix dot product differs from NumPy on GPU but not on CPU

I recently got Theano working on Windows 10 with CUDA v7.5, cuDNN v3, and Visual Studio 2013 Community Edition. To verify that it was working correctly, I ran the following code from the Theano Windows install page on both CPU and GPU:

import time

import numpy as np
import theano

# Two large random matrices in Theano's configured float type (float32 here).
A = np.random.rand(10000, 10000).astype(theano.config.floatX)
B = np.random.rand(10000, 10000).astype(theano.config.floatX)

# Time the matrix product in NumPy, which always runs on the CPU.
np_start = time.time()
AB = A.dot(B)
np_end = time.time()

# Compile the same product as a Theano function (CPU or GPU, depending on the config).
X, Y = theano.tensor.matrices('XY')
mf = theano.function([X, Y], X.dot(Y))
t_start = time.time()
tAB = mf(A, B)
t_end = time.time()

print("NP time: %f[s], theano time: %f[s] (times should be close when run on CPU!)" % (
    np_end - np_start, t_end - t_start))
print("Result difference: %f" % (np.abs(AB - tAB).max(),))

I got the following results:

G:\ml\Theano\Projects>python Test.py
NP time: 10.585000[s], theano time: 10.587000[s] (times should be close when run on CPU!)
Result difference: 0.000000

G:\ml\Theano\Projects>python Test.py
Using gpu device 0: GeForce GTX 970 (CNMeM is disabled)
NP time: 10.838000[s], theano time: 1.294000[s] (times should be close when run on CPU!)
Result difference: 0.022461

As you can see, the maximum absolute difference is 0.022 when the calculation runs on the GPU, while it is zero on the CPU. Is this to be expected, or am I doing something wrong?

Here is my .theanorc:

[global]
device = gpu
floatX = float32

[nvcc]
fastmath = True

The GPU doesn't perform the additions and multiplications in the same order as the CPU. Because floating-point arithmetic is not exact, the rounded result depends on the order of operations, so it is normal to see some differences.
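
As a minimal illustration of why the order matters, the sketch below adds the same three float32 values in two different groupings. The values are chosen only for demonstration and have nothing to do with the matrices in the question.

```python
import numpy as np

# Three float32 values; 1e8 is exactly representable, but near 1e8
# consecutive float32 values are 8 apart.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# Grouping 1: the large values cancel first, so the 1.0 survives.
left_to_right = (a + b) + c

# Grouping 2: -1e8 + 1.0 rounds back to -1e8, so the 1.0 is lost.
right_to_left = a + (b + c)

print("(a + b) + c = %r" % float(left_to_right))
print("a + (b + c) = %r" % float(right_to_left))
```

A matrix product sums thousands of such terms per entry, and the CPU and GPU libraries are free to group those sums differently.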

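An absolute difference of that size can be normal if the relative difference is small. Here each entry of the product is a sum of 10,000 products of values drawn uniformly from [0, 1), so it is on the order of 2,500; a difference of 0.022 is then a relative error of roughly 1e-5, which is plausible for a float32 sum of that length.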

To compare them more "correctly", use a tolerance-based check such as theano.tensor.basic._allclose(result1, result2) rather than looking at the raw maximum absolute difference.
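
The same kind of comparison can be sketched with NumPy alone. The example below is an assumption-laden stand-in, not the setup from the question: it uses smaller matrices, a fixed seed, and a float64 product in place of a second backend, and the tolerance `rtol=1e-4` is an illustrative choice rather than the value Theano uses.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500  # much smaller than the 10000 used in the question

A = rng.rand(n, n).astype(np.float32)
B = rng.rand(n, n).astype(np.float32)

# float32 product, standing in for the result from one backend.
ab32 = A.dot(B)

# float64 product of the same inputs, used as a higher-precision reference.
ab64 = A.astype(np.float64).dot(B.astype(np.float64))

# The absolute difference grows with the magnitude of the entries...
abs_diff = np.abs(ab32 - ab64).max()

# ...so the relative difference is the more meaningful number.
rel_diff = (np.abs(ab32 - ab64) / np.abs(ab64)).max()

# Tolerance-based check: |ab32 - ab64| <= atol + rtol * |ab64| elementwise.
close = np.allclose(ab32, ab64, rtol=1e-4, atol=0.0)

print("max absolute difference: %g" % abs_diff)
print("max relative difference: %g" % rel_diff)
print("allclose with rtol=1e-4: %s" % close)
```

If the relative difference is tiny and the tolerance check passes, a nonzero absolute difference like the one in the question is not a sign of a broken installation.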

