
FORTRAN is faster than C for a matrix multiplication program running on the same processor - why?

I was running n*n matrix multiplication code in C and FORTRAN on a Xeon processor system, and I was surprised by the difference in real execution time between the two approaches. Why did the FORTRAN code give me a faster execution time? I was using dgemm() and called the same function from my C code. I tried the general C code with different loop orders and different optimization flags, but I couldn't reach the performance obtained with dgemm().

The dgemm() version (C driver calling the Fortran BLAS routine):

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <math.h>
#include <stdlib.h>

long long readTSC(void)
{
  /* read the time stamp counter on Intel x86 chips */
  union { long long complete; unsigned int part[2]; } ticks;
  __asm__ ("rdtsc; mov %%eax,%0; mov %%edx,%1"
        : "=mr" (ticks.part[0]),
          "=mr" (ticks.part[1])
        : /* no inputs */
        : "eax", "edx");
  return ticks.complete;
}

double gtod(void)
{
  static struct timeval tv;
  static struct timezone tz;
  gettimeofday(&tv, &tz);
  return tv.tv_sec + 1.e-6 * tv.tv_usec;
}

void dgemm (char *transa, char *transb, int *x, int *xa, int *xb, double *alphaa, double *ma, int *xc, double *mb, int *xd, double *betaa, double *msum, int *xe);
int main(int argc, char **argv)
{
  int n = atoi(argv[1]);
  long long tm;

  // disabling transpose, disabling addition operation in C := alpha*op(A)*op(B) + beta*C
  char trans = 'N';
  double alpha = 1.0;
  double beta = 0.0;

  long long int p = 2LL * n * n * n;   // flop count; 2LL avoids int overflow for large n
  double *a, *b, *sum;
  double t_real, t;
  int i, j;

  // memory allocation
  a = (double *)malloc(n * n * sizeof(double));
  b = (double *)malloc(n * n * sizeof(double));
  sum = (double *)malloc(n * n * sizeof(double));

  // matrix initialization (column-major: element (i,j) at index i + n*j)
  for (i = 0; i < n; i++)
  {
    for (j = 0; j < n; j++)
    {
      a[i + n*j] = (double)rand();
      b[i + n*j] = (double)rand();
      sum[i + n*j] = 0.0;
    }
  }

  // wall-clock time via gtod(), clock cycles via readTSC()
  t = gtod();
  tm = readTSC();

  // dgemm function call
  dgemm(&trans, &trans, &n, &n, &n, &alpha, a, &n, b, &n, &beta, sum, &n);

  tm = readTSC() - tm;
  t_real = gtod() - t;
  printf("time: %f s, cycles: %lld, GFLOPS: %f\n", t_real, tm, p / t_real / 1.e9);

  free(a);
  free(b);
  free(sum);
  return 0;
}

The plain C code simply sets sum to 0 and then runs:

for (i = 0; i < n; i++)
{
  for (k = 0; k < n; k++)
  {
    for (j = 0; j < n; j++)
    {
      sum[i + n*j] += a[i + n*k] * b[k + n*j];
    }
  }
}
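The arrays here are stored column-major (element (i,j) lives at index i + n*j), so the loop order determines the memory stride. A sketch of a cache-friendlier ordering for this layout, with j outermost and i innermost so that every inner-loop access is unit-stride (the function name is mine, for illustration):

```c
/* Naive multiply in j-k-i order: for column-major storage the
 * innermost loop over i walks sum[] and a[] with stride 1, and
 * b[k + n*j] is invariant, so it can be hoisted into a register. */
void matmul_jki(int n, const double *a, const double *b, double *sum)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double bkj = b[k + n*j];   /* loop-invariant load */
            for (int i = 0; i < n; i++)
                sum[i + n*j] += a[i + n*k] * bkj;
        }
}
```

This is still an O(n^3) triple loop with no blocking, so it will not match dgemm, but it usually beats orderings whose inner loop strides by n.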

Compilation:

  • icc -o executable program.c for the plain C code

  • icc -o executable program.c -mkl=sequential for the dgemm() version

Performance

With 5000×5000 matrices, I got 4.2 GFLOPS with my own code and 21.7 GFLOPS using dgemm().

You still do not show enough for a definitive answer. Notably, in any question about performance, when you say that something is faster, you should show the actual measurements you did and the commands you used to compile the executables.

Anyway, some conclusions can be made.

  1. You appear not to use any optimization (the -O or -fast flags). Any performance analysis is then essentially pointless.

  2. From the source code you showed, it is clear that you do not compare the same thing at all; you are comparing two different algorithms. There is absolutely no point comparing the speed of two different algorithms. gemm does not contain such simple loops as you use in your own code; it is much more complicated, mainly for optimal cache utilization.

  3. You use a very naive way of multiplying the matrices in your own C code. The fact that you are now (according to one of your comments) faster than gemm is actually very worrying. Are you sure you used large enough matrices? There is no point calling gemm on 10x10 matrices; they should have some substantial size. gemm should be much faster than the naive loops for sufficiently large matrices. The original figures of 4.2 and 22 GFLOPS sound reasonable if you do not use any compiler optimization for your own function.

  4. You claim you are comparing with Fortran. This is NOT true. Only the reference BLAS implementation is written in Fortran, and it is not used for serious computation where a fast BLAS is actually needed. MKL, which you appear to be using, is not written in Fortran; it is highly optimized assembly code. Other BLAS implementations exist (ATLAS, GotoBLAS, OpenBLAS), and they are normally not written in Fortran either, but in C or assembly.
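Point 2 above can be made concrete. Optimized GEMMs tile the multiplication so that each sub-problem's working set fits in cache, on top of register blocking, vectorization and prefetching. A minimal, untuned sketch of the cache-blocking idea alone, assuming the question's column-major layout; the block size BS and the function name are illustrative:

```c
#define BS 64   /* illustrative block size; real libraries tune this per cache level */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Cache-blocked multiply, column-major (element (i,j) at i + n*j).
 * Each (jb,kb,ib) block touches sub-matrices small enough to stay in
 * cache, which is the central idea behind optimized GEMM kernels. */
void matmul_blocked(int n, const double *a, const double *b, double *sum)
{
    for (int jb = 0; jb < n; jb += BS)
        for (int kb = 0; kb < n; kb += BS)
            for (int ib = 0; ib < n; ib += BS)
                for (int j = jb; j < min_int(jb + BS, n); j++)
                    for (int k = kb; k < min_int(kb + BS, n); k++) {
                        double bkj = b[k + n*j];   /* loop-invariant load */
                        for (int i = ib; i < min_int(ib + BS, n); i++)
                            sum[i + n*j] += a[i + n*k] * bkj;
                    }
}
```

Even this simple version narrows the gap to a tuned library; the remaining difference comes from vector kernels and register blocking that are well beyond a few C loops.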

Just a guess, since the OP is not showing any code. If he is calling dgemm (from LAPACK/BLAS), it is probably written in Fortran.

Pointer aliasing rules are different in C and in Fortran.

You might use (with care!) the restrict keyword when declaring formal parameters in your C routines. This should help.
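As a sketch of that suggestion (C99 or later; the function name is hypothetical): restrict promises the compiler that the output array does not alias the inputs, which is what a Fortran compiler may already assume about dummy array arguments.

```c
/* With restrict the compiler may assume sum, a and b do not overlap,
 * so it can keep values in registers and vectorize more freely.
 * Calling this with overlapping pointers is undefined behavior. */
void matmul_r(int n, const double *restrict a,
              const double *restrict b, double *restrict sum)
{
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                sum[i + n*j] += a[i + n*k] * b[k + n*j];
}
```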

Also, arithmetic is different in C and in Fortran. In some dialects of C (e.g. C89), every floating point operation is computed on double-precision numbers; IIRC it is defined differently in Fortran, and it changed between C89 and C99 (and perhaps again in C11).
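The evaluation-width difference can be checked on a given compiler: C99's FLT_EVAL_METHOD macro in <float.h> reports whether floating expressions are evaluated at their own type, at double, or at long double. A small sketch; note that assigning to a variable always rounds back to the declared type, whatever the intermediate width:

```c
#include <float.h>

/* FLT_EVAL_METHOD (C99): 0 = each operation at its own type,
 * 1 = float and double operations evaluated at double (what C89
 * compilers traditionally did), 2 = long double (classic x87),
 * -1 = indeterminate. */
int eval_method(void)
{
    return FLT_EVAL_METHOD;
}

/* A stored float is rounded to 24-bit precision no matter how wide
 * the intermediate evaluation was, so it never equals the double
 * value of the same expression. */
double third_as_float(void)
{
    float f = 1.0f / 3.0f;   /* assignment rounds to float */
    return (double)f;
}
```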

If your two codes are compiled by a recent GCC (i.e. using gcc -O2 foo.c for C code and gfortran -O2 foo.f90 for Fortran 90 code), both compilers produce a similar internal representation (Gimple; you can get it with -fdump-tree-ssa or many other -fdump flags, which produce hundreds of dump files...) and then optimize it. So in that case the compiler backends are the same and the middle end is quite similar, but the front ends are really different.

You could simply look at the assembly code (using gcc -S -O2 -fverbose-asm and gfortran -S -O2 -fverbose-asm) and spot the differences.

You might use additional options like -ffast-math (which enables the compiler to optimize even against the standard) or -mtune=native (which asks the GCC compiler to optimize for your particular processor) in addition to the -O2 or -O3 optimization flags...
