Introduction

To start our engines we look for the first problem, the 2d Matrix multiplication. We will start with a short introduction to the problem and how to solve it sequentially with C.

Problem

So we need to multiply 2 matrix A (3x3), and B(3x2). The first thing that we need to verify is that the numbers of columns of A must match with the number of rows of B. The output matrix C will be (MxN) where M is the rows of A, and N the columns of B.

So for each row of A we multiply and add each row element of A with each column element of B. The result is the sum of those elements. Describing this on C:

void matrix_2d_mul(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
  float sum = 0;
  int num_rows_C = num_rows_A;
  int num_cols_C = num_cols_B;
  // Iterate on each row of A
  for(int i=0; i<num_rows_A; i++) {
    // Iterate on each collumn of B
    for (int k=0; k<num_cols_B; k++) {
      sum = 0;
      // Do the "multiply add between" row of A and collumn of B
      for (int j=0; j<num_cols_A; j++){
        // A[i][j] == A[i*num_cols_A+j]
        // B[j][k] == B[j*num_cols_B+k]
        //sum += A[i][j]*B[j][k];
        sum += A[i*num_cols_A+j]*B[j*num_cols_B+k];
      }
      // C[i][k] == C[i*num_cols_C+k]
      C[i*num_cols_C+k]=sum;
    }
  }
}

So for each element of C we iterate a complete row of A, a complete column of B, multiplying each element and summing them.

Just to remember on C/C++ matrices are stored on the Row-major order on memory, this contrast with other languages like Matlab/Fortran. Pay attention to this detail because you could also get performance penalties due to cache miss. In the end there is a reason why some languages choose column major order. (Memory Coalescing)

Our complete program output takes 140 ms to compute

leo@leodev:~/work/learningOpenCl/chap1$ time ./seq_matrix_mul 
Size in bytes A: 36
Size in bytes B: 24

Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  24.00 |

Calculated result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  34.00 |


real    0m0.143s
user    0m0.140s
sys    0m0.000s

Our Main loop

Our program is quite simple it's main loop basically copy some inputs (A,B) from somewhere to memory and then calculate the matrix multiplication.

  // ****************** Main loop *******************
  for (int idxLoop=1; idxLoop < 1000000; idxLoop++) {
    // Get input to some memory
    memcpy(A,A_ref,numBytesA);
    memcpy(B,B_ref,numBytesB);

    // Call our sequential algorithm
    matrix_2d_mul_float(A,B,C,num_rows_A,num_cols_A,num_cols_B);
  }

Profiling sequential algorithm

So considering that your sequential algorithm works we should start profiling your code we want to know the following things:

  • How much % of time your program takes executing this function
  • How much time in seconds does your function takes.

Using Calgrind

The first question can be solved using a tool called Callgrind. Here we show the instructions to use it

# Compile your program
g++ -g seq_matrix_mul.c -o seq_matrix_mul
# Execute and get profile information
valgrind --tool=callgrind ./seq_matrix_mul
# Display
kcachegrind callgrind.out.26375

Here is the result related to the source code.

Here is a map relating all your function relative times.

Basically we can now have the following assumptions:

  • Our matrix_2d_mul is called 999999 and it's responsible for 91.29% of the total program time
  • Inside our loop the instructions related to memory movement take 5% of the time

It's a very good indicator that our function need to be accelerated, but it's possible to be done on the GPU?

Using Gprof

Gprof it's an old tool used to measure the time in seconds spent in some functions. In order to use the gprof you need to recompile your program with the -pg option:

# Compile your program
g++ -pg seq_matrix_mul.c -o seq_matrix_mul

# Execute program to generate the gmon.out file
./seq_matrix_mul

# Create a text file
gprof ./seq_matrix_mul gmon.out > timeProf.txt

Your output would be something like this:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 92.11      0.11     0.11   999999   110.53   110.53  matrix_2d_mul_float(float*, float*, float*, int, int, int)
  8.37      0.12     0.01                             main
  0.00      0.12     0.00        2     0.00     0.00  displayMatrix2d(float*, int, int)
  0.00      0.12     0.00        1     0.00     0.00  _GLOBAL__sub_I_num_rows_A
  0.00      0.12     0.00        1     0.00     0.00  displayVec1d(float*, int, char*)
  0.00      0.12     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

Here we are interested on the time spent on the matrix_2d_mul function which is approximately 110.53 nanoseconds.

Now this give us a good parameter on how to speed-up this function and also some important constraints.

Using perf

For more accurate results we can use the linux tool called perf. This tool collect events using the kernel and hardware timers, so it's more precise than gprof.

leo@leodev:~/work/learningOpenCl/chap1$ perf stat ./seq_matrix_mul
Size in bytes A: 36
Size in bytes B: 24

Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  24.00 |

Calculated result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  34.00 |


 Performance counter stats for './seq_matrix_mul':

        141.792356      task-clock (msec)         #    0.989 CPUs utilized          
                42      context-switches          #    0.296 K/sec                  
                 4      cpu-migrations            #    0.028 K/sec                  
                55      page-faults               #    0.388 K/sec                  
       247,027,242      cycles                    #    1.742 GHz                    
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle   
       750,140,592      instructions              #    3.04  insns per cycle        
        63,222,441      branches                  #  445.880 M/sec                  
             8,935      branch-misses             #    0.01% of all branches        

       0.143309844 seconds time elapsed

This means basically that our program executed in 141ms which means 247,027,242 cycles at 1.742 GHz. Now let's get some information of the hot-spots. For that we must first record events.

sudo perf record -g ./seq_matrix_mul
sudo perf report -g

This basically give the same information as Callgrind but you have a more accurate information due to the execution time and number of cycles.

Can we speed-up on the GPU?

So we mention on the previous chapter that even if the GPU takes 0 seconds to calculate the matrix multiplication we cannot run away from the CPU/GPU transfer latency. Now we know that the transfer time of the matrix A and B to the GPU plus the transfer time of the result matrix back to the CPU must be less than 110.53 nanoseconds.

On this particular problem Matrix A has 36 bytes, matrix B has 24 bytes and the result matrix 24 bytes. So we need to calculate the time to send 60 bytes and receive 24 bytes.

So consider the following CUDA program (I wrote in Cuda because it's easier to profile on my Nvidia GPU and also because it's less code compared to OpenCL, but they should be the same)

int main()
{
    const unsigned int BytesToSend=60;
    const unsigned int BytesToReceive=24;

    // Alocate memory on CPU
    char *hostArray= (char*)malloc(BytesToSend);
    char *deviceArray;

    // Allocate memory on GPU
    cudaMalloc((char**)&deviceArray,BytesToSend);
    memset(hostArray,0,BytesToSend);

    // Transfer hostArray from CPU to GPU
    cudaMemcpy(deviceArray,hostArray,BytesToSend,cudaMemcpyHostToDevice);
    // Get hostArray from GPU to CPU
    cudaMemcpy(hostArray,deviceArray,BytesToReceive,cudaMemcpyDeviceToHost);

    // Release memory from GPU
    cudaFree(deviceArray);
}

Now when using the NVidia profiling tools nvprof and nvvp we have the following results.

==18788== NVPROF is profiling process 18788, command: ./checkTransferMatMul
==18788== Profiling application: ./checkTransferMatMul
==18788== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 64.08%  2.1120us         1  2.1120us  2.1120us  2.1120us  [CUDA memcpy DtoH]
 35.92%  1.1840us         1  1.1840us  1.1840us  1.1840us  [CUDA memcpy HtoD]

==18788== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.74%  85.226ms         1  85.226ms  85.226ms  85.226ms  cudaMalloc
  0.92%  795.35us        83  9.5820us     838ns  345.44us  cuDeviceGetAttribute
  0.15%  132.42us         1  132.42us  132.42us  132.42us  cudaFree
  0.06%  54.336us         1  54.336us  54.336us  54.336us  cuDeviceTotalMem
  0.06%  52.521us         2  26.260us  23.607us  28.914us  cudaMemcpy
  0.05%  44.629us         1  44.629us  44.629us  44.629us  cuDeviceGetName
  0.00%  2.5850us         2  1.2920us     978ns  1.6070us  cuDeviceGetCount
  0.00%  2.0260us         2  1.0130us     838ns  1.1880us  cuDeviceGet

Here we have the following information

  • It takes 2100 nanoseconds to send data
  • It takes 1184 nanoseconds to receive data
  • Our cudaMemcpy function takes on total 52521 nanoseconds to complete

This means that even if we take 0 seconds to compute our speed up would be $$speedUp=\frac{110}{5221}=0.002$$, which means much slower.

So this means that if we cannot play with other variables, like latency where we could give the results some time later to improve the total throughput, we cannot use the GPU.

Can we speed-up on the CPU?

Another type of speed-up can be achieved with the use of the available cores on the CPU itself, there we have the advantage that the memory latency will be smaller.

Let's change the matrix multiplication function by adding a OpenMP compiler directive #pragma parallel for. This will instruct the compiler to run all the rows in parallel

void matrix_2d_mul_float(float *A, float *B, float *C, int num_rows_A, int num_cols_A, int num_cols_B) {
  float sum = 0;
  int num_rows_C = num_rows_A;
  int num_cols_C = num_cols_B;
  // Iterate on each row of A
  #pragma omp parallel for
  for(int i=0; i<num_rows_A; i++) {
    // Iterate on each collumn of B
    for (int k=0; k<num_cols_B; k++) {
      sum = 0;
      // Do the "multiply add between" row of A and collumn of B
      for (int j=0; j<num_cols_A; j++){
        // A[i][j] == A[i*num_cols_A+j]
        // B[j][k] == B[j*num_cols_B+k]
        //sum += A[i][j]*B[j][k];
        sum += A[i*num_cols_A+j]*B[j*num_cols_B+k];
      }
      // C[i][k] == C[i*num_cols_C+k]
      C[i*num_cols_C+k]=sum;
    }
  }
}

You just need now to add openmp on the compilation.

g++ -g -O1 -fopenmp seq_matrix_mul.c -o seq_matrix_mul

Now running the again the perf we have the following results

leo@leodev:~/work/learningOpenCl/chap1$ sudo perf stat ./seq_matrix_mul 
Size in bytes A: 36
Size in bytes B: 24

Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  24.00 |

Calculated result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  34.00 |


 Performance counter stats for './seq_matrix_mul':

      20012.558926      task-clock (msec)         #    7.898 CPUs utilized          
             4,777      context-switches          #    0.239 K/sec                  
                37      cpu-migrations            #    0.002 K/sec                  
                84      page-faults               #    0.004 K/sec                  
    58,413,342,405      cycles                    #    2.919 GHz                    
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle   
    14,470,208,799      instructions              #    0.25  insns per cycle        
     3,757,468,603      branches                  #  187.756 M/sec                  
        17,796,067      branch-misses             #    0.47% of all branches        

       2.534001217 seconds time elapsed

Ok now the total time increased to 2.5 seconds which is much worse. The issue here is that we're wasting a lot of time synchronizing the cores, which is again a inevitable sequential part.

What to do on this case?

If you know for sure that the sequential algorithm is already the most smart to do the job you can still play with the compiler optimization.

leo@leodev:~/work/learningOpenCl/chap1$ g++ -g -O3 seq_matrix_mul.c -o seq_matrix_mul
leo@leodev:~/work/learningOpenCl/chap1$ perf stat ./seq_matrix_mul
Size in bytes A: 36
Size in bytes B: 24

Vector A size: 9 { 1.00, 2.00, 3.00, 4.00, 5.00, 6.00, 1.00, 3.00, 2.00,}
Reference result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  24.00 |

Calculated result for C

Matrix 3x2
| 30.00  33.00 |
| 87.00  93.00 |
| 35.00  34.00 |


 Performance counter stats for './seq_matrix_mul':

         60.987616      task-clock (msec)         #    0.984 CPUs utilized          
                24      context-switches          #    0.394 K/sec                  
                 4      cpu-migrations            #    0.066 K/sec                  
                56      page-faults               #    0.918 K/sec                  
        84,728,846      cycles                    #    1.389 GHz                    
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle   
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle   
       285,011,535      instructions              #    3.36  insns per cycle        
        51,199,227      branches                  #  839.502 M/sec                  
             8,497      branch-misses             #    0.02% of all branches        

       0.061996921 seconds time elapsed

So we improved from 140ms to 61ms, which gives an $$speedup=\frac{140}{61}=2.2x$$.

Take away for the chapter

From this chapter we saw that we could not Speed-up our matrix multiplication example. If you remember the first talk about Amdahl's law, you may remember that what's is happening here is that the work needed to enable parallelization, add some sequential time that cannot be solved.

  • If you don't have enough data to process it's not worth parallelization.
  • Check if it's possible to exchange latency by throughput.
  • This amount depend on your system characteristics.
  • Use profiling tools to help you decide, don't assume that you know before hand the amount of data necessary to start parallelism.
  • Also check compiler optimization.

results matching ""

    No results matching ""