# Introduction

In this chapter we are going to see how we can accelerate some deep learning operations using the Mali GPU on the Juno platform (2x 1.1 GHz CPU cores). If you are not familiar with deep learning concepts, you may refer to another book on the subject.

## Use case: Residual network

Our problem is to improve the performance of a residual network whose input is a 120x120x3 RGB image. Note that in our case we don't use batch-norm blocks, and we do 52 convolutions, one max-pooling, and one inner-product operation.

### Time distribution

In deep learning models, most of the time is normally spent on convolutions, so they are the first target we want to accelerate.

We originally generated C code from MATLAB with no compiler optimization (-O0); the forward propagation takes 16 seconds to compute. Using compiler optimization (-O3) and asking MATLAB to prioritize code efficiency, this time goes down to 4 seconds. That is a nice result for zero effort, but still too slow.

No profiling and no optimization (-O0):

```
# time ./resnet500
** created resnet500.mat **
real 0m15.894s
user 0m15.660s
sys 0m0.230s
```

With optimization (-O3):

```
# time ./resnet500_O3
** created resnet500.mat **
real 0m3.993s
user 0m3.710s
sys 0m0.280s
```

Full debug and profiling (-O0):

```
# time ./resnet500_O0_Profiling
** created resnet500.mat **
real 0m17.551s
user 0m17.270s
sys 0m0.270s
```
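For reference, the three binaries above could be built along these lines. This is a sketch assuming gcc and a single generated source file; the real MATLAB Coder output has more files. Note that gprof requires the -pg flag:

```
# gcc -O0 resnet500.c -o resnet500 -lm
# gcc -O3 resnet500.c -o resnet500_O3 -lm
# gcc -O0 -g -pg resnet500.c -o resnet500_O0_Profiling -lm
```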

### Running gprof

Now let's take a closer look at the profile (gprof). The first thing we can observe is that convolutions take about 90% of the computation time. The function **resnet500_step** calls all the other functions, but spends most of its own time loading the model parameters. Our worst case was the function **resnet500_forward_conv_gp**.

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 19.17      3.29     3.29        6     0.55     0.55  resnet500_forward_conv_gp
 11.89      5.33     2.04        4     0.51     0.52  resnet500_forward_conv_h
 10.14      7.07     1.74        3     0.58     0.58  resnet500_forward_conv_k
  9.03      8.62     1.55        3     0.52     0.54  resnet500_forward_conv
  7.46      9.90     1.28        6     0.21     0.22  resnet500_forwardConvolution_a
  7.11     11.12     1.22        1     1.22    17.16  resnet500_step
  6.12     12.17     1.05        5     0.21     0.21  resnet500_forward_conv_d
  5.01     13.03     0.86        4     0.22     0.23  resnet500_forwardConvolution
  4.55     13.81     0.78        4     0.20     0.20  resnet500_forwardConvolution_g
  3.67     14.44     0.63        3     0.21     0.21  resnet500_forwardConvolution_gk
  3.55     15.05     0.61        1     0.61     0.65  resnet500_forward_conv_h0
  3.26     15.61     0.56        3     0.19     0.19  resnet500_forward_conv_l
  2.56     16.05     0.44        2     0.22     0.22  resnet500_forward_conv_lk
  2.16     16.42     0.37        2     0.19     0.20  resnet500_forward_conv_g
  0.64     16.53     0.11        1     0.11     0.11  resnet500_forward_conv_fq
  0.58     16.63     0.10        1     0.10     0.10  resnet500_forward_conv_ki
  0.58     16.73     0.10        1     0.10     0.10  resnet500_forward_conv_f
  0.29     16.78     0.05        3     0.02     0.02  resnet500_im2col_ref_p
  0.29     16.83     0.05        1     0.05     0.05  resnet500_forward_conv_j
```

#### Locating it in the model

Now let's locate where this function lives in our model. In our case the model is in Simulink, so it is easy to find: the generated source code points back to the original model.

### Checking with Valgrind

Here we run Valgrind (callgrind) on the host machine. It shows the same hotspot as the version running on the Juno board: **resnet500_forward_conv_gp**.
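For reference, a typical callgrind run on the host looks like this (using a binary built like the profiling one above; the output file name includes the process id):

```
# valgrind --tool=callgrind ./resnet500_O0_Profiling
# callgrind_annotate callgrind.out.<pid>
```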

In the source code we can see that the matrix multiplication part is indeed taking most of the execution time.
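Such a naive matrix multiplication boils down to three nested loops. Here is a minimal sketch matching the matrix_2d_mul_float signature that shows up in the profile of the next section (the generated code differs in details):

```c
// Naive row-major matrix multiplication: C[M x N] = A[M x K] * B[K x N].
// One scalar multiply-accumulate per inner iteration, no blocking or SIMD.
void matrix_2d_mul_float(float *A, float *B, float *C,
                         int num_rows_A, int num_cols_A, int num_cols_B) {
    for (int i = 0; i < num_rows_A; i++) {
        for (int j = 0; j < num_cols_B; j++) {
            float sum = 0.0f;
            for (int k = 0; k < num_cols_A; k++) {
                sum += A[i * num_cols_A + k] * B[k * num_cols_B + j];
            }
            C[i * num_cols_B + j] = sum;
        }
    }
}
```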

### First matrix multiplication

In our network the first matrix multiplication is A[64x147] * B[147x3600] (with im2col, 147 corresponds to the 7x7x3 kernel weights of the first convolution and 3600 to the 60x60 output positions). Below is the profiler output for a naive implementation running on the Juno CPU.

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.16      0.96     0.96        1   961.87   961.87  matrix_2d_mul_float(float*, float*, float*, int, int, int)
  1.03      0.97     0.01                             main
  0.00      0.97     0.00        2     0.00     0.00  fillRand(float*, int, int, int)
  0.00      0.97     0.00        1     0.00     0.00  _GLOBAL__sub_I_num_rows_A
  0.00      0.97     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
```

This means we spend 961.87 milliseconds on this single multiplication. If we measure the same operation with our naive OpenCL implementation, the time goes down to 41.4 milliseconds, roughly 23x faster.
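Our naive OpenCL version maps one work-item to each element of the output matrix C. A minimal sketch of such a kernel (names are illustrative, not the exact code we ran):

```c
// Naive OpenCL kernel: one work-item computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major.
__kernel void matrix_2d_mul(__global const float *A,
                            __global const float *B,
                            __global float *C,
                            int K, int N) {
    int row = get_global_id(0);   // index into the M dimension
    int col = get_global_id(1);   // index into the N dimension
    float sum = 0.0f;
    for (int k = 0; k < K; k++) {
        sum += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
```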

### Other matrix multiplication

From the profiler output we know that the worst-case convolution is **resnet500_forward_conv_gp**. It involves an im2col operation and a matrix multiplication A[256x2304] * B[2304x64].
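For reference, im2col unrolls each receptive field of the input into one column, so the convolution becomes a single large matrix multiplication (presumably the role of resnet500_im2col_ref_p in the profile above). A minimal sketch for the stride-1, no-padding case:

```c
// Minimal im2col for a square input with C channels, K x K kernel,
// stride 1 and no padding. Each output column holds one receptive field,
// so the convolution becomes a single matrix multiplication.
void im2col_simple(const float *in, int channels, int height, int width,
                   int ksize, float *out) {
    int out_h = height - ksize + 1;
    int out_w = width - ksize + 1;
    int col = 0;
    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            int row = 0;
            for (int c = 0; c < channels; c++)
                for (int ky = 0; ky < ksize; ky++)
                    for (int kx = 0; kx < ksize; kx++) {
                        out[row * (out_h * out_w) + col] =
                            in[(c * height + (y + ky)) * width + (x + kx)];
                        row++;
                    }
            col++;
        }
    }
}
```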

On the GPU, the matrix multiplication alone takes about 46 ms, which is already 12x faster than the pure CPU version.

```
Multiplying 2 matrices A[256,2304] * B[2304,64]
Size in bytes A: 2359296
Size in bytes B: 589824
Size in bytes C: 65536
Initializing OpenCL device...
Compiling OpenCL kernel...
Global size[256, 64]
Matrix multiplication done 0
Matrix multiplication done 1
Matrix multiplication done 2
Matrix multiplication done 3
Matrix multiplication done 4
Matrix multiplication done 5
Matrix multiplication done 6
Matrix multiplication done 7
Matrix multiplication done 8
Matrix multiplication done 9
Kernel Execution time = 45.749 ms
```
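The kernel execution time printed above can be obtained with OpenCL profiling events. A sketch of just the timing part, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that queue, kernel and the buffers were already set up:

```c
// Measure kernel execution time with an OpenCL profiling event.
// Assumes: queue created with CL_QUEUE_PROFILING_ENABLE, kernel args set.
cl_event evt;
size_t global[2] = {256, 64};   // one work-item per element of C
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start, end;            // timestamps in nanoseconds
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("Kernel Execution time = %.3f ms\n", (end - start) / 1.0e6);
clReleaseEvent(evt);
```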

## What to do now

We now know that our hotspot is on the convolution side, so we are going to create a simpler model containing only the worst-case convolution (**resnet500_forward_conv_gp**). From there we will look at where to improve.