This is the development platform that we use in this book.
Some features of Juno:
- 2 fast Cortex-A57 cores (1.1 GHz)
- 4 Cortex-A53 cores (850 MHz)
- 4 Mali shader cores (500 MHz)
- 8 GB of DDR3 RAM
- 1 Gbit Ethernet connection
In our case we are interested in the Mali-T624.
It is a GPU composed of 4 cores (compute units) running at 500 MHz.
No wavefronts (so no divergence)
The Mali-T600, -T700 and -T800 series of GPUs are not wavefront based. Each thread has its own program counter, so threads are entirely independent of each other and a divergent branch in one thread does not stall its neighbours. In other words, we really have 4 independent cores.
On the Juno platform the GPU shares its memory with the CPU (8 GB of DDR3), so memory transfers are faster: in theory a transfer costs about as much as a memcpy plus the time to invalidate the CPU cache.
Steps to use OpenCL on Mali
Basically you need two things:
- The Mali OpenCL SDK installed
- A rootfs with the proper Mali drivers (if you do not have them, get the driver here)
Device Info on Juno (Mali-T624)
If we compile and run our queryHost example on the Juno platform with the Mali drivers installed, this is what we get:
    root@genericarmv8:~/work# ./queryHost
    Number of OpenCL platforms found: 1
    CL_PLATFORM_PROFILE: FULL_PROFILE
    CL_PLATFORM_VERSION: OpenCL 1.2 v1.r10p0-00rel0.83e65da3dbe0d5979ba9881967b24b6f
    CL_PLATFORM_NAME: ARM Platform
    CL_PLATFORM_VENDOR: ARM
    CL_PLATFORM_EXTENSIONS: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
    Number of detected OpenCL devices: 1
    GPU detected
    Device name is Mali-T624
    Device vendor is ARM
    VENDOR ID: 0x6200010
    Device max memory allocation: 2009 mega-bytes
    Device global cacheline size: 64 bytes
    Device global mem: 8038 mega-bytes
    Maximum number of parallel compute units: 4
    Maximum dimensions for global/local work-item IDs: 3
    Maximum number of work-items in each dimension: ( 256 256 256 )
    Maximum number of work-items in a work-group: 256
Mali OpenCL execution model
- Every Mali-T600 thread has its own independent program counter (warp size of 1)
- OpenCL barrier operations (which synchronise threads) are handled by the hardware
- For full efficiency you need more work-groups than cores
- When running on Mali just use global memory
- Mali prefers explicit vector functions
- All CL memory buffers reside in global memory that is accessible by both CPU and GPU cores.
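The point about needing more work-groups than cores can be checked before launching a kernel. Below is a minimal host-side sketch in plain C, not part of the Mali SDK: the helper names (`round_up_global_size`, `work_group_count`, `keeps_all_cores_busy`) are hypothetical, and it assumes a 1-D launch where the global size must be a multiple of the local size, as core OpenCL 1.2 requires.

```c
#include <assert.h>
#include <stddef.h>

/* Round the number of items up to a multiple of the work-group size.
   Hypothetical helper: extra work-items must be guarded inside the kernel. */
size_t round_up_global_size(size_t n_items, size_t local_size)
{
    assert(local_size > 0);
    return ((n_items + local_size - 1) / local_size) * local_size;
}

/* Number of work-groups created by a 1-D launch. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Returns 1 when the launch creates more work-groups than compute units,
   0 otherwise. On Juno the device query reports 4 compute units. */
int keeps_all_cores_busy(size_t global_size, size_t local_size,
                         size_t compute_units)
{
    return work_group_count(global_size, local_size) > compute_units;
}
```

For example, 1000 items with a work-group size of 64 round up to a global size of 1024, which gives 16 work-groups for the 4 cores. With a work-group size of 256 the same launch gives only 4 work-groups, which is not more than the number of cores.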
Inside a Mali Core
When we use a barrier, the thread enters the texturing pipeline and takes many more cycles to complete. That is why it is preferable to use atomic functions for synchronization.
Each ALU can perform 17 floating-point operations per cycle, and we have one ALU per core.
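Taking the figures quoted in this chapter at face value (4 cores, one ALU per core, 17 floating-point operations per cycle, 500 MHz), we can estimate a theoretical peak. The function below is only an illustrative sketch of that arithmetic; `peak_gflops` is a hypothetical name, and real kernels will reach only a fraction of this number.

```c
#include <assert.h>

/* Theoretical peak in GFLOP/s:
   cores * ALUs per core * FLOPs per cycle * clock in MHz / 1000. */
double peak_gflops(unsigned cores, unsigned alus_per_core,
                   unsigned flops_per_cycle, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cores * alus_per_core * flops_per_cycle * clock_mhz / 1000.0;
}
```

With the numbers above, `peak_gflops(4, 1, 17, 500.0)` works out to 4 x 1 x 17 x 0.5 GHz = 34 GFLOP/s.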
Use vector operations
This is the first thing to try when improving performance.
So basically we now launch 4x fewer threads (globalsize/4), with fewer global memory accesses and more operations per thread. (This actually hurt performance on NVIDIA.)
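As an illustration of the vectorised approach, here is a minimal sketch: a hypothetical `vec_add4` kernel (kept as an OpenCL C source string, the way it would be passed to `clCreateProgramWithSource`) in which each work-item processes four floats, plus a host helper that computes the reduced global size. The kernel and helper names are assumptions for this example, and the sketch assumes the element count is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source of a hypothetical vector-add kernel.
   Each work-item loads, adds and stores one float4 (four elements). */
const char vec_add4_src[] =
    "__kernel void vec_add4(__global const float *a,\n"
    "                       __global const float *b,\n"
    "                       __global float *c)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    float4 va = vload4(i, a);   // reads a[4*i] .. a[4*i+3]\n"
    "    float4 vb = vload4(i, b);   // reads b[4*i] .. b[4*i+3]\n"
    "    vstore4(va + vb, i, c);     // writes c[4*i] .. c[4*i+3]\n"
    "}\n";

/* Global size for the vectorised launch: one work-item per 4 elements. */
size_t vectorised_global_size(size_t n_elements)
{
    assert(n_elements % 4 == 0); /* this sketch does not handle a tail */
    return n_elements / 4;
}
```

For 1024 elements the scalar version would launch 1024 work-items, while this version launches `vectorised_global_size(1024)`, that is 256 work-items.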
Max number of work-items
The device query above reports a maximum of 256 work-items ("threads") per work-group on the Mali-T624.
On the current Mali architecture the cores do not have dedicated local memory, so there is no real advantage in using local-memory optimizations.
ARM advises avoiding barriers on Mali GPUs, so we should prefer kernels without this type of synchronization because it hurts performance. Try to use atomic functions instead.
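One way to follow this advice is to let every work-item update a shared result with an atomic function instead of combining partial results behind a barrier. The sketch below is only an assumption of how this could look: `count_above` is a hypothetical kernel that counts the elements above a threshold with `atomic_inc`, and `count_above_reference` is a plain C reference that can be used on the host to check the GPU result. The host is assumed to set the counter to zero before the launch.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source of a hypothetical barrier-free counting kernel.
   Every work-item tests one element and updates the counter atomically. */
const char count_above_src[] =
    "__kernel void count_above(__global const float *data,\n"
    "                          const float threshold,\n"
    "                          volatile __global int *counter)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    if (data[i] > threshold)\n"
    "        atomic_inc(counter);  // no barrier needed\n"
    "}\n";

/* Host-side reference: counts the elements strictly above the threshold. */
int count_above_reference(const float *data, size_t n, float threshold)
{
    int count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] > threshold)
            ++count;
    }
    return count;
}
```

Comparing the value read back from `counter` with `count_above_reference` on the same input is a simple way to validate the kernel.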
Bridge different devices
Observe that even though Juno has multiple ARM cores, they are not available to the OpenCL platform. In this case we still need to cross-compile an OpenCL driver for the ARM cores using the POCL project. Then you also need an OpenCL Installable Client Driver (ICD) Loader to bridge different devices on the same platform. Some instructions to build the OpenCL ICD can be found here.