BeagleBoard/GSoC/2023 Proposal/OpenGLES acceleration for DL

=Proposal for OpenGLES acceleration for DL=
 * Student: Pratham Deshmukh
 * Code  : darknet
 * Mentors: Shreyas Atre
 * Proposal: OpenGLES acceleration for DL
 * Wiki  : NA
 * GSoC  : Proposal Request

=Status= This project is currently just a proposal.

=Proposal=
 * Completed all the requirements listed on the ideas page.
 * Submitted a pull request for the cross-compilation task.

=About you=
 * IRC Nickname: Pratham
 * Github: Pratham Deshmukh
 * College: Veermata Jijabai Technological Institute
 * Country: India
 * Primary languages: English, Hindi, Marathi
 * Typical work hours: 9am to 5pm
 * Previous GSoC participation: This is my first time participating in GSoC.

=About your project= Project name: OpenGLES acceleration for DL

Overview
Deep Learning is a subset of Machine Learning that uses neural networks with multiple layers. A neural network consists of multiple layers of interconnected nodes, each building on the previous layer to refine and optimise the prediction.

The main goal of the project is to accelerate as many layer types as possible using GPU compute APIs such as OpenGLES and Vulkan, with Darknet as the deep learning framework.

Shaders
Shaders are user-defined programs that run on the GPU of the board. Using shaders for computation can yield significant speedups, as GPUs are designed to process large amounts of data in parallel. Of the various shader types available on the GPU, I will be using compute shaders. They can perform parallel computations such as matrix multiplication and convolution, which are used heavily in deep learning applications. Compute shaders are written in the GLSL programming language and are launched on the GPU with the glDispatchCompute function of the OpenGL API.
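To make the parallelism concrete, here is a CPU reference, in C, of the matrix multiplication a compute shader would perform: on the GPU, each (row, col) pair below would be one shader invocation rather than a loop iteration. This is an illustrative sketch, not Darknet or OpenGLES code; the function names and row-major layout are assumptions.

```c
#include <stddef.h>

/* CPU reference of the work a single compute-shader invocation performs:
 * each (row, col) work item computes one element of C = A * B.
 * Matrices are row-major; A is m x k, B is k x n. */
static float matmul_element(const float *A, const float *B,
                            size_t k, size_t n, size_t row, size_t col)
{
    float acc = 0.0f;
    for (size_t i = 0; i < k; ++i)
        acc += A[row * k + i] * B[i * n + col];
    return acc;
}

/* Host-side loop standing in for glDispatchCompute(m, n, 1):
 * on the GPU, every iteration below runs as a parallel invocation. */
static void matmul(const float *A, const float *B, float *C,
                   size_t m, size_t k, size_t n)
{
    for (size_t row = 0; row < m; ++row)
        for (size_t col = 0; col < n; ++col)
            C[row * n + col] = matmul_element(A, B, k, n, row, col);
}
```

The GLSL version replaces the two outer loops with gl_GlobalInvocationID, which is what makes the computation embarrassingly parallel.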

Darknet
Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports both CPU and GPU computation. In this project, Darknet is used to implement the YOLO object detection and recognition model.

You Only Look Once (YOLO) is a state-of-the-art, real-time object detection algorithm that uses a single neural network to detect objects. The YOLO model consists of multiple convolutional layers that extract features from the input image and several fully connected layers that produce the output of the model. These layers have many parameters that need to be optimised during training to achieve high accuracy in object detection and recognition.

In this project, Darknet is used as the deep learning framework to implement the YOLO model and optimise its performance.

YOLO Pipeline
Out of the various YOLO pipelines (YOLO, YOLOv2, YOLOv3, etc.), I will be adapting YOLOv3 in this project. YOLOv3 is extremely fast and accurate: in mAP measured at 0.5 IOU, YOLOv3 is on par with Focal Loss but about 4x faster. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model.

The YOLOv3 model consists of various layer types such as convolutional layers, route layers, up-sampling layers, region layers and maxpool layers. To accelerate the model, we will use the OpenGLES-enabled GPU on the target hardware platform to perform the computations these layers require; exploiting the GPU's parallel processing can greatly reduce the processing time.

Vulkan
Vulkan can be used for general-purpose computing, and provides features such as compute shaders, which can be used to perform complex computations in parallel on GPUs. The project will involve implementing and optimizing compute kernels for various layers of neural networks using Vulkan compute shaders. These kernels will include operations such as convolution, pooling and more. The Vulkan API will be used to manage resources such as buffers and images, as well as to schedule compute shader execution on the GPU.

This new interface describes what the application intends to do, which can lead to better performance and less surprising driver behaviour compared to existing APIs like OpenGLES. Vulkan is a newer API that provides more control and flexibility; it is designed to take advantage of modern GPU hardware and can provide better performance than OpenGLES in some cases.

One of the key features of Vulkan is that the compute shader stage is completely separated from the graphics part of the pipeline. With the compute stage detached from the graphics pipeline, we will be able to use it anywhere.
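The host-side flow for running such a standalone compute shader can be sketched as follows. This is pseudocode: the function names are real Vulkan entry points, but descriptor setup, synchronisation details and most parameters are elided.

```
// Pseudocode: typical Vulkan compute dispatch sequence (details elided)
pipeline = vkCreateComputePipelines(device, ...)   // from a SPIR-V compute shader
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline)
vkCmdBindDescriptorSets(cmd, ...)                  // bind input/output buffers
vkCmdDispatch(cmd, groupsX, groupsY, 1)            // launch the work groups
vkCmdPipelineBarrier(cmd, ...)                     // make results visible to the next stage
vkQueueSubmit(queue, ...)                          // execute on the GPU
```

Note that no render pass or framebuffer appears anywhere: this is what "detached from the graphics pipeline" buys us.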


 * Data type extensions
 * By default, 32-bit floating-point precision is used for both training and inference, which are essentially two ways of running a computational graph:
 * Training runs a forward pass and, most of the time, a backward pass to propagate the gradient back.
 * Inference only runs the forward pass.
 * Both can be done in lower-precision types for faster compute and reduced data storage.
 * The following extensions are available in Vulkan:
 * VK_KHR_shader_float16_int8
 * VK_KHR_8bit_storage / VK_KHR_16bit_storage
 * 8-bit integer data types are used for quantized neural nets.
 * FP16 data types can be used for faster math, with gradient rescaling during training.
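To illustrate what "quantized" means here, the following C sketch shows affine int8 quantization (scale plus zero point), the scheme commonly used by quantized networks. It is a CPU illustration only, not Darknet or Vulkan API code, and the function names are my own.

```c
#include <stdint.h>

/* Affine int8 quantization as used by many quantized networks:
 * q = round(x / scale) + zero_point, clamped to [-128, 127].
 * (Illustrative sketch; names are not from any real API.) */
static int8_t quantize_int8(float x, float scale, int zero_point)
{
    float r = x / scale;
    /* round-half-away-from-zero without pulling in libm */
    long q = (long)(r >= 0.0f ? r + 0.5f : r - 0.5f) + zero_point;
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}

static float dequantize_int8(int8_t q, float scale, int zero_point)
{
    return scale * (float)(q - zero_point);
}
```

Storing weights and activations as int8 cuts memory traffic by 4x versus FP32, which is exactly what the 8bit_storage extensions expose to shaders.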


 * Improved compute shaders
 * New extensions are being devised to improve efficiency:
 * VK_KHR_workgroup_memory_explicit_layout
 * Allows more efficient data loading into shared memory for use with efficient matrix multiplication operations.
 * VK_EXT_ML_primitives: exposes basic primitives used in mainstream neural nets as optimized building blocks.


 * Extension available:
 * VK_NV_cooperative_matrix
 * Accelerates large, low-precision matrix multiplies.
 * Exposes high-throughput matrix/vector multiplication units.
 * Typically used by convolution / matmul layers in FP16 formats.
 * Core compute function for deep learning.
 * The following code snippet illustrates how you might employ the extension.
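A minimal GLSL sketch of the extension is shown below. The types and functions (fcoopmatNV, coopMatLoadNV, coopMatMulAddNV, coopMatStoreNV) come from the GL_NV_cooperative_matrix shading-language extension; the buffer names, bindings and the single 16x16x16 tile are illustrative assumptions.

```glsl
#version 450
#extension GL_NV_cooperative_matrix : enable
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : enable

// Illustrative sketch: one FP16 16x16x16 tile multiply-accumulate.
layout(std430, binding = 0) readonly  buffer A { float16_t a[]; };
layout(std430, binding = 1) readonly  buffer B { float16_t b[]; };
layout(std430, binding = 2) writeonly buffer C { float16_t c[]; };

void main() {
    fcoopmatNV<16, gl_ScopeSubgroup, 16, 16> matA, matB, matC;
    coopMatLoadNV(matA, a, 0, 16, false);          // load a 16x16 tile of A
    coopMatLoadNV(matB, b, 0, 16, false);          // load a 16x16 tile of B
    matC = fcoopmatNV<16, gl_ScopeSubgroup, 16, 16>(0.0);
    matC = coopMatMulAddNV(matA, matB, matC);      // C = A * B + C on the matrix units
    coopMatStoreNV(matC, c, 0, 16, false);
}
```

A full kernel would tile the k dimension and accumulate across tiles; the point here is that the whole subgroup cooperates on one matrix operation instead of each invocation computing one scalar.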

Benefits of Vulkan Compute Shaders

 * Highly parallelized computation: Compute shaders in Vulkan are designed to execute a large number of parallel computations on GPU hardware, which can provide significant performance benefits over CPU-based computations.
 * Flexibility: We can write custom shaders to perform a wide range of compute tasks, including machine learning inference. This allows for greater flexibility in optimising performance and achieving better accuracy.
 * Memory access: Compute shaders can access memory resources that are shared with graphics shaders, providing more efficient memory utilization and reducing the need for data transfers between the CPU and GPU.
 * Synchronization: Vulkan provides synchronization mechanisms for coordinating access to shared memory resources and ensuring that compute shaders execute in the correct order. This will allow us to take advantage of the parallelism of compute shaders while avoiding race conditions and other synchronization issues.

1. Identifying the layers that can benefit from GPU acceleration
Convolution layer:

It is a fundamental building block in deep neural networks. The convolution operation involves sliding a filter or kernel over an input image, computing dot products between the filter and local patches of the image to produce a feature map. The convolution layer is used extensively in the backbone network to extract high-level features from the input image. By adapting the convolution layer for acceleration using OpenGLES shaders, we can significantly speed up the computation time and improve the overall performance of the YOLOv3 model on resource-constrained devices.
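As a reference for what the shader must reproduce, here is a minimal single-channel, valid-padding 2D convolution in C. It is a sketch under simplifying assumptions (one channel, no stride or padding, my own function names); on the GPU, each output pixel would be one compute-shader invocation.

```c
#include <stddef.h>

/* Single-channel 2D convolution with "valid" padding:
 * out is (h - kh + 1) x (w - kw + 1), all buffers row-major. */
static void conv2d_valid(const float *in, size_t h, size_t w,
                         const float *kernel, size_t kh, size_t kw,
                         float *out)
{
    size_t oh = h - kh + 1, ow = w - kw + 1;
    for (size_t y = 0; y < oh; ++y)
        for (size_t x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (size_t ky = 0; ky < kh; ++ky)
                for (size_t kx = 0; kx < kw; ++kx)
                    acc += in[(y + ky) * w + (x + kx)]
                         * kernel[ky * kw + kx];
            out[y * ow + x] = acc;
        }
}
```

Darknet's real convolutional layer also loops over input and output channels and adds a bias, but the inner dot product above is the part that dominates runtime and benefits most from GPU parallelism.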

Route layer:

The route layer can also be used in the implementation to accelerate the YOLOv3 pipeline using OpenGLES. The route layer is used to concatenate feature maps from different layers. It can concatenate two or more feature maps along the channel dimension. By doing so, it enables the network to combine features learned from different layers and extract more complex features.
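The route layer's channel concatenation is simple enough to show directly. The C sketch below assumes CHW memory layout (as Darknet uses) and matching spatial sizes; the function name is illustrative.

```c
#include <string.h>
#include <stddef.h>

/* Sketch of a route layer: concatenate two CHW feature maps along the
 * channel dimension. ca/cb are channel counts, hw is h*w (spatial sizes
 * of the two inputs must match). */
static void route_concat(const float *a, size_t ca,
                         const float *b, size_t cb,
                         size_t hw, float *out)
{
    memcpy(out,           a, ca * hw * sizeof(float));
    memcpy(out + ca * hw, b, cb * hw * sizeof(float));
}
```

Because this is pure data movement, the GPU win here comes less from arithmetic and more from avoiding a round trip to the CPU between accelerated layers.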

Up-Sampling layer:

Upsampling layers can be used in the YOLO pipeline to increase the resolution of the feature maps before passing them to subsequent layers. Upsampling can be implemented using various techniques such as bilinear or nearest-neighbor interpolation, or transposed convolution.
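YOLOv3's upsample layer uses nearest-neighbour interpolation with stride 2, which can be sketched in C as follows (single channel, illustrative names):

```c
#include <stddef.h>

/* Nearest-neighbour 2x upsampling of a single-channel h x w map:
 * every input pixel is replicated into a 2x2 block of the output. */
static void upsample_nearest_2x(const float *in, size_t h, size_t w,
                                float *out)
{
    for (size_t y = 0; y < 2 * h; ++y)
        for (size_t x = 0; x < 2 * w; ++x)
            out[y * (2 * w) + x] = in[(y / 2) * w + (x / 2)];
}
```

Each output pixel depends on exactly one input pixel, so the operation maps trivially onto one shader invocation per output element.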

Region layer:

The region layer is an important layer in the YOLOv3 model that is responsible for predicting the object bounding boxes and associated class probabilities.

Maxpool layer:

The maxpool layer is used in the YOLO pipeline to downsample the feature maps and reduce their spatial resolution. It extracts the most important feature from each local region of the input feature map and shrinks its size, lowering the computational cost of subsequent layers.
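A 2x2, stride-2 maxpool (the common case) can be sketched in C as below; single channel, even dimensions assumed, names illustrative:

```c
#include <stddef.h>

/* 2x2 max pooling with stride 2 on a single-channel h x w map
 * (h and w assumed even for brevity). Output is (h/2) x (w/2). */
static void maxpool_2x2(const float *in, size_t h, size_t w, float *out)
{
    for (size_t y = 0; y < h / 2; ++y)
        for (size_t x = 0; x < w / 2; ++x) {
            float m = in[(2 * y) * w + (2 * x)];
            for (size_t dy = 0; dy < 2; ++dy)
                for (size_t dx = 0; dx < 2; ++dx) {
                    float v = in[(2 * y + dy) * w + (2 * x + dx)];
                    if (v > m) m = v;
                }
            out[y * (w / 2) + x] = m;
        }
}
```

Like convolution, every output element is independent, so the two outer loops become the shader's global invocation ID on the GPU.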

2. Writing the shader code using the OpenGLES and Vulkan APIs to perform the computations required by the selected layers on the GPU.
The shader code will need to be optimized for parallel processing. Here is an example of shader code for a convolution operation using the OpenGLES API:
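Below is a sketch of such a shader, written against OpenGLES 3.1 (the first ES version with compute shaders). It handles a single channel with valid padding; the buffer bindings, uniform names and workgroup size are illustrative assumptions, not final design decisions.

```glsl
#version 310 es
// Sketch: single-channel convolution compute shader for OpenGLES 3.1.
// Bindings, uniform names and workgroup size are illustrative.
precision highp float;
layout(local_size_x = 16, local_size_y = 16) in;

layout(std430, binding = 0) readonly  buffer Input  { float in_data[]; };
layout(std430, binding = 1) readonly  buffer Kernel { float k_data[]; };
layout(std430, binding = 2) writeonly buffer Output { float out_data[]; };

uniform int u_w;   // input width
uniform int u_ow;  // output width
uniform int u_oh;  // output height
uniform int u_k;   // kernel size (square)

void main() {
    int x = int(gl_GlobalInvocationID.x);
    int y = int(gl_GlobalInvocationID.y);
    if (x >= u_ow || y >= u_oh) return;  // guard partial workgroups

    float acc = 0.0;
    for (int ky = 0; ky < u_k; ++ky)
        for (int kx = 0; kx < u_k; ++kx)
            acc += in_data[(y + ky) * u_w + (x + kx)]
                 * k_data[ky * u_k + kx];
    out_data[y * u_ow + x] = acc;
}
```

The host would dispatch it with glDispatchCompute(ceil(ow/16), ceil(oh/16), 1) and issue a glMemoryBarrier before reading the output buffer back.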

The equivalent compute shader program for the Vulkan API:
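Vulkan consumes SPIR-V rather than GLSL source, but the kernel can be written in almost the same GLSL and compiled offline (for example with glslangValidator -V). The sketch below shows the Vulkan-specific differences: descriptor-set bindings and push constants instead of plain uniforms. Names and layout are illustrative assumptions.

```glsl
#version 450
// Sketch: the same convolution kernel targeting Vulkan, compiled to SPIR-V.
// set/binding numbers and push-constant layout are illustrative.
layout(local_size_x = 16, local_size_y = 16) in;

layout(set = 0, binding = 0) readonly  buffer Input  { float in_data[]; };
layout(set = 0, binding = 1) readonly  buffer Kernel { float k_data[]; };
layout(set = 0, binding = 2) writeonly buffer Output { float out_data[]; };

layout(push_constant) uniform Params {
    int w;   // input width
    int ow;  // output width
    int oh;  // output height
    int k;   // kernel size (square)
} p;

void main() {
    int x = int(gl_GlobalInvocationID.x);
    int y = int(gl_GlobalInvocationID.y);
    if (x >= p.ow || y >= p.oh) return;

    float acc = 0.0;
    for (int ky = 0; ky < p.k; ++ky)
        for (int kx = 0; kx < p.k; ++kx)
            acc += in_data[(y + ky) * p.w + (x + kx)]
                 * k_data[ky * p.k + kx];
    out_data[y * p.ow + x] = acc;
}
```

Push constants are used here because the layer dimensions are small and change per dispatch, which avoids updating a uniform buffer between layers.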

3. Integrate the shader code into the Darknet CNN framework, which is used to build the YOLOv3 model.
This involves modifying the existing Darknet codebase to support the OpenGLES API calls, so that the selected layers can execute on the GPU using the optimized shader code. The goal of this integration is the efficient GPU execution of those layers, resulting in faster object detection with the YOLOv3 model.

4. Compile and build the modified Darknet code with the integrated OpenGLES shaders
1. Install dependencies such as CUDA, OpenCV, etc.
2. Modify and build the Darknet code, which involves adding code to existing Darknet files or creating new ones.
3. Test and deploy the modified code.

Timeline
=Experience and approach= This project requires knowledge of neural networks, convolution, C/C++, the Linux kernel and OpenGLES.
 * I have previously worked on the GPGPU-WITH-GLES project, so I have a good understanding of the OpenGLES APIs, shaders and the Linux kernel.
 * Although Vulkan demands more involved coding, I am already familiar with the OpenGLES shading language, which will make Vulkan easier to learn.
 * I am well-versed in the different types of GPU-capable shaders and know which of them would suit this project.
 * I have also implemented operations such as matrix multiplication and matrix transpose.
 * I have been exploring neural networks and convolutions and have gained sufficient knowledge to start the implementation.
 * I also own a BeagleBone (PocketBeagle) and have tried running the Darknet framework on it.
 * I am a passionate open source enthusiast and will do the work wholeheartedly. I am committed to GSoC and will do everything in my power to finish the project within the allotted time.
 * I will keep contributing to the project after GSoC and will be interacting with the community often.

=Contingency= If I run into any blockers, I will rely on the following:
 * I have a list of resources available online, so if I get stuck I will refer to them.
 * I will use the Beagle Slack to communicate with the mentors.

=Benefit=
 * The performance of the YOLOv3 model is improved, leading to better object detection.
 * Several layer types can be accelerated at once, improving the efficiency of the model.
 * Memory usage is reduced by offloading the computations to the GPU, as discussed above.