BeagleBoard/GSoC/2023 Proposal/OpenGLES acceleration for DL Minh Le

=About Student=
 * Student: Minh Le
 * Mentors: Shreyas Atre
 * Wiki: Beagle AI-64
 * GSoC: TBD

=Status=
This project is currently just a proposal.

=Proposal=
 * Completed all the requirements listed on the ideas page.
 * The PR for the cross-compilation task: task

=About you=
 * IRC: @whoknow123:matrix.org
 * Github: yourcomrade
 * School: Saxion University of Applied Sciences
 * Country: Netherlands
 * Primary languages: English, Vietnamese
 * Typical work hours: 8am-12pm and 5pm-12am Central European Summer Time (UTC+02:00)
 * Previous GSoC participation: This is my first time participating in GSoC.

=About your project=
Project name: OpenGLES acceleration for DL

==Overview==
The goal of this project is to accelerate several types of neural network layers by using the GPU on the BeagleBoard X15/AI-64. Currently, there are three ways to program the GPU: OpenGL ES, OpenCL, or Vulkan. I chose OpenGL ES because it is supported by more GPUs than OpenCL and Vulkan. OpenGL ES has a feature called the compute shader, which allows programmers to use the GPU for general-purpose computing beyond its usual tasks such as rendering and drawing. However, compute shaders are only supported from OpenGL ES version 3.1 upward. This creates a barrier for the older BeagleBoard AI, as well as for other BeagleBoards that want to use their GPU to accelerate neural network inference. Therefore, I chose to use the fragment shader, which has been available since OpenGL ES 2.0.

The framework I chose to implement the OpenGL ES backend for is Darknet, as it is the smallest and simplest neural network framework I know. Darknet is an open-source neural network framework written in C and CUDA. It is easy to install and fast to train and run neural networks on both the CPU and the GPU, and it is used to implement many variations of the YOLO models.

==Mission==
As discussed with my mentor, the goal of this project is to port the tiny-yolov3 model running on the Darknet framework to the GPU. The model will use the GPU for inference: it loads pre-trained weights and runs on the GPU to make predictions. According to its configuration file, the tiny-yolov3 architecture uses convolutional, maxpool, up-sampling, and activation layers. Therefore, I will focus on speeding up these four layer types, and after finishing them I will integrate them into the Darknet framework.

==Expected performance==
I expect the tiny-yolov3 model to run at least 20% faster on the GPU than on the CPU. Ideally, the GPU inference time will be half of the CPU inference time.

==Tasks==
There are two tasks I have to do:
 * 1) Create a library that uses OpenGL ES to accelerate layers
 * 2) Integrate that library into the Darknet framework

==Library==
The library will be based on the GPGPU with GLES project by former GSoC contributor Jakub Duchniewicz. The idea behind using OpenGL ES for GPU computing is to draw a rectangle off-screen and transfer the data to the GPU in the form of a 2D texture. The data are represented in GPU memory as color pixels; we perform calculations on those pixels with a fragment shader and then read the results back into CPU memory.

==Detailed implementation of the library==
Each neural network layer in the library will be implemented in GLSL to speed up performance.
 * 1) Activation layer: This is a fundamental layer that every neural network must have. An activation function is a mathematical function applied to a neuron's weighted sum plus bias to decide whether the neuron should be activated. In the tiny-yolov3 configuration file, the activation functions are the leaky ReLU function and the linear function. As activation functions are easy to implement, they will be implemented first. The input tensor is passed as a texture to the shader, and the activation function is applied element-wise to the input values. The resulting values are stored in the output texture, which represents the output of the activation layer.
 * 2) Maxpool layer: The maxpool layer conducts the pooling operation: it calculates the maximum value over patches of a feature map and uses it to create a downsampled feature map. The input feature map is passed as a texture to the shader, and pooling is performed by sampling local neighborhoods of the input texture and taking the maximum value. The resulting values are stored in the output texture, which represents the downsampled feature map of the maxpool layer.
 * 3) Up-sampling layer: The up-sampling layer increases the spatial resolution of an input feature map. There are a few methods to implement this operation; based on the implementation of the up-sampling layer in the Darknet framework, the naive (nearest-neighbor) method will be used. The result can be retrieved from the output texture.
 * 4) Convolutional layer: This is the key layer in many neural networks for image recognition. Convolutional layers extract local features from the input data, such as edges, textures, and other patterns. The input image is passed as a texture to the shader, and the convolution is performed using a set of convolutional kernels. Each kernel is represented as a 2D texture, and the convolution is performed by sampling the input texture with the kernel texture and computing the dot product of the sampled values. The resulting values are accumulated into the output texture, which represents the output feature map of the convolutional layer.

==Implementation in the Darknet framework==
Currently, the Darknet framework relies on CUDA and OpenMP to accelerate computation, both for inference and for training. However, since this project aims to accelerate inference only, I will add OpenGL ES to the Darknet framework to accelerate neural network layers for inference. Here is an example of what my API may look like in the Darknet framework:
 * void forward_maxpool_layer_gles(maxpool_layer l, network net);
 * void forward_upsample_layer_gles(const layer l, network_state state);
 * void forward_activation_layer_gles(layer l, network_state state);
 * void forward_convolutional_layer_gles(convolutional_layer layer, network_state state);
These functions will be called when the Darknet network creates and connects layers to build the model. During inference, they load the layer weights and the input image into the fragment shaders, which are part of the OpenGL ES acceleration library, compute each layer, and return the result to memory. The result either becomes the input of the next layer or is used to make the final prediction, depending on the state of the neural network.
The OpenGL ES code paths will be guarded by a compile-time flag:
 * 1) #ifdef GLES
 * 2) #endif

==Deliverable results==
At the end of the project, I will deliver these outcomes:
 * 1) The library that uses OpenGL ES to accelerate layers
 * 2) Documentation for the library
 * 3) A blog post about this project
 * 4) The Darknet framework with an OpenGL ES backend for tiny-yolov3
 * 5) A comparison chart of the performance on both the CPU and the GPU

=Experience and approach=
This project requires knowledge and experience in deep learning, graphics programming, parallel computing, and embedded Linux. I have experience with OpenGL, an API for graphics programming. I am also learning CUDA to understand more about the implementation of neural networks on the GPU. In addition, I have knowledge of digital image processing, which helps me understand how neural networks process images. For embedded Linux, I have experience with the Raspberry Pi, and I have used Buildroot to build kernels and file systems for embedded Linux. As this is a complex project, I can work up to 35 hours per week on it. Moreover, I am a hard-core open-source enthusiast, and I will continue contributing to this project to accelerate other layers after GSoC.

=Contingency=
If I get stuck on my project and my mentor is not around, I will use the following resources:
 * The BeagleBone forum
 * StackOverflow
 * OpenGL forum
 * Reddit
 * Former GSoC contributor Jakub Duchniewicz
 * Beagle AI-64 Documentation

=Benefit=
The outcome of this project is that users can run the tiny-yolov3 model from the Darknet framework on the GPU of the BeagleBoard AI to make predictions. Other BeagleBoards, as well as other open-source hardware boards with a GPU, can also run the tiny-yolov3 model on the GPU with few modifications to the library. The inference performance of tiny-yolov3 on the GPU will be better than on the CPU.