BeagleBoard/GSoC/2021 Proposal/GPGPU with GLES

=GPGPU with GLES=
About
Student: Jakub Duchniewicz
Mentors: Hunyue Yau, Iain Hunter
Code: https://github.com/JDuchniewicz/GPGPU-with-GLES
Wiki: https://jduchniewicz.github.io/gsoc2021-blog/
GSoC:

=Status=
In implementation

About you
IRC: jduchniewicz
GitHub: JDuchniewicz
School: University of Turku/KTH Royal Institute of Technology
Country: Finland/Sweden/Poland
Primary language: Polish
Typical work hours: 8AM-5PM CET
Previous GSoC participation: Participating in GSoC, especially with BeagleBoard, would further develop my software and hardware skills and help me apply my current knowledge for the mutual benefit of the open source community. I planned to do the YOLO project, but after spending several days researching and preparing the proposal I found that it is impossible on the current BBAI/X15, so I want to pursue another form of heterogeneous computing: GPGPU!

About your project
Project name: GPGPU with GLES computing framework

Description
Since BeagleBoards are heterogeneous platforms, why not use them to their full extent? For the time being, the GPU block sits mostly idle and cannot assist with any computations, for non-technical reasons. The IP manufacturer provides a library, but it is not open source and thus does not fit the BB philosophy of open SW/HW. The IP manufacturer (Imagination Tech) even has a series of teaching lectures on the subject, available here. There is a University program for OpenCL, but it requires being affiliated with a university. Normally, OpenCL would be used for acceleration on the GPU, but given the limitations discussed above this is not possible. Yet there is another way: GPGPU acceleration with OpenGL ES.

The GPU accelerators from PowerVR (series SGX500) are available in all BBs, and a unified acceleration framework for them is the topic of this proposal. The targeted version of OpenGL ES is 2.0, but extending the library to later versions should not be a problem from a technical point of view, as the newer versions are backwards compatible. Moreover, later versions could open the way to new kinds of computation acceleration, such as integer and 32-bit floating-point operations. However, the IP owner is not releasing newer versions of the standard for the current BBs even though it is possible, so the only way to leverage them is on new BeagleBoards with newer chips. The API of the library will allow performing various mathematical operations (limited to a few for the GSoC part of the project) and will return the results to the user in the same format as the input data.

Apart from the library, the project will explore the performance gains when using the GPU (compared to the CPU) and provide the user with guidelines on when it is feasible to put the BB's GPU accelerator to use and when to refrain from it. This could be a good resource for a training session as mentioned later in the proposal.

Implementation
The system will comprise two layers: the API layer and the implementation layer. The API will be a set of extensible user calls for accelerating various functions, while the implementation will be the part responsible for mapping the user data to the GPU via a shader program.



The picture above presents the main components of the system.

Single shot operation
The implementation of an expensive single shot operation is described along the user flow below.


 * 1) User creates the data buffers holding the data to be processed and calls the relevant API
 * 2) The EGL context is established and a shader program is attached
 * 3) The data is translated and loaded to the GPU as a texture
 * 4) OpenGL renders the frame
 * 5) The data is read out and translated to the form chosen by the user
 * 6) The context is closed and cleaned up
 * 7) The data is returned to the user

The library should allow both single-shot computations and reuse of an established GL context, feeding the shader with new data every frame. The alternative flow (with a continuous flow of data) establishes the relevant contexts once and then only asks for user input each frame.

Continuous operation
The alternative flow (for reusing the context) is presented below.
 * 1) User calls the relevant API for initialization of the context
 * 2) The EGL context is established and a shader program is attached
 * 3) User creates the data buffers holding the data to be processed and calls the relevant API
 * 4) The data is translated and loaded to the GPU as a texture
 * 5) OpenGL renders the frame
 * 6) The data is read out and translated to the form chosen by the user
 * 7) The data is returned to the user
 * 8) The user calls the relevant API for deinitialization of the context
 * 9) The context is closed and cleaned up
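
The two flows above can be mocked on the CPU to show how the calls relate to each other. This is only a sketch under stated assumptions: the names `sketch_ctx`, `sketch_init`, `sketch_array_add`, `sketch_deinit` and `sketch_array_add_once` are hypothetical and are not the project's API, and the "render" step is replaced by a plain loop so the call sequence can be exercised without EGL or a GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context handle. In a real implementation this would own
 * the EGL display/context and the linked shader program. */
typedef struct {
    int initialized; /* 1 between init and deinit */
    int frames;      /* number of processed frames, for illustration only */
} sketch_ctx;

/* Continuous flow, steps 1-2: establish the context. */
int sketch_init(sketch_ctx *ctx)
{
    if (ctx == NULL || ctx->initialized)
        return -1;
    ctx->initialized = 1;
    ctx->frames = 0;
    return 0;
}

/* Continuous flow, steps 3-7: one frame of work on an existing context.
 * CPU stand-in for "upload texture, render, read back". */
int sketch_array_add(sketch_ctx *ctx, const float *a, const float *b,
                     float *out, size_t n)
{
    if (ctx == NULL || !ctx->initialized || !a || !b || !out)
        return -1;
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
    ctx->frames++;
    return 0;
}

/* Continuous flow, steps 8-9: close and clean up the context. */
int sketch_deinit(sketch_ctx *ctx)
{
    if (ctx == NULL || !ctx->initialized)
        return -1;
    ctx->initialized = 0;
    return 0;
}

/* Single-shot flow: the same steps wrapped in one call, with a
 * temporary context that lives only for the duration of the call. */
int sketch_array_add_once(const float *a, const float *b, float *out,
                          size_t n)
{
    sketch_ctx ctx = {0, 0};
    if (sketch_init(&ctx) != 0)
        return -1;
    int rc = sketch_array_add(&ctx, a, b, out, n);
    sketch_deinit(&ctx);
    assert(ctx.initialized == 0); /* the single-shot path must not leak */
    return rc;
}
```

In this mock the single-shot call pays the setup and teardown cost on every invocation, while the continuous variant pays it once and then only does per-frame work. That is the trade-off the two flows are meant to expose.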

Alternative ideas/Stretch goals

 * There are two GPUs on the BBAI and BBX15 - the library may be extended to make use of both of them and work in a streaming fashion
 * Incorporating the DSPs into the acceleration by means of OpenCL (though not all BBs have them)
 * Or making this part of the OpenCL acceleration scheme once it is proven to be stable and performant
 * Create a Rust version of this project :)

API Overview
The API will take the form of C bindings, which can be used in two ways:
 * single-shot calls
 * continuous data processing with an obligatory init and a de-init

The skeleton of the API is presented below:

And the helper struct for GL operations:

Simple Computation example
A very basic example showing how the data is read from the texture is shown below. The main difference between the two calls is that one will be a static call, creating all the necessary bindings with the help of the GL operations class and then cleaning up, while the other will use a persistent main API class and perform the operations without allocating and deallocating for each call.

Even though a C++ API allows for ease of expression, it may be a source of many restrictions. With a C API, Rust bindings can be made with ease and C++ code can call the C functions directly. The main benefit of a pure C form is that all MCUs can support it. Other possible extensions include functions for arbitrary math operations, such as quick rescaling, or some DL-related ones, like ReLU.

More advanced computation description
If we could use OpenGL ES 3.1, which introduced compute shaders, we could use them to parallelize arbitrary computations to the limit. Since we have to work with OpenGL ES 2.0, we need to work around this by smartly splitting the data among several fragment shaders and then executing them on different cores of the SGX GPU. Our targeted GPUs have up to 4 cores, so we will probably need to adjust the code depending on the architecture.

The more advanced operations that act on matrices, such as matrix multiplication or convolution, could be split between cores and then assembled afterwards on the CPU or the GPU. Splitting the matrix multiplication may well be ineffective, but doing convolution in parts may be much more effective.
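
As a CPU-side illustration of doing convolution in parts, the sketch below computes a causal FIR filter over index ranges and then over fixed-size chunks. The function names `fir_range` and `fir_chunked` and the chunk size are assumptions made for this example. On the GPU each chunk would presumably become one draw call or one region of the output texture, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Causal FIR filter: y[i] = sum over k of h[k] * x[i - k], where samples
 * before x[0] are treated as zero. Only outputs in [begin, end) are
 * written, so independent workers can each take one range. */
void fir_range(const float *x, size_t n, const float *h, size_t taps,
               float *y, size_t begin, size_t end)
{
    assert(x && h && y);
    assert(begin <= end && end <= n);
    for (size_t i = begin; i < end; ++i) {
        float acc = 0.0f;
        /* Each output reads up to taps - 1 samples before index i, so a
         * chunk needs that much overlap with the previous chunk's input. */
        for (size_t k = 0; k < taps && k <= i; ++k)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}

/* Same result as one call over [0, n), computed chunk by chunk. */
void fir_chunked(const float *x, size_t n, const float *h, size_t taps,
                 float *y, size_t chunk)
{
    assert(chunk > 0);
    for (size_t begin = 0; begin < n; begin += chunk) {
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        fir_range(x, n, h, taps, y, begin, end);
    }
}
```

Because every output sample depends only on the input, the chunks do not need to exchange partial sums. That is the reason convolution is expected to split more cleanly than matrix multiplication.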

Matrix multiplication can be done in several ways, varying how we access the data and accumulate it. Since the texture is mapped to GPU memory, we can load the matrix in whatever layout we want. To assess how to approach this operation we need to know more about how this GPU operates, but based on my current knowledge and on this document I assume it utilizes a cache. Knowing this, we should think about changing the order in which the data is accessed in the matrix multiplication, since the ordinary ijk addressing causes many cache-line misses. In my previous experience, the ikj ordering was the most efficient. Additionally, the operation may be divided into blocks to make even better use of the caches inside the GPU.
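
For reference, the two loop orders mentioned above look as follows on the CPU for square row-major matrices. This is a minimal sketch with hypothetical function names. It illustrates the access pattern only, and whether the same ordering helps on the SGX texture cache is exactly what the benchmarks would have to show.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook ijk order: the inner loop walks down a column of b, so
 * consecutive reads of b are n floats apart in row-major storage. */
void matmul_ijk(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* ikj order: the inner loop walks along a row of b and a row of c,
 * so both are accessed at consecutive addresses. */
void matmul_ikj(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Both functions accumulate each output element in the same k order, so they produce identical results, and only the memory access pattern differs.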

Of course all of this requires advanced knowledge about the platform we are using - SGX GPUs. This article may be of some importance for understanding the low-level details of GPU operation as well.

In order to achieve maximum performance we need to thoroughly benchmark all the small tweaks and experiment with the settings, because what is working perfectly fine on one platform may be disastrous on the other.

Detailed Implementation
After the API is called, two things happen:
 * the context is either set up or reused
 * the data is transformed from the user's format and mapped to the texture (assuming we use a texture)

The first is a matter of setting up all the necessary boilerplate code (which has been done by Hunyue in a small PoC), loading the vertex and fragment shaders, and creating the GL ES program to be run on the GPU. I assume the computations are done in the fragment shader and the vertex shader is constant for the application. The fragment shaders will be written for the various computations and will be loadable as either binary blobs or inline code.

The data has to be transformed from whatever format the user supplied to floating-point values in the range 0.0 to 1.0, because the shaders in OpenGL ES 2.0 operate only on floats. The data also has to be mapped to a texture (if we use a texture for the computation).
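
One possible CPU-side translation is sketched below, assuming the data travels in an RGBA texture with 8-bit channels (`GL_RGBA` with `GL_UNSIGNED_BYTE`). Each byte of a 32-bit float is stored in one channel, and the shader then sees each channel as the byte value divided by 255. The helper names are hypothetical, and the shader-side code that reassembles a number from the four channels is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store the bit pattern of a float in the four 8-bit channels of one
 * RGBA texel, least significant byte in the first channel. */
void pack_float_rgba8(float value, uint8_t texel[4])
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits); /* well-defined type punning */
    texel[0] = (uint8_t)(bits & 0xFFu);
    texel[1] = (uint8_t)((bits >> 8) & 0xFFu);
    texel[2] = (uint8_t)((bits >> 16) & 0xFFu);
    texel[3] = (uint8_t)((bits >> 24) & 0xFFu);
}

/* Inverse of pack_float_rgba8, used after reading the pixels back. */
float unpack_float_rgba8(const uint8_t texel[4])
{
    uint32_t bits = (uint32_t)texel[0]
                  | ((uint32_t)texel[1] << 8)
                  | ((uint32_t)texel[2] << 16)
                  | ((uint32_t)texel[3] << 24);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* How an 8-bit normalized channel appears inside the shader. */
float channel_to_normalized(uint8_t c)
{
    return (float)c / 255.0f;
}

/* Quantize a normalized shader value back to an 8-bit channel. */
uint8_t normalized_to_channel(float x)
{
    assert(x >= 0.0f && x <= 1.0f);
    return (uint8_t)(x * 255.0f + 0.5f); /* round to nearest */
}
```

The byte-level round trip is exact, so the accuracy of a scheme like this would depend entirely on how precisely the fragment shader can rebuild and re-encode the value.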

Expected Performance
GPUs have historically performed better with large data blocks and few transfers than with small blocks and many transfers. Accordingly, the goal here is to achieve good performance on large data blocks and to minimize the transfer overhead.
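
The effect of the transfer overhead can be made concrete with a simple linear cost model. This is an assumption for illustration, not a measurement: the CPU is taken to cost `cpu_per_elem` per element, and the GPU a fixed setup and transfer overhead plus `gpu_per_elem` per element. The function name and parameters are hypothetical, and the units are arbitrary as long as they are consistent.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed linear cost model:
 *   t_cpu(n) = n * cpu_per_elem
 *   t_gpu(n) = fixed_overhead + n * gpu_per_elem
 * Returns the smallest element count n for which the GPU is strictly
 * faster, or 0 if the GPU never wins under this model. */
size_t breakeven_elements(double fixed_overhead, double cpu_per_elem,
                          double gpu_per_elem)
{
    assert(fixed_overhead >= 0.0);
    assert(cpu_per_elem > 0.0 && gpu_per_elem > 0.0);
    if (gpu_per_elem >= cpu_per_elem)
        return 0; /* per-element cost is no better, overhead never amortizes */
    /* GPU wins when n > fixed_overhead / (cpu_per_elem - gpu_per_elem). */
    double threshold = fixed_overhead / (cpu_per_elem - gpu_per_elem);
    return (size_t)threshold + 1;
}
```

With measured values for the three parameters, a function like this could back the guidelines on when offloading to the GPU is worthwhile.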

The expected result is a non-stalling experience of using the API and processing frames: the user can use all the CPU resources as normal during frame processing, while the GPU(s) are occupied.

Action Items

 * set up the environment
 * set up SGX and run simple shaders
 * write the array addition
 * write the FIR convolution
 * write the matrix multiplication
 * write broadcast math operations
 * benchmark the functions with various sizes
 * write the examples
 * document
 * record video 1 + 2
 * write up about the process on my blog
 * polish the API and make it extensible

Deliverables

 * The C API with at least several accelerated operations
 * The documentation
 * Tutorials for understanding acceleration on the GPU and benchmarks

Since I am in love with the Rust programming language, as an optional deliverable I would also like to make a Rust crate with bindings to this API so that Rust users on the BBs can use it for acceleration.

Timeline
During 25.07.21-08.08.21 I have a summer camp from my study programme and will probably be occupied for half of each day. The camp will most likely be held online, though.

Experience and approach
I have a strong programming background in the area of embedded Linux/operating systems, having worked as a Junior Software Engineer at Samsung Electronics from December 2017 to March 2020. Additionally, I developed a game engine (PolyEngine) in C++ during this time and gave some talks on modern C++ as Vice-President of the Game Development Student Group "Polygon".

Apart from that, I completed my Bachelor's degree at Warsaw University of Technology, successfully defending my thesis titled: FPGA Based Hardware Accelerator for Musical Synthesis for Linux System. For this system I wrote, in Verilog, a polyphonic musical synthesizer capable of producing various waveforms and deployed it on a DE0-Nano-SoC FPGA. Additionally, I wrote two kernel drivers, one of which implemented an ALSA sound device and was responsible for proper synchronization of DMA transfers.

I am familiar with Deep Learning concepts and the basics of Computer Vision. During my studies at UTU I achieved the highest grades in my subjects, excelling at Navigation Systems for Robotics and Hardware Accelerators for AI.

I have some experience working with OpenGL, mostly from learning it for the needs of the game engine and for personal benefit. This project does not require in-depth knowledge of it. Rather, it requires creating an abstraction over the OpenGL ES bindings and performing the necessary data conversions and extractions inside the API. These are skills I already possess and am proficient in.

In my professional work, I often had to complete tasks under time pressure and scope them properly. Based on this experience, I believe that this project is deliverable in the given time frame.

Contingency
Since I am used to tackling seemingly insurmountable challenges, I will first of all keep calm and try to come up with an alternative approach if I get stuck along the way. The internet is a vast ocean of knowledge, and time and again I have received help from benevolent strangers on Reddit and other forums. Since I believe that humans are a species that solves problems best collaboratively, I will contact #beagle, #beagle-gsoc and relevant subreddits (I received tremendous help on /r/FPGA, /r/embedded and /r/askelectronics in the past).

If all else fails, I may be forced to change my approach and backtrack, but this will not be a big problem, because the knowledge won't be lost and it will only make my future approaches better. Alternatively, I can focus on documenting my progress in the form of blog posts and videos while waiting for my mentor to come back to cyberspace.

With this project I see fewer problems than with the YOLO one, mainly because the track has been explored by Hunyue and because OpenGL ES is a much more widely used framework than TIDL. However, problems may still arise, such as getting the SGX to work with the current BB image (SGX support is known to have been tricky in the past). The schedule of this year's GSoC is also much tighter, so things must be prioritized correctly. After all, many things in this project can be stretch goals (especially algorithm implementations).

Resources
The resources used for this proposal, and for contingency:
 * OpenGL ES problems - OpenGL ES book and Khronos documentation
 * Shader programming - The Book of Shaders
 * General guidelines on the optimization - NVIDIA SDK Guidelines
 * GPU parallelism - GPU architecture detailed overview
 * GPU caches description - GPU caches article
 * GPGPU - Detailed thesis about GPGPU

Benefit
As developers use BBs for computationally intensive tasks such as DL inference or ML calculations, a library that allows them to offload part of their computations to the as yet unused parts of the board would be beneficial. This way, the work may be parallelized and the BeagleBoard becomes a truly heterogeneous platform, allowing all of its components to be used!

Educating developers on when to use the GPU and when not to is also a valuable gain for the community, as not all BB developers are familiar with the limitations of CPU calculations. This could be a tempting topic for an online training/live Q&A session. A reusable library that abstracts the low-level details of OpenGL ES would reduce the hassle programmers have to go through to achieve acceleration.

Due to the same non technical issues, the GPU is rarely if ever used on the BeagleBoard. Support for current graphics output has been abysmal. This means we have an IP block that can do computations sitting idle. Even if it is not extremely fast, this can offload the ARM CPU making the Beagles a truly asymmetric multiprocessor SoC. ~ Hunyue Yau - ds2

Misc
The qualification PR is available here.