BeagleBoard/GSoC/2021 Proposal/YOLO models on the X15/AI

=YOLO models on the X15/AI=
About
Student: Jakub Duchniewicz
Mentors: Hunyue Yau
Code: not yet created!
Wiki: https://elinux.org/BeagleBoard/GSoC/2021_Proposal/YOLO_models_on_the_X15/AI
GSoC: YOLO Models on the X15/AI

=Status=
Discussing the tentative ideas with Hunyue Yau and others on #beagle-gsoc IRC.

About you
IRC: jduchniewicz
GitHub: JDuchniewicz
School: University of Turku/KTH Royal Institute of Technology
Country: Finland/Sweden/Poland
Primary language: Polish
Typical work hours: 8AM-5PM CET
Previous GSoC participation: Participating in GSoC, especially with BeagleBoard, would further develop my software and hardware (BB X15/AI architecture) skills and help me apply my current knowledge for the mutual benefit of the open source community. I aim to deliver a component which will be usable with many upcoming releases of the YOLO model and hopefully with other models.

About your project
Project name: YOLO models on the X15/AI (with an extensible interface for other models)

Description
The main idea of the project is to accelerate Deep Learning models with the help of the hardware resources available on the BB X15 and BB AI platforms. Current inference times range from 15 to 35 seconds per frame, which is far too slow for any real-time (or even slightly laggy but bearable) application. This project aims to cut these times down and to enable efficient deployment of other models once the Texas Instruments Deep Learning (TIDL) library supports them (RNNs, LSTMs and GRUs are planned to be released).

As more and more developers recognize the benefits of DL and of specialized hardware for accelerating these calculations, including such support is vital for the BeagleBoard community. Additionally, adding such a component may encourage new developers interested in developing DL solutions for embedded systems to join the effort and grow the BB developer community.

This project serves as a Pick and Place solution for running models on the BBAI and X15: the programmer adds their own network and its parameters using the TIDL translation tool and calls the relevant API. This way, the programmer can quickly and effectively test the trained network on hardware.

The main focus of this project is to accelerate the YOLOv3 model with the TIDL library, using C++ and maybe some intrinsics. In the past, some YOLOv3 layers were not supported by TIDL; this should be fixed in this release (there is no confirmation on the forums for the AM5729, but the release notes of PSDK 6.03 mention support for using EVEs and DSPs simultaneously).

Both the X15 and the AI could use the YOLOv3 models because they are built around very similar SoCs, the AM5728 and the AM5729. Both have DSPs and EVEs, which I plan to use for this task.

The end product of this project is a GitHub repository with the wrapper API and the implementation of acceleration on the BBAI and X15. The repository can be cloned and built as a library; whether it can also be included as a single header is still to be checked.

Alternative and Extension ideas

 * Using YOLOv4 instead of YOLOv3; it apparently runs 1 FPS faster than v3 on a Jetson Nano.
 * Using TFLite instead of TIDL. TI's plans for the TIDL library do not look promising, and users will probably be encouraged to use TFLite and AWS SageMaker instead. However, this removes some fine-grained control over the model inference pipeline.
 * Ideally this would be a cornerstone for future model deployments, allowing a BB user to use DL models simply and quickly. The interface should support various degrees of acceleration: no acceleration, EVEs only, DSPs only, EVEs + DSPs, and mixed acceleration (including ARM/NEON). The user would also be able to vary the degree of parallelization of the acceleration.
 * Having an option to run the models through a unified high-level API on regular BBs (BBB or BBV) would be desirable, with custom acceleration added for them in the future.

Implementation
The accompanying picture (which the wiki has so far refused to accept as a .png) visualizes how this component will fit into the TIDL ecosystem and allow easier deployment of YOLOv3 models on BB.

My research into the TIDL framework suggests that it is possible to accelerate all layers of YOLOv3, as shown in the paper. Support for the upsample layer still needs to be confirmed.

Corresponding layers are:

If the network can be deployed solely with the TIDL library, offloading computations to ARM/NEON is not necessary. Otherwise, such computations have to be performed on the ARM side and synchronized with the rest of the computation graph.

Since there are 4 EVEs and 2 DSPs on the BBAI/X15, the layers could be grouped so that each processor type plays to its strengths. For example, the EVEs could focus on 2D convolution operations while the DSPs handle Pooling and Softmax, operations present in the YOLOv3 architecture.
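
The grouping idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `LayerKind`, `Device`, `LayerGroup` and `group_layers` are hypothetical names invented for this sketch, not TIDL identifiers, and the placement rule is simply the heuristic from the paragraph above (convolutions on EVEs, everything else on DSPs).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layer categories and device types; these are NOT TIDL names.
enum class LayerKind { Conv2D, Pooling, Softmax };
enum class Device { EVE, DSP };

// Placement heuristic from the text: convolutions on EVEs, the rest on DSPs.
inline Device preferred_device(LayerKind kind) {
    return kind == LayerKind::Conv2D ? Device::EVE : Device::DSP;
}

struct LayerGroup {
    Device device;            // processor type the group is meant to run on
    std::size_t first_layer;  // index of the first layer in the group
    std::size_t layer_count;  // number of consecutive layers in the group
};

// Merge consecutive layers that prefer the same device into one group, so that
// data only crosses between processor types at group boundaries.
inline std::vector<LayerGroup> group_layers(const std::vector<LayerKind>& layers) {
    std::vector<LayerGroup> groups;
    for (std::size_t i = 0; i < layers.size(); ++i) {
        const Device device = preferred_device(layers[i]);
        if (!groups.empty() && groups.back().device == device) {
            ++groups.back().layer_count;
        } else {
            groups.push_back({device, i, 1});
        }
    }
    return groups;
}
```

Fewer, larger groups mean fewer hand-overs between EVEs and DSPs, so a real partitioning would probably also weigh transfer cost against the per-layer speed-up rather than follow the heuristic blindly.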

Because the network is largely sequential (each layer depends on the output of the previous ones), it cannot be parallelized naively. We must construct a good dependency graph and see where different computation resources can be utilized.
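
One way to read parallelism off such a dependency graph is to sort it into stages: every node in a stage has all of its inputs ready, so the nodes of one stage could in principle be dispatched to different compute resources. The sketch below is a plain level-by-level topological sort; `schedule_stages` and the graph encoding are assumptions made for illustration and are not part of TIDL.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using NodeList = std::vector<std::size_t>;

// deps[i] lists the nodes whose output node i consumes.
// Returns the nodes grouped into stages; nodes within one stage do not depend
// on each other and are candidates for running on different resources.
inline std::vector<NodeList> schedule_stages(const std::vector<NodeList>& deps) {
    const std::size_t n = deps.size();
    std::vector<std::size_t> pending(n, 0);  // unmet dependencies per node
    std::vector<NodeList> consumers(n);      // reverse edges
    for (std::size_t node = 0; node < n; ++node) {
        pending[node] = deps[node].size();
        for (std::size_t producer : deps[node]) {
            assert(producer < n);
            consumers[producer].push_back(node);
        }
    }

    NodeList ready;
    for (std::size_t node = 0; node < n; ++node) {
        if (pending[node] == 0) ready.push_back(node);
    }

    std::vector<NodeList> stages;
    std::size_t scheduled = 0;
    while (!ready.empty()) {
        NodeList next;
        for (std::size_t node : ready) {
            for (std::size_t consumer : consumers[node]) {
                if (--pending[consumer] == 0) next.push_back(consumer);
            }
        }
        scheduled += ready.size();
        stages.push_back(std::move(ready));
        ready = std::move(next);
    }
    assert(scheduled == n);  // fails if the graph contains a cycle
    (void)scheduled;
    return stages;
}
```

For a purely sequential chain every stage holds exactly one node, which is the case the paragraph above warns about; only branches in the graph produce stages with more than one node.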

The TIDL library API guidelines propose 3 approaches to utilizing the library:
 * One Execution Object (hardware resource) per frame
 * Splitting a frame between EOs
 * Utilizing Execution Object Pipelines for double buffering

Among these, the first one is the least efficient and the last one seems to be the most efficient. At the moment I am not sure whether a frame could be effectively split between EOs without losing some detections (this could probably be worked around somehow, but it would be very difficult).

YOLOv3 utilizes bounding boxes with anchors, which often span significant areas of the image. Chopping the image into parts would make it impossible to detect objects that cross the boundaries between parts.
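
To make the problem concrete, the sketch below counts how many tiles of a regular grid a single bounding box touches. Any box that touches more than one tile would be seen only partially by each Execution Object. The `Box` type, the `tiles_spanned` helper and the 416x416 image size used in the usage note are illustrative assumptions, not values taken from TIDL or from this project.

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned box in pixels, covering [x0, x1) horizontally and [y0, y1) vertically.
struct Box { int x0, y0, x1, y1; };

// Clamp a tile index into the valid range [0, count - 1].
inline int clamp_index(int value, int count) {
    return std::max(0, std::min(value, count - 1));
}

// Number of tiles of a cols x rows grid that the box overlaps.
inline int tiles_spanned(const Box& box, int image_w, int image_h, int cols, int rows) {
    assert(image_w > 0 && image_h > 0 && cols > 0 && rows > 0);
    assert(box.x0 < box.x1 && box.y0 < box.y1);
    // Round the tile size up so that the grid covers the whole image.
    const int tile_w = (image_w + cols - 1) / cols;
    const int tile_h = (image_h + rows - 1) / rows;
    const int col_first = clamp_index(box.x0 / tile_w, cols);
    const int col_last = clamp_index((box.x1 - 1) / tile_w, cols);
    const int row_first = clamp_index(box.y0 / tile_h, rows);
    const int row_last = clamp_index((box.y1 - 1) / tile_h, rows);
    return (col_last - col_first + 1) * (row_last - row_first + 1);
}
```

With an assumed 416x416 image split into a 2x2 grid, a centered box such as `Box{150, 150, 300, 300}` touches all four tiles, while a small corner box such as `Box{10, 10, 100, 100}` stays inside one.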

We can use double buffering, though, if we want to achieve good streaming performance and mitigate the low FPS that will surely result from the limited resources. Having two pipelines running simultaneously can help reduce stuttering and give the user smoother detection.
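
A rough cost model shows what double buffering can and cannot buy. It assumes that each frame has a preparation step (reading and preprocessing on the ARM side) followed by an inference step on an accelerator, and that with two buffers the preparation of the next frame overlaps with inference on the current one. The function names and the millisecond figures in the usage note are made up for illustration and are not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Total time when every frame is prepared and then inferred, strictly one after another.
inline long long sequential_ms(long long frames, long long prep_ms, long long infer_ms) {
    assert(frames >= 0 && prep_ms >= 0 && infer_ms >= 0);
    return frames * (prep_ms + infer_ms);
}

// Total time with two buffers: preparing frame k+1 overlaps with inference on
// frame k, so in steady state the slower of the two steps sets the frame period.
inline long long double_buffered_ms(long long frames, long long prep_ms, long long infer_ms) {
    assert(frames >= 0 && prep_ms >= 0 && infer_ms >= 0);
    if (frames == 0) return 0;
    return prep_ms + (frames - 1) * std::max(prep_ms, infer_ms) + infer_ms;
}
```

For example, 10 frames with 200 ms of preparation and 800 ms of inference take 10000 ms sequentially and 8200 ms double-buffered. The model also makes the limit visible: overlap hides the shorter step but never makes the frame period smaller than the inference time itself.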

Combining these approaches would be ideal if somehow we can circumvent the inability to chop the image into several parts.

In case there are problems with available memory, the OpenCL DDR3 memory area can be extended as mentioned here. This should allow for efficient utilization of available RAM and deployment of heavier models, like YOLOv3.

It is possible that some parts of the model will not be deployable or will have subpar efficiency; in that case, special subroutines using NEON FP instructions will have to be written. Raw DSP programming and synchronizing heterogeneous processors using OpenCL may also fall within the scope of this project if problems arise.

API Overview
User chooses the model (for now just YOLOv3)
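
The API itself is not designed yet, so the following is only a guess at the shape its configuration could take, based on the acceleration modes listed under the extension ideas. Every name here (`Model`, `Acceleration`, `DeviceCounts`, `InferenceOptions`, `accelerators_to_use`) is hypothetical; none of them exists in TIDL or in a published repository.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical configuration types for the wrapper API.
enum class Model { YOLOv3 };  // for now just YOLOv3
enum class Acceleration { None, EvesOnly, DspsOnly, EvesAndDsps };

// Accelerators present on the board; meant to be queried at run time.
struct DeviceCounts {
    int eves;
    int dsps;
};

struct InferenceOptions {
    Model model = Model::YOLOv3;
    Acceleration acceleration = Acceleration::EvesAndDsps;
    int max_parallel = 1;  // user-chosen cap on the degree of parallelization
};

// Number of accelerators the wrapper would use for the given options.
inline int accelerators_to_use(const InferenceOptions& opts, DeviceCounts available) {
    assert(opts.max_parallel >= 0);
    int usable = 0;
    switch (opts.acceleration) {
        case Acceleration::None:        usable = 0; break;
        case Acceleration::EvesOnly:    usable = available.eves; break;
        case Acceleration::DspsOnly:    usable = available.dsps; break;
        case Acceleration::EvesAndDsps: usable = available.eves + available.dsps; break;
    }
    return std::min(usable, opts.max_parallel);
}
```

Passing the device counts in, instead of hard-coding them, keeps the same options usable on boards with a different mix of accelerators.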

Expected Performance
Based on previous deployments of YOLOv3 on various embedded systems (Jetson Nano, Raspberry Pi), the expected results are:
 * 1 FPS with full acceleration, or
 * 2-3 FPS with full acceleration if the parallelization can be done in a smart way

The expected result is a non-stalling experience when using the API and processing frames. The user can use all the CPU resources as usual during frame processing, while the EVEs and DSPs remain occupied.
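
For scale, the current latency and the targets above can be related with two one-line formulas. The helpers below are a trivial illustrative sketch; their names are invented for this page.

```cpp
#include <cassert>

// Frames per second for a given per-frame latency in seconds.
inline double fps_from_latency(double seconds_per_frame) {
    assert(seconds_per_frame > 0.0);
    return 1.0 / seconds_per_frame;
}

// Speed-up factor needed to get from the current latency to a target frame rate.
inline double required_speedup(double current_seconds_per_frame, double target_fps) {
    assert(current_seconds_per_frame > 0.0 && target_fps > 0.0);
    return current_seconds_per_frame * target_fps;
}
```

At 15 to 35 seconds per frame the boards currently deliver roughly 0.03 to 0.07 FPS, so reaching 1 FPS requires a 15x to 35x speed-up, and reaching 3 FPS from the 35-second case requires 105x.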

Action Items

 * set up the environment
 * run sample models on the BB
 * obtain a deployable YOLOv3 model
 * load the model to TIDL
 * create initial scaffolding
 * write basic implementation using one EO
 * partition the model and create a graph
 * parallelize the implementation
 * research dividing the image into smaller segments for parallelization
 * polish the implementation
 * record video 1 + 2
 * write up about the process on my blog
 * benchmark different approaches
 * polish the API and make it extensible
 * write a sample user application(s)

Timeline
During 25.07.21-08.08.21 I have a summer camp organized by my study programme and will probably be occupied for half of each day. The camp will most likely be held online, though.

Experience and approach
I have a strong programming background in embedded Linux and operating systems, gained as a Junior Software Engineer at Samsung Electronics from December 2017 to March 2020. Additionally, I developed a game engine (PolyEngine) in C++ during this time and gave some talks on modern C++ as Vice-President of the Game Development Student Group "Polygon".

Apart from that, I completed my Bachelor's degree at Warsaw University of Technology, successfully defending my thesis titled "FPGA Based Hardware Accelerator for Musical Synthesis for Linux System". For this thesis I created, in Verilog, a polyphonic musical synthesizer capable of producing various waveforms and deployed it on a De0 Nano SoC FPGA. Additionally, I wrote two kernel drivers; one of them encompassed an ALSA sound device and was responsible for proper synchronization of DMA transfers.

I am familiar with Deep Learning concepts and the basics of Computer Vision. During my studies at UTU I achieved the highest grades in my subjects, excelling at Navigation Systems for Robotics and Hardware accelerators for AI.

In my professional work, I often had to complete tasks under time pressure and scope them properly. Based on this experience, I believe that this project is deliverable in the mentioned time-frame.

Contingency
Since I am used to tackling seemingly insurmountable challenges, I will first of all keep calm and try to come up with an alternative approach if I get stuck along the way. The internet is a vast ocean of knowledge, and time and again I have received help from benevolent strangers on reddit and other forums. Since I believe that humans are a species that solves problems best collaboratively, I will contact #beagle, #beagle-gsoc and relevant subreddits (I received tremendous help on /r/FPGA, /r/embedded and /r/askelectronics in the past).

If all else fails, I may be forced to change my approach and backtrack, but this will not be a big problem, because the knowledge gained won't be lost and it will only make my future attempts better. Alternatively, I can focus on documenting my progress in the form of blog posts and videos while waiting for my mentor to come back to cyberspace.

In case I cannot deploy YOLOv3, I will go with its older sibling, YOLOv2, or even the _tiny_ version of v3. With v2, we will lose some accuracy and speed of execution, which is why I prefer to deploy the YOLOv3 model. With the _tiny_ version, we will have very short inference times (hence high FPS), but the accuracy will be much lower, and since the BBAI and X15 are quite powerful, deploying _tiny_ models would underuse their resources.

The resources used for this proposal, and for contingency:

Benefit
The BB X15 and BB AI will be able to perform inference using YOLOv3 models in near real-time (maybe even making these boards usable for complex computer vision tasks). Additionally, the BeagleBoard software codebase will gain a good interface for deploying other models, one which abstracts the low-level details of interacting with TIDL (or some other library future boards may use). The software will be prepared for the rollout of newer and more advanced models.

With such a component, the BB community could easily deploy the most common models on various boards and in turn reduce the effort required to implement a computer vision solution.

As mentioned earlier, making this project a Pick and Place solution would greatly reduce the time and effort needed to go from training to deployment for the chosen networks.

Misc
The qualification PR is available here.