BeagleBoard/GSoC/2020Proposal/PrashantDandriyal 2

This page contains the second (revised) proposal for the project "YOLO Models on x15/AI"; it supersedes the initial proposal. The first proposal can be found here

=BeagleBoard/GSoC/Proposal : YOLO models on the X15/AI=

The project aims to run YOLO (v2-tiny) model inference on the BeagleBone AI at an improved rate of ~30 FPS by leveraging the on-board hardware accelerators and inference optimisations through the TIDL API and TIDL library. The model is to be converted and imported to meet the API requirements.

Student: Prashant Dandriyal

Mentors: Hunyue Yau

Code: https://github.com/PrashantDandriyal/GSoC2020_YOLOModelsOnTheX15

Wiki: https://elinux.org/BeagleBoard/GSoC/2020Proposal/PrashantDandriyal

GSoC entry

=Status=
This project is currently just a proposal.

=Proposal=
Completed the pre-requisites posted on the ideas page and created a pull request (#135) to demonstrate the cross compilation.

==About you==

IRC: pradan

Github: PrashantDandriyal

School: Graphic Era University, Dehradun

Country: India

Primary languages: English, Hindi

Typical work hours: 12PM-6PM IST

Previous GSoC participation: None. My aim in participating is to bring inference to edge devices: bringing the computation to the data rather than the other way round.

==About your project==
Project name: YOLO models on the X15/AI

===Description===
''In 10-20 sentences, what are you making, for whom, why and with what technologies (programming languages, etc.)? ''

The project objectives can be fulfilled by following two paths: optimising the model (path 1) or optimising the inference method (path 2). We propose to follow both paths with limited scope, as defined in the upcoming sections. Texas Instruments provides an API, TIDL (Texas Instruments Deep Learning), which supports both paths. For path 1, we intend to use the API to convert the darknet-19-based model into the intermediate format accepted by TIDL. We target the YOLO v2-tiny model rather than YOLO v3-tiny because not all of the v3 layers are currently supported by the API. Also, the v2 model fits into the on-chip memory of the BeagleBone AI and the x15. The v2 model is available in MobileNet and Caffe versions, both of which are supported for model import by TIDL. The Python method for a similar model import is as follows:

python "tensorflow\python\tools\optimize_for_inference.py" --input=mobilenet_v1_1.0_224_frozen.pb --output=mobilenet_v1_1.0_224_final.pb --input_names=input --output_names="MobilenetV1/Predictions/Softmax"
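The subsequent TIDL import step is driven by a plain-text configuration file. A minimal sketch follows, assuming the frozen TensorFlow graph produced above; the file names and exact field values here are illustrative only, and field names should be checked against the TIDL import tool documentation for the Processor SDK version in use:

```
# Hypothetical TIDL import configuration (verify fields against the SDK docs)
randParams        = 0
modelType         = 1                       # 0: Caffe, 1: TensorFlow
quantizationStyle = 1                       # dynamic quantization
inputNetFile      = "mobilenet_v1_1.0_224_final.pb"
outputNetFile     = "tidl_net_model.bin"    # network binary consumed by TIDL
outputParamsFile  = "tidl_param_model.bin"  # weights binary consumed by TIDL
```

The import tool consumes this file and emits the network and parameter binaries that the TIDL runtime loads onto the EVE/DSP cores.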

The model import alone, combined with the techniques employed in path 2 (discussed later), should produce satisfactory results. For a further improvement in performance, we plan to re-train the model using the caffe-jacinto fork of Caffe. This Texas Instruments-ported framework offers the following benefits:
 * Efficient CNN configuration (grouped-convolution method)
 * Introducing sparsity and quantization in models
 * Layer grouping (for instance, the Convolution and ReLU layers are clubbed to reduce processing cost)

For path 2, we use the TIDL library (a part of the TIDL API) to modify how inference is performed. The BeagleBone AI offers at least 2 Embedded Vision Engines (EVEs) and 2 C66x DSPs, which help accelerate frame processing through multicore operation (with a ~64MB memory buffer per core). We propose to use this to split the network layers between the EVE subsystem and/or the C66x cores. For example, to utilise all 4 accelerators (2 EVEs + 2 DSPs) on four different frames concurrently, the following arguments are passed:

./tidl_classification -g 2 -d 1 -e 2 -l ./imagenet.txt -s ./classlist.txt -i 1 -c ./stream_config_j11_v2.txt

where the -e and -d arguments denote the number of EVEs and DSPs respectively, and -g the number of layer groups. The configuration file that specifies the network can also be used to assign layers to a group processed on a particular core, e.g. assigning layers 12, 13 and 14 to layer group 2:

layerIndex2LayerGroupId = { {12, 2}, {13, 2}, {14, 2} }
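The four-way frame distribution described above can be sketched conceptually in plain Python, independent of the TIDL API; the accelerator names and function are illustrative only, mimicking how frames are pipelined across the execution objects in a round-robin fashion:

```python
from collections import deque

# Hypothetical accelerator pool: 2 EVEs + 2 C66x DSPs, as on the BeagleBone AI.
ACCELERATORS = ["EVE1", "EVE2", "DSP1", "DSP2"]

def dispatch_frames(frames, accelerators=ACCELERATORS):
    """Assign each incoming frame to an accelerator round-robin, so that up to
    four frames are processed concurrently, one per core."""
    pool = deque(accelerators)
    schedule = []
    for frame_id in frames:
        core = pool[0]
        pool.rotate(-1)  # the next frame goes to the next core in the pool
        schedule.append((frame_id, core))
    return schedule

schedule = dispatch_frames(range(6))
# frame 0 -> EVE1, frame 1 -> EVE2, frame 2 -> DSP1, frame 3 -> DSP2, frame 4 -> EVE1, ...
```

In the real TIDL API this scheduling is handled by the runtime's execution objects rather than user code; the sketch only illustrates why four accelerators roughly quadruple the aggregate frame rate.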

The software stack is shown in the figure below.

===Timeline===
''Provide a development timeline with a milestone for each of the 11 weeks and any pre-work.''

==Experience and approach==
''In 5-15 sentences, convince us you will be able to successfully complete your project in the timeline you have described.''

I have been familiar with 32-bit microcontrollers by Texas Instruments since the sophomore year of my Electronics and Communications Engineering degree. I was a quarter-finalist in the India Innovation Challenge and Design Contest (IICDC-2018), during which our team was provided with Texas Instruments resources such as the CC26X2R1 and TIVA LAUNCHPAD (EK-TM4C123GXL) EVMs. I have been studying Machine Learning for about a year now, mostly using TensorFlow-Keras as the primary API. I have also participated in some ML competitions for better exposure to real problems. In my current semester, I have Neural Networks as a credit subject, although I am already working on the topic in relation to on-device learning for low-computation devices. I have implemented some simple neural networks in C, which can be found in my GitHub account. I have also studied digital signal processing as a credit subject and expect it to strengthen my understanding of the convolutional neural networks used in this project. Regarding languages, I have good experience with C and C++, both of which are the primary languages needed for this project. I have used C++ in several coding competitions held by Google.

The feasibility of the project is supported by several factors. The project is similar to one of the demos provided with the TIDL API: the Single Shot multi-box Detector (SSD) demo uses a similar approach with the following differences:

The output of both examples is similar. The performance of the TIDL demo can therefore be used to approximate the performance of the YOLO v2-tiny model. With the overhead distributed between the EVEs and C66x cores, the frame processing time is around 170 ms, which fulfils our target of running an inference in under 1 s. The YOLO v2-tiny model was selected after cross-checking the supported neural network layers. The effect of the on-board accelerators has been validated by articles and technical papers from Texas Instruments. Here are the findings of Pekka Varis, who ran image segmentation on the Sitara processor that we aim to use (the AM5729): he observed a 30-40% decrease in latency per frame when using the AM5729 compared to the AM5749, and an improved rate of 45 fps. Further, the use of the JacintoNet11 dense and sparse models (both trained on Caffe-Jacinto) enhances the performance dramatically. We expect to use similar methods and obtain corresponding performance.
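As a rough sanity check on these numbers (a back-of-the-envelope estimate, not a measured result): with ~170 ms per frame on a single pipeline and four accelerators processing frames concurrently, the ideal aggregate throughput approaches the project's ~30 FPS target:

```python
frame_time_ms = 170      # per-frame processing time observed in the TIDL SSD demo
num_accelerators = 4     # 2 EVEs + 2 C66x DSPs on the BeagleBone AI

single_pipeline_fps = 1000 / frame_time_ms                # throughput on one core
aggregate_fps = single_pipeline_fps * num_accelerators    # ideal 4-way pipelining

print(round(single_pipeline_fps, 1))  # 5.9
print(round(aggregate_fps, 1))        # 23.5
```

In practice, scheduling overhead and shared-memory contention will lower the aggregate figure, which is why the path-1 model optimisations (sparsity, quantization, layer grouping) are also needed to close the gap to ~30 FPS.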

==Contingency==
''What will you do if you get stuck on your project and your mentor isn't around?''

The set of resources / go-to places I would use is as follows:
 * The TIDL documentation and the official guide for the processor SDK are the first go-to resources, along with the TIDL programming model for programming reference, syntax and debugging issues
 * The AM572X training series by Texas Instruments, which will be quite helpful in the beginning
 * The TIDL API repository at git.ti
 * The E2E forum for any unknown error/issue; I have already gained experience interacting with experts on the forum
 * The Embedded Linux classes by Mark A. Yoder on elinux, which include detailed content on the x15 board
 * The BeagleBone communities, if the problem is related to the boards; I have observed the open-source community (including the Google group) to be quite active.
 * If the issue is related to the models, the documentation of the relevant framework and its GitHub issues section. The YOLO models have been around for a while now, so the support is quite good by now.

==Benefit==
''If successfully completed, what will its impact be on the BeagleBoard.org community? Include quotes from BeagleBoard.org community members''

With the completion of this project, the documentation will act as a beginner's guide to bringing AI to the BeagleBone AI and (with some modifications) possibly to other BeagleBone boards too. For developers, the performance results will help in benchmarking similar edge devices. As TIDL is still quite young, the observations arising from this project will help collect issues and validations. Also, as the niche field of Edge AI is booming, future GSoC projects can use the outcomes of this project to carry the work forward in different directions.

''With less than 2 percentage point accuracy compromise, sparsification and TI's EVE-optimized deep-learning network model JacintoNet11, it is possible to improve the inference latency even further.'' : Pekka Varis in his blog

One thing to note, if you are not using the TIDL API for your Vision AI apps, such as by porting over Raspberry Pi OpenCV code, then you are not using the accelerated TIDL hardware. You're not gaining a thing. : sjmill01 in his article on element14

==Misc==
''Please complete the requirements listed on the ideas page. Provide link to pull request.''

Completed; the pull request demonstrating the cross compilation is #135.

==Suggestions==
''Is there anything else we should have asked you?''