Difference between revisions of "BeagleBoard/GSoC/2020Proposal/PrashantDandriyal 2"

Latest revision as of 04:40, 31 March 2020

This page contains the second proposal, revising the initial project proposal for YOLO models on the X15/AI. The first proposal can be found here

BeagleBoard/GSoC/Proposal : YOLO models on the X15/AI

{{#ev:youtube|Jl3sUq2WwcY||right|BeagleLogic}} The project aims at running YOLO v2-tiny model inference on the BeagleBone AI at an improved rate of ~30 FPS by leveraging the on-board hardware accelerators and inference optimisations through the TIDL API and TIDL library. The model is to be converted and imported to suit the API requirements.

Student: Prashant Dandriyal
Mentors: Hunyue Yau
Code: https://github.com/PrashantDandriyal/GSoC2020_YOLOModelsOnTheX15
Wiki: https://elinux.org/BeagleBoard/GSoC/2020Proposal/PrashantDandriyal
GSoC entry

Status

This project is currently just a proposal.

Proposal

I have completed the pre-requisites posted on the ideas page and created a pull request demonstrating the cross compilation: #135.

About you

IRC: pradan
Github: PrashantDandriyal
School: Graphic Era University, Dehradun
Country: India
Primary language: English, Hindi
Typical work hours : 12PM-6PM IST
Previous GSoC participation: None. The aim of participating remains bringing inference to edge devices: bringing the computation to the data rather than the other way round.

About your project

Project name: YOLO models on the X15/AI

Description

In 10-20 sentences, what are you making, for whom, why and with what technologies (programming languages, etc.)?

The project objectives can be fulfilled along two paths: optimising the model (path 1) or optimising the inference method (path 2). We propose to follow both paths with limited scope, as defined in the upcoming sections. The software stack is shown in the figure below.

Figure 1 : Software stack. Derived from [TIDL docs]

Texas Instruments provides an API, TIDL (Texas Instruments Deep Learning), which is used on both paths. For path 1, we intend to use the API to convert the Darknet-19 based YOLO v2-tiny model into the intermediate format accepted by TIDL. We target the YOLO v2-tiny model over the YOLO v3-tiny model because not all of the v3 layers are currently supported by the API; the v2 model also fits into the on-chip memory of the BeagleBone AI and the X15. The v2 model is available in MobileNet and Caffe versions, both of which are supported for model import by TIDL. A similar model import is prepared in Python as follows:

python "tensorflow\python\tools\optimize_for_inference.py"  --input=mobilenet_v1_1.0_224_frozen.pb  --output=mobilenet_v1_1.0_224_final.pb --input_names=input  --output_names="MobilenetV1/Predictions/Softmax"

The model import should produce satisfactory results when combined with the techniques employed in path 2 (discussed later). Post model-import, we leverage some optimisations on the converted model (.bin file):

  • Efficient CNN configuration through automatic layer combination during the import process.
  • Introducing Sparsity and Quantization in models

These techniques are enabled through the 'configuration file' used during every model import.
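For illustration, a minimal import configuration might look like the sketch below. The field names follow TI's published TIDL import examples, but all file names and values here are placeholders rather than the project's actual settings:

```
# Illustrative TIDL import configuration (all paths are placeholders)
randParams         = 0
modelType          = 1          # 0: Caffe, 1: TensorFlow
quantizationStyle  = 1          # dynamic quantization
quantRoundAdd      = 25
numParamBits       = 12
inputNetFile       = "yolov2_tiny_frozen.pb"
inputParamsFile    = "NA"
outputNetFile      = "tidl_net_yolov2_tiny.bin"
outputParamsFile   = "tidl_param_yolov2_tiny.bin"
```

The sparsity and quantization options above are the levers referred to in the optimisation list.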

For Path 2, we use the TIDL library (a part of TIDL API) to modify how the inference is made. The BeagleBone AI offers 4 Embedded Vision Engines (EVEs) and 2 C66x DSPs which help accelerate the frame processing by using multicore operation (with ~64MB memory buffer per core). Using these cores allows us to

  • Distribute the per-frame load using 'double buffering'
  • Distribute frames among cores
  • Distribute the network load of each frame among cores (better known as 'layer grouping')

We use these techniques in 2 approaches: 'approach_1' and 'approach_2' as explained in the demonstration section.
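As a rough sketch (plain Python, not TIDL API code), the frame-distribution scheme above amounts to round-robin assignment of incoming frames to the available cores; the EO names below are illustrative:

```python
# Illustrative only: round-robin distribution of frames among EOs
# (4 EVEs + 2 DSPs), as in the 'distribute frames among cores' scheme.

def frames_per_eo(frames, eos):
    """Assign each incoming frame to the next core, round-robin."""
    schedule = {eo: [] for eo in eos}
    for i, frame in enumerate(frames):
        schedule[eos[i % len(eos)]].append(frame)
    return schedule

eos = ["EVE0", "EVE1", "EVE2", "EVE3", "DSP0", "DSP1"]
frames = ["frame%d" % i for i in range(8)]
print(frames_per_eo(frames, eos))
```

With six EOs, up to six frames are in flight concurrently, which is the basis of the per-frame speed-up discussed in the demonstration section.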

Demonstration

This section details the demos I created to highlight the programming model of the proposed project.

1) Approach 1: One Execution Object (EO) per frame (EVEs + DSPs)

We process 1 frame per EO, or 1 per EOP (4 EVEs and 2 DSPs), which means 6 frames in flight at a time, one per EO. The above-mentioned demo uses 2 EVEs + 2 DSPs (4 EOs), but for layer grouping rather than for distributing frames, so its overall effect is that of a single frame at a time. This method doesn't leverage layer grouping. The expected performance is 6x (10 ms + 2 ms API overhead). The method is memory-intensive because each EO is allotted its own input and output buffers. The source code is developed assuming pre-processed input data is available; in all other cases, OpenCV tools are readily available to do the pre-processing.
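As an illustration of the pre-processing step assumed by the source code, the sketch below converts a camera frame into a flat planar (channel-first) 8-bit buffer. The 416x416 planar layout is an assumption based on the YOLO v2-tiny input size; in practice cv2.resize would be used, but a nearest-neighbour resize is written out here to keep the sketch dependency-light:

```python
import numpy as np

def preprocess(frame, size=416):
    """HxWx3 uint8 frame -> flat planar (CHW) buffer of 3*size*size bytes."""
    h, w, _ = frame.shape
    ys = np.arange(size) * h // size      # nearest-neighbour row indices
    xs = np.arange(size) * w // size      # nearest-neighbour column indices
    resized = frame[ys][:, xs]            # (size, size, 3)
    planar = resized.transpose(2, 0, 1)   # HWC -> CHW (planar)
    return np.ascontiguousarray(planar).ravel()

buf = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))
print(buf.shape)  # (519168,) == 3 * 416 * 416
```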

Source Code: approach_1

Network heap size : `64MB/EO x 6 EO = 384MB`
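The heap arithmetic above can be checked directly; the 64 MB/EO figure is the per-core buffer size quoted earlier, and the approach 2 total is derived the same way:

```python
# Per-EO network heap is ~64 MB; total memory grows with the number of EOs.
HEAP_PER_EO_MB = 64

def total_heap_mb(num_eos):
    return HEAP_PER_EO_MB * num_eos

print(total_heap_mb(6))  # approach 1: 4 EVEs + 2 DSPs -> 384 MB
print(total_heap_mb(4))  # approach 2: 4 EVEs only     -> 256 MB
```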

2) Approach 2: Two EOs per frame using Double Buffering (EVEs only)

Figure 2 : Pipeline of EOs. Derived from the TIDL API documentation

The second approach is similar to the one adopted in the imagenet demo of TIDL, but the DSPs are replaced with additional EVEs. The pipelining used in that demo can also be used to understand this approach; for further detail, refer to the GitHub page of the demo. The TIDL device translation tool assigns layer group ids to layers during the translation process, but if the assignment fails to distribute the layers evenly, we use explicit grouping via the configuration file or the main cpp file. Here, for each frame, the first few layers (preferably half of them) are grouped to be executed on EVE0 and the remaining half are grouped to run on EVE1; similarly for the other frame on EVE2 and EVE3. There are 4 EOs (4 EVEs and 0 DSPs) and 2 EOPs (each EOP contains a pair of EVEs). We process 1 frame per EOP, so 2 frames at a time. Good performance is expected due to the distribution of load between the EVEs and the use of double buffering.
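The explicit half-and-half grouping described above can be sketched as follows; the {layer_index: group_id} shape mirrors the layerIndex2LayerGroupId table accepted by TIDL configuration files, though the even split itself is a heuristic, not a tuned result:

```python
# Split a 19-layer network into two layer groups, one per EVE of an EOP.

def split_layer_groups(num_layers, first_group=1, second_group=2):
    half = num_layers // 2
    return {layer: (first_group if layer < half else second_group)
            for layer in range(num_layers)}

mapping = split_layer_groups(19)
print(sum(1 for g in mapping.values() if g == 1))  # 9 layers on EVE0
print(sum(1 for g in mapping.values() if g == 2))  # 10 layers on EVE1
```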

Source Code: approach_2

Timeline

April 27 Pre-Work Community Bonding Period and discussion on the project and resources available.
May 25 Milestone #1
  • Introductory YouTube video
  • Involve with community and mentors and discuss the execution of the project if needed
  • Collect literature related to the TIDL API
June 1 Milestone #2
  • Setup working environment for development like TIDL SDK for the AM57x processor and IDE
  • Run TIDL demos provided in the API on the actual hardware (the BeagleBone AI) to validate the setup
June 19 Milestone #3
  • I have my exams around the first week of June (no exact dates are provided yet).
  • Port TensorFlow or Caffe versions of YOLO v2-tiny models using TIDL import feature
  • Test the converted model (.bin files) using simple inference on single frame
  • Demonstrate improved performance by enabling sparse convolution through the configuration file
  • Repeat the process with fixed/dynamic quantization using the configuration file
  • Submit report for Phase 1 Evaluation
June 22 Milestone #4
  • Get reviews from mentors and discuss modifications for the project plan with the mentors
  • Finish any backlogs
  • Obtain equivalent model(s) using the model visualiser to cross check merged layers and better network configuration
June 29 Milestone #5 Begin with approach 1
  • Obtain image data in RAW format for test
  • Pre-process image data if using OpenCV tools in application
  • Create source file (cpp) for using all the 4 EVEs + 2 DSPs individually
  • Create an executable for the source files (if needed)
  • Compare with results without any accelerator and document
July 6 Milestone #6 Begin with approach 2
  • Create source file for using all the 4 EVEs through double buffering
  • Add layer grouping and determine optimum way of distributing the layers among the cores (EOs) heuristically
  • Create a configuration file using the optimum layer distribution determined in the last step
  • Run test on pre-processed image data executable
July 13 Milestone #7
  • Run approach 1 and approach 2 tests on video data.
  • Process frames using OpenCV or directly feed pre-processed data
  • Modify the source files accordingly
  • Gather performance results and find (improved) FPS
July 17-20 Milestone #8
  • Submit second evaluation report
  • Discuss possible improvements with mentors
August 3 Milestone #9
  • Complete any backlog
  • Completion YouTube video
  • Detailed project tutorial
August 10 - 17 Final week
  • Get the final report reviewed by mentors and incorporate the advised changes
  • Submit final report

Experience and approach

In 5-15 sentences, convince us you will be able to successfully complete your project in the timeline you have described

I have been familiar with 32-bit microcontrollers from Texas Instruments since the sophomore year of my Electronics and Communication Engineering degree. I was a quarter-finalist in the India Innovation Challenge and Design Contest (IICDC-2018), during which our team was provided with Texas Instruments resources such as the CC26X2R1 and the TIVA Launchpad (EK-TM4C123GXL) EVM. I have been studying Machine Learning for about a year now, mostly using TensorFlow-Keras as the primary API, and I have participated in some ML competitions for better exposure to practical problems. In my current semester I have Neural Networks as a credit subject, although I am already working on the topic in relation to on-device learning for low-computation-capable devices. I have implemented some simple neural networks in C, which can be found in my GitHub account. I have also studied digital signal processing as a credit subject, which strengthens my understanding of the convolutional neural networks used in this project. Regarding languages, I have good experience with C and C++, the primary languages needed for this project; I have used C++ for several coding competitions held by Google.

The feasibility of the project is supported by several factors. The project is similar to one of the demos provided with the TIDL API: the Single Shot multi-box Detector (SSD) demo uses a similar approach, with the following differences:

TIDL SSD demo              | YOLO v2-tiny example
Input size (768 x 320)     | Input size (416 x 416)
43 layers                  | 19 layers
Up to 20 classes           | Up to 80 classes
Uses Caffe-Jacinto model   | Uses Darknet-19 based model

The output of both examples is similar to SSD. The performance of the TIDL demo can be used to approximate the performance of the YOLO v2-tiny model. With load distribution between the EVEs and C66x DSPs, the frame processing time is around 170 ms, which fulfils our target of running an inference in < 1 s. The YOLO v2-tiny model was selected after cross-checking the supported neural network layers. The effect of the on-board accelerators has been validated by articles and technical papers from Texas Instruments. Here are the findings of Pekka Varis, who ran image segmentation on the Sitara processor that we aim to use (the AM5729); he observed a 30-40% decrease in latency per frame when using the AM5729 compared to the AM5749, and an improved rate of 45 fps. The use of dense and sparse models further enhances performance dramatically.
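The throughput claim can be sanity-checked with simple arithmetic, assuming the ~170 ms per-frame figure and ignoring the few milliseconds of API overhead mentioned earlier:

```python
def fps(latency_ms, frames_in_flight=1):
    """Effective frame rate with several frames processed concurrently."""
    return frames_in_flight * 1000.0 / latency_ms

print(round(fps(170), 1))     # ~5.9 FPS, one frame at a time (well under 1 s)
print(round(fps(170, 4), 1))  # ~23.5 FPS with four frames in flight
```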

Contingency

What will you do if you get stuck on your project and your mentor isn't around?

The set of resources and go-to places I would use is as follows:

  • The TIDL documentation and the official guide for the processor SDK are the first go-to resources, along with the TIDL programming model for programming reference, syntax, and debugging issues
  • AM572X training series by Texas Instruments will be quite helpful in the beginning
  • TIDL API repository at git.ti
  • E2E forum for any unknown error/issue; I already have some experience interacting on the forum.
  • Embedded Linux classes by Mark A. Yoder on eLinux. He even has detailed content on the X15 board.
  • Refer to the communities if the problem is related to BeagleBone boards. I have observed the open-source community (including the Google group) to be quite active.
  • In case the issue is related to the models, I will look for documentation of the related framework and its GitHub issues section if related to the models. The YOLO models have been around for a while now, so the support is quite good by now.

Benefit

If successfully completed, what will its impact be on the BeagleBoard.org community? Include quotes from BeagleBoard.org community members

With the completion of this project, the documentation will act as a beginner's guide to bringing AI to the BeagleBone AI and (with some modifications) possibly to other BeagleBone boards too. For developers, the performance results will help in benchmarking similar edge devices. As TIDL is still quite young, the observations born out of this project will help collect issues and validations. Also, as the niche field of Edge AI is booming, future GSoC projects can use the outcomes of this project to carry the work forward in different directions.

  With less than 2 percentage point accuracy compromise, sparsification and TI’s EVE-optimized deep-learning network model JacintoNet11, it is possible to improve the inference latency even further. : Pekka Varis in his blog
One thing to note, if you are not using the TIDL API for your Vision AI apps, such as by porting over Raspberry Pi OpenCV code, then you are not using the accelerated TIDL hardware.  You're not gaining a thing. : sjmill01 in his article on element14

Misc

Completed the requirements listed on the ideas page. Link to pull request: #135.

Suggestions

Is there anything else we should have asked you?