BeagleBoard/GSoC/2022 Proposal/Running Machine Learning Models on Bela

=Running Machine Learning Models on Bela=
About
 * Student: Ezra Pierce
 * Mentors: Jack Armitage, Victor Shepardson
 * Proposal:

=Proposal=
All requirements listed on the ideas page have been completed. The PR for the cross-compilation task can be found here.

=Status=
This project is currently just a proposal.

About you
 * Github: ezrapierce000
 * School: Carleton University
 * Country: Canada
 * Primary language: English
 * Typical work hours: 9AM-6PM Eastern Standard Time
 * Previous GSoC participation: This would be my first time participating in GSoC.

About your project
Project name: Running Machine Learning Models on Bela

Introduction
The goal of this project is to improve the tooling surrounding embedded machine learning on the BeagleBone Black (BBB)/Bela, to help its community experiment with machine learning applications in their projects. The specific developer tools chosen for this project are an inference benchmarking tool and a perf-based profiler, both developed for the BBB/Bela platform.

Bela is a platform built upon the BeagleBone Black. It consists of an audio cape and a custom real-time Linux image using Xenomai. This platform provides a low-latency computing environment ideal for use in audio applications.

The use of machine learning in instrument design has grown in recent years, yet there have not been many implementations in more resource-constrained embedded contexts like the BBB/Bela. The tools developed during this project aim to improve the workflow of those interested in exploring embedded machine learning and to reduce some of the barriers that come with it. The benchmarking tool will take latency, memory and accuracy measurements, and is meant to be used when comparing different ML runtime components and/or compilers. The profiler will be used to pinpoint bottlenecks during model development, allowing developers to discover slow operators and view CPU utilization. This project will also build up a model zoo for the BBB/Bela and build some example projects.

ML Stack
This project will focus on a specific modeling language (PyTorch) and platform (BBB+Bela). In between, there are a number of potential model formats, compiler/runtime frontends, and backend components. The analysis tools built during this project will aim to support multiple runtime frontends and backends so that developers can compare performance results between them.

Summary of stack:


 * Modeling language: PyTorch (+TensorFlow for converting to TFLite)
 * Model format: ONNX, TorchScript (+TFLite)
 * Runtime frontends: libtorch, ONNX Runtime, SOFIE (+TFLite)
 * Runtime backend components: ArmNN, XNNPACK, Eigen, BLAS
 * OS + Hardware: Bela + BBB

Some NN compiler projects will also be audited for potential BBB support:


 * torch-MLIR (https://github.com/llvm/torch-mlir)
 * PlaidML (https://plaidml.github.io/plaidml/)
 * Glow (https://github.com/pytorch/glow)
 * NNC (https://dev-discuss.pytorch.org/t/nnc-walkthrough-how-pytorch-ops-get-fused/125)
 * Apache TVM (https://tvm.apache.org/)
 * IREE (https://google.github.io/iree/)

Benchmarking Tool
This project will provide both a benchmarking tool and a profiling tool to be used to evaluate machine learning models on the BBB/Bela. The benchmarking tool will provide the following measurements:
 * Average latency
 * Maximum latency
 * Average memory usage
 * Maximum memory usage
 * Accuracy

These measurements will be taken through a common frontend that wraps the pre-existing runtime frontends listed above, allowing developers to choose which runtime components they would like to test. The common frontend will measure latency at each inference, while a separate thread concurrently samples memory usage to produce the average and maximum memory measurements. The benchmarking tool should also allow developers to provide test data for accuracy measurements.
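The measurement logic described in this section can be illustrated with a minimal sketch. The on-board tool is proposed in C++; Python is used here only to show the idea. All names (`benchmark`, `MemorySampler`, `read_rss_bytes`) are hypothetical and not part of any existing Bela tool, `infer` stands in for whichever runtime frontend is under test, and the default memory reader assumes a Linux `/proc` filesystem.

```python
import os
import threading
import time
from statistics import mean


def read_rss_bytes():
    """Resident set size of this process, read from /proc (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


class MemorySampler:
    """Polls a memory reader from a background thread until stopped."""

    def __init__(self, read_memory=read_rss_bytes, interval_s=0.001):
        self._read = read_memory
        self._interval = interval_s
        self._stop = threading.Event()
        self.samples = []
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self._read())
            self._stop.wait(self._interval)

    def __enter__(self):
        self.samples.append(self._read())  # guarantee at least one sample
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


def benchmark(infer, inputs, targets=None, read_memory=read_rss_bytes):
    """Call `infer` once per input (inputs must be non-empty) and collect stats."""
    latencies, correct = [], 0
    with MemorySampler(read_memory) as sampler:
        for i, x in enumerate(inputs):
            start = time.perf_counter()
            y = infer(x)
            latencies.append(time.perf_counter() - start)
            if targets is not None and y == targets[i]:
                correct += 1
    return {
        "avg_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
        "avg_memory_bytes": mean(sampler.samples),
        "max_memory_bytes": max(sampler.samples),
        "accuracy": correct / len(latencies) if targets is not None else None,
    }
```

For example, `benchmark(model_fn, test_inputs, targets=test_labels)` would return all five measurements in one dictionary, where `model_fn`, `test_inputs` and `test_labels` are placeholders supplied by the developer. Passing `infer` as a plain callable is one possible way to keep the measurement code independent of the runtime being compared.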

The benchmarking tool on the BBB/Bela will be written in C++ with a simple Python tool on the host PC for communication between the developer's PC and the BBB/Bela.



Profiling Tool
The profiling tool will aim to provide a graphical interface that displays CPU cycles per function call, thread utilization and the call stack. This tool will be built around the Linux perf utility, a statistical profiler based on CPU performance counters. To provide a more intuitive interface, this project will build a simple local webserver (similar to the Bela IDE, or perhaps integrated into it) that displays the captured data in visual form. This will be done using the pprof profiling visualizer and the perf_data_converter tool. As an alternative, the perf-based hotspot tool will also be evaluated for use in this project. The profiler will have to run in a Linux thread rather than a real-time Xenomai thread, but the results should still be applicable to model optimization work. Optimizations can then be tested with the benchmarking tool in a real-time Xenomai thread.

Example flamegraph from the hotspot profiling tool.
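To make the data behind a flamegraph concrete, the sketch below aggregates call-stack samples, which is what a statistical profiler such as perf collects, into the "folded stack" text form commonly consumed by flamegraph renderers. It is an illustration only: the function names in the example are invented, the samples are assumed to be already parsed into lists of frames ordered from root to leaf, and hotspot and pprof do this aggregation internally from the recorded data.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse call-stack samples into 'root;...;leaf' -> sample count."""
    folded = Counter()
    for frames in samples:
        if frames:  # skip samples with no resolved frames
            folded[";".join(frames)] += 1
    return folded


def self_time_by_function(samples):
    """Count how often each function was the leaf (on-CPU) frame."""
    return Counter(frames[-1] for frames in samples if frames)


def format_folded(folded):
    """Render one 'stack count' line per unique stack, sorted for stable output."""
    return "\n".join(f"{stack} {count}" for stack, count in sorted(folded.items()))
```

With hypothetical samples such as `[["render", "infer", "conv1d"], ["render", "infer", "gru"]]`, the width of each flamegraph box corresponds to the number of samples whose stack passes through that frame, and `self_time_by_function` gives a rough ranking of which operators were on the CPU most often.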

Model Selection
These tools will streamline model evaluation on the BBB/Bela platform. During the project, some models will be built and tested with these performance tools to evaluate their feasibility for the targeted use cases, such as gesture recognition, audio synthesis and control mapping.
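One way to turn a latency measurement into a feasibility check is to compare it against the time available per audio block. The sketch below assumes that inference runs once per block and must finish before the next block is due; the sample rate, block size and the 80% CPU budget used in the example are illustrative values, not figures from this proposal.

```python
def block_deadline_s(block_size_frames, sample_rate_hz):
    """Time available to process one audio block, in seconds."""
    return block_size_frames / sample_rate_hz


def fits_real_time(max_latency_s, block_size_frames, sample_rate_hz, cpu_budget=0.8):
    """True if worst-case inference fits within a fraction of one block period."""
    deadline = block_deadline_s(block_size_frames, sample_rate_hz)
    return max_latency_s <= cpu_budget * deadline


def min_block_size(max_latency_s, sample_rate_hz, cpu_budget=0.8):
    """Smallest power-of-two block size whose budget covers the worst-case latency."""
    size = 1
    while not fits_real_time(max_latency_s, size, sample_rate_hz, cpu_budget):
        size *= 2
    return size
```

For instance, with an assumed 44100 Hz sample rate and 16-frame blocks, the block period is about 0.36 ms, so a model with a 1 ms worst-case latency would not fit, and `min_block_size(0.001, 44100)` suggests a block of 64 frames instead. The check uses maximum rather than average latency because a single missed deadline is enough to cause an audible dropout.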

Proposed model architectures to be evaluated:


 * 1D convolutional network (for audio or sensor streams)
 * memory-cell RNN (e.g. GRU) (for audio or sensor streams)
 * MLP (general purpose)
 * Transformer block (general purpose)
 * mixture-density head (with many elementwise, shape and reduction ops; +with RNG in graph)

Example Projects
As an additional goal, if time permits, this project will also develop some example projects in Python and for Bela.

Experience and approach
Through coursework and multiple co-op terms in industry, I've gained experience relevant to this project such as:
 * Benchmarked hardware peripherals on an embedded Linux system (RPi CM4) and a TI C2000 platform for high-speed binary data transfer
 * Developed features and fixed bugs in C for an embedded Linux TCP server used for sensor data acquisition
 * Built multiple Python testing systems for various software systems and hardware calibration protocols
 * Completed labs in an Intro to Machine Learning course using Keras to build, train and test models
 * Designed and implemented an audio plugin in C++ for translating audio data into haptic signals in real time
 * Designed and implemented firmware for the Pi Pico in C++ and CircuitPython to interface with different connected modules using I2S, SPI & PWM peripherals

Contingency
If I run into any roadblocks during this project, I'll first talk with my mentors to brainstorm potential solutions. If they are not available, I'll reach out to the community for help on platforms such as the BBB Slack chat, the Bela forum, the iil.is Discord or the PyTorch forum.

While writing this proposal I have also amassed some resources that may be useful during the project:
 * MLPerf™ Tiny Deep Learning Benchmarks for Embedded Devices
 * Installing C++ Distributions of PyTorch
 * Xenomai docs
 * EdgeAI TIDL tools and examples
 * hotspot and heaptrack tools
 * C++ Real-Time Audio Programming with Bela
 * DeepLearningForBela
 * "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"
 * Various papers on machine learning in musical instrument design

Benefit
This project will provide multiple benefits. Firstly, it will give some baseline benchmark measurements for various machine learning model architectures on the BBB/Bela, which will help developers decide which models could be worth investigating for their use cases. Secondly, it will provide tools for developers to benchmark and profile new models on their BBB/Bela. Thirdly, it will provide some example projects for those looking to get started using machine learning in their Bela projects.