
=Running Machine Learning Models on Bela=
Student: Ezra Pierce
Mentors: Jack Armitage, Victor Shepardson

Proposal
All requirements listed on the ideas page have been completed; the PR for the cross-compilation task can be found here.

Status
This project is currently just a proposal.

About you
 * GitHub: ezrapierce000
 * School: Carleton University
 * Country: Canada
 * Primary language: English
 * Typical work hours: 9AM-6PM Eastern Standard Time
 * Previous GSoC participation: This would be my first time participating in GSoC.

About your project
Project name: Running Machine Learning Models on Bela

Introduction
The goal of this project is to improve the tooling surrounding embedded machine learning on the BeagleBone Black (BBB)/Bela to aid its community in experimenting with machine learning applications for their projects. The developer tools chosen for this project are an inference benchmarking tool and a perf-based profiler, both developed for the BBB/Bela platform.

Bela is a platform built upon the BeagleBone Black, consisting of an audio cape and a custom real-time Linux image using the Xenomai framework. This platform provides a low-latency computing environment ideal for use in audio applications. There already exists a large community surrounding the Bela, as it is an increasingly popular platform for use in educational settings as well as musical instrument design and maker communities. This project aims to extend the Bela platform to include tools and documentation for machine learning projects, with the goal of simplifying the process of integrating machine learning models into real-time embedded Bela projects. As the Bela platform has been adopted by a wide range of users, from artists to engineers, this project will aim to provide tooling that caters to this broad userbase.

The use of machine learning in instrument design has grown in recent years, yet there have not been many implementations in more resource-constrained embedded contexts like the BBB/Bela. This can be attributed to the fact that machine learning can be very computationally expensive, with many typical applications requiring GPUs, TPUs or other custom hardware accelerators. However, with the growing industry interest in edge computing, an increasing number of projects aim to optimize the whole machine learning pipeline for embedded devices, such as TinyML, TFLite and many others. This project aims to leverage tools such as these to give Bela users the ability to deploy ML models to their devices.

One of the current challenges in doing so is the real-time nature of audio projects on the Bela, which is a key factor when developing instruments or interactive sensor systems. This imposes a latency requirement on any models being run: inference has to finish within the time available for processing audio. This strict latency requirement implies the need for performance analysis tools that can evaluate and measure ML models, providing feedback to the user on the runtime costs incurred by their models. Thus, this project's main focus will be the development of performance analysis tools for running machine learning models on the Bela. This will come in the form of a benchmarking tool and a profiling tool. The benchmarking tool will take latency, memory and accuracy measurements, and is meant to be used when comparing different ML runtime components, model architectures and/or compilers. The profiler will be used to pinpoint bottlenecks during model development, allowing developers to discover slow operators and view CPU utilization. In addition to these tools, this project will also aim to build some example projects that document the setup of ML projects on the BBB/Bela, providing a starting point for users looking to explore this space. During development of both tools, a focus will be put on keeping all code portable to allow for use on future BeagleBoard/Bela platforms.
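To make the latency requirement concrete, the time available per audio block is the block size divided by the sample rate. The sketch below is only an illustration: the function names, the 16-frame block, the 44.1 kHz sample rate and the 80% headroom factor are all assumptions chosen for the example, not values fixed by this proposal.

```python
def block_deadline_ms(block_size: int, sample_rate_hz: float) -> float:
    """Time available to process one audio block, in milliseconds."""
    if block_size <= 0 or sample_rate_hz <= 0:
        raise ValueError("block_size and sample_rate_hz must be positive")
    return 1000.0 * block_size / sample_rate_hz


def fits_budget(latency_ms: float, block_size: int, sample_rate_hz: float,
                headroom: float = 0.8) -> bool:
    """True if a measured inference latency fits in a fraction of the block time.

    `headroom` (an assumed value) reserves part of each block for the rest
    of the audio processing.
    """
    return latency_ms <= headroom * block_deadline_ms(block_size, sample_rate_hz)


if __name__ == "__main__":
    # Illustrative settings: 16 frames per block at 44.1 kHz.
    print(f"{block_deadline_ms(16, 44100):.4f} ms per block")
    print(fits_budget(0.2, 16, 44100))
```

A model that does not need to run on every block has a proportionally larger budget, which is one reason to measure latency on the target rather than estimate it.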

ML Stack
This project will focus on a specific modeling language (PyTorch) and platform (BBB+Bela). In between there are a number of potential model formats, compiler/runtime frontends, and backend components. The analysis tools built during this project will aim to support multiple runtime frontends and backends to allow developers to compare performance results between them.

Summary of stack:


 * Modeling language: PyTorch (+ TensorFlow for converting to TFLite)
 * Model format: ONNX, TorchScript (+ TFLite)
 * Runtime frontends: libtorch, ONNX Runtime, SOFIE (+ TFLite)
 * Runtime backend components: ArmNN, XNNPACK, Eigen, BLAS
 * OS + Hardware: Bela + BBB

Some NN compiler projects will also be audited for potential BBB support:


 * torch-MLIR (https://github.com/llvm/torch-mlir)
 * PlaidML (https://plaidml.github.io/plaidml/)
 * Glow (https://github.com/pytorch/glow)
 * NNC (https://dev-discuss.pytorch.org/t/nnc-walkthrough-how-pytorch-ops-get-fused/125)
 * Apache TVM (https://tvm.apache.org/)
 * IREE (https://google.github.io/iree/)

Benchmarking Tool
This project will provide both a benchmarking tool and a profiling tool to be used to evaluate machine learning models on the BBB/Bela. The benchmarking tool will provide the following measurements:
 * Average latency
 * Maximum latency
 * Latency jitter
 * Average memory usage
 * Maximum memory usage
 * Accuracy

This will be done by providing a common frontend for the pre-existing runtime frontends listed above, allowing developers to choose which runtime components they would like to test. This common frontend will take a latency measurement at each inference, while the benchmarking tool concurrently samples memory usage from a separate thread to produce average and maximum memory measurements. The benchmarking tool should also allow developers to provide test data for accuracy measurements. The tool will also facilitate loading a model from the developer's host PC onto the BBB/Bela, supporting both TorchScript and ONNX models.
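The latency measurements listed above could be gathered and summarized roughly as follows. This is a minimal sketch rather than the planned implementation (the on-device tool is intended to be written in C++; Python is used here only for brevity). The names `measure_latencies_ms` and `summarize_latencies` are hypothetical, `infer` stands in for a single call into whichever runtime is under test, and defining jitter as the population standard deviation of per-inference latency is an assumption, since the proposal does not fix a definition.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies_ms(infer: Callable[[], object], runs: int,
                         warmup: int = 10) -> List[float]:
    """Time `runs` calls to `infer`, after `warmup` untimed calls."""
    for _ in range(warmup):
        infer()  # warm caches and lazy initialisation before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        infer()
        samples.append((time.perf_counter_ns() - start) / 1e6)
    return samples


def summarize_latencies(samples_ms: List[float]) -> Dict[str, float]:
    """Average, maximum and jitter (assumed: population std dev) in ms."""
    if not samples_ms:
        raise ValueError("no latency samples")
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "max_ms": max(samples_ms),
        "jitter_ms": statistics.pstdev(samples_ms),
    }


if __name__ == "__main__":
    # Placeholder workload standing in for one model inference.
    samples = measure_latencies_ms(lambda: sum(range(1000)), runs=100)
    print(summarize_latencies(samples))
```

For real-time audio the maximum is often the most important of the three figures, since a single late inference is enough to cause an audible dropout.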

The benchmarking tool on the BBB/Bela will be written in C++ with a simple Python tool on the host PC for model loading and communication between the developer's PC and the BBB/Bela.
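The concurrent memory sampling described for the benchmarking tool could follow the pattern below: a background thread polls a memory reading at a fixed interval while inference runs on the main thread. This is a sketch under stated assumptions, again in Python for brevity rather than the planned C++. `MemorySampler` and `read_rss_kib` are hypothetical names, the 10 ms interval is arbitrary, and reading `VmRSS` from `/proc/self/status` is one Linux-specific way to obtain resident memory, not necessarily the one the tool will use.

```python
import threading
from typing import Callable, List, Optional


def read_rss_kib() -> int:
    """Resident set size of this process in KiB (Linux only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")


class MemorySampler:
    """Polls `read_memory` from a background thread until stopped."""

    def __init__(self, read_memory: Callable[[], int] = read_rss_kib,
                 interval_s: float = 0.01) -> None:
        self._read = read_memory
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.samples: List[int] = []

    def _run(self) -> None:
        while True:
            self.samples.append(self._read())
            # wait() returns True as soon as stop() sets the event
            if self._stop.wait(self._interval):
                break

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def average(self) -> float:
        return sum(self.samples) / len(self.samples)

    def maximum(self) -> int:
        return max(self.samples)
```

A typical use would be to call `start()`, run the timed inferences, call `stop()`, and then report `average()` and `maximum()`. Because the loop records a reading before checking the stop event, at least one sample is always available after `stop()`.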



Profiling Tool
The profiling tool will aim to provide a graphical interface displaying CPU cycles per function call, thread utilization and the call stack. This tool will be built around the Linux perf utility, a statistical profiler based on CPU performance counters. To provide a more intuitive interface, this project will build a simple local webserver (similar to the Bela IDE, or perhaps integrated into it) that will display the captured data in visual form. This will be done using the pprof profiling visualizer and the perf_data_converter tool. As an alternative, the perf-based hotspot tool will also be evaluated for use in this project. The profiler will have to be run in a Linux thread as opposed to a real-time Xenomai thread, but the results should still be applicable for supporting model optimization work. Optimizations can then be tested with the benchmarking tool in a real-time Xenomai thread.
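As an illustration of the kind of summary a sampling profiler makes possible, the sketch below ranks functions by the share of samples in which they were the innermost frame. It assumes the perf samples have already been collapsed into "folded" stack lines of the form `outer;inner;leaf count`, the text format consumed by flame graph scripts; pprof and hotspot use their own formats, so this is not the pipeline proposed above. The function names in the sample data are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def self_samples(folded_lines: Iterable[str]) -> Counter:
    """Count samples per innermost (leaf) frame of each folded stack line."""
    counts: Counter = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is the last space-separated field.
        stack, _, count = line.rpartition(" ")
        leaf = stack.split(";")[-1]
        counts[leaf] += int(count)
    return counts


def hottest(folded_lines: Iterable[str], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` leaf functions with their fraction of all samples."""
    counts = self_samples(folded_lines)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, n / total) for name, n in counts.most_common(top)]


if __name__ == "__main__":
    # Invented sample data for illustration only.
    folded = [
        "main;render;linear;gemv 60",
        "main;render;linear;gemv 10",
        "main;render;tanh 20",
        "main;render 10",
    ]
    for name, share in hottest(folded):
        print(f"{name}: {share:.0%}")
```

Because perf is a statistical profiler, these shares are estimates whose precision improves with the number of samples collected.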

Example flamegraph from the hotspot profiling tool.

Model Selection
Much work on embedded ML focuses on a narrow application domain (such as image classification) and a small set of canonical model architectures. But Digital Musical Instrument (DMI) design demands creative ways of processing data of various shapes and modalities, from high sample rate audio streams to heterogeneous sensor inputs. Rather than optimize for a particular method or use-case, this project will aim for wide coverage of PyTorch operators and fine-grained visibility into their performance. To that end, evaluation will focus on composable neural network blocks which are relevant to ML tasks like sequence modeling, classification and variational autoencoding (the building-blocks of musical applications like gesture recognition, audio synthesis and control mapping). These will include:


 * matrix-vector product (at various sizes, with quantization, sparsity)
 * 1D convolutional networks (with groups and dilation)
 * memory-cell RNNs (LSTM, GRU)
 * multi-layer perceptrons (with various activation functions)
 * transformer blocks (dynamic input sizes, batch normalization)
 * mixture-density heads (testing various elementwise, shape and reduction ops)
 * reparameterized sampling and KL divergence for normal distributions
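For readers unfamiliar with the last item in the list, the scalar sketch below shows the two operations involved: drawing z = mu + sigma * eps with eps taken from a standard normal, and the closed-form KL divergence from N(mu, sigma^2) to N(0, 1). It uses only the Python standard library to show the arithmetic; the benchmarked versions would be PyTorch tensor operations, and the function names and the log-variance parameterization are assumptions made for this example.

```python
import math
import random


def reparameterized_sample(mu: float, log_var: float, rng=random) -> float:
    """Draw z = mu + sigma * eps, with eps ~ N(0, 1) and sigma = exp(log_var / 2)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu: float, log_var: float) -> float:
    """KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * (mu * mu + math.exp(log_var) - log_var - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for repeatable draws
    print(reparameterized_sample(0.5, -1.0, rng))
    print(kl_to_standard_normal(0.5, -1.0))
```

In terms of operator coverage, this exercises elementwise exponentials, multiplies and adds plus random number generation, and in a batched tensor version a reduction over the latent dimensions.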

Example Projects
As an additional goal, if time permits, this project will also develop some example projects in Python and for Bela. These projects will serve as a learning tool for people looking to explore embedded ML on the BBB/Bela. They will cover the installation, configuration and use of the relevant tools as well as provide example code for building, training and running ML models on the BBB/Bela. They will also of course provide documentation for the use of the tools developed during this project. Based on the preliminary results recorded during this project, more specific example projects could be developed to target some potential use cases such as gesture recognition and mapping, neural audio synthesis or dimensionality reduction of incoming sensor data. These more targeted example projects would be valuable in providing concrete, applicable examples for the Bela community, inspiring new ideas and further development.

The development of example projects will also allow for the refinement of the performance analysis tools as it may inform the development of new features when used in real-world practice.

Experience and approach
Through coursework and multiple co-op terms in industry, I've gained experience relevant to this project such as:
 * Benchmarked hardware peripherals on an embedded Linux system (RPi CM4) and TI C2000 platform for high-speed binary data transfer
 * Developed features and fixed bugs in C for embedded Linux TCP server used for sensor data acquisition
 * Built multiple Python testing systems for various software systems and hardware calibration protocols
 * Completed labs in an Intro to Machine Learning course using Keras to build, train and test models
 * Designed and implemented an audio plugin in C++ for translating audio data into haptic signals in real-time
 * Designed and implemented firmware for the Pi Pico in C++ and CircuitPython to interface with different connected modules using I2S, SPI & PWM peripherals

I believe the skills outlined above give me a strong technical base to draw from, yet I am sure to come across new challenges and gaps in my knowledge during this project. To help navigate this, I will hold weekly meetings with my mentors to discuss progress on the weekly milestones as well as new ideas or roadblocks. Both mentors have significant domain expertise in relevant areas such as Bela development, ML research and instrument design, making them a good source of advice and support during this project.

Contingency
If more support is needed during this project, I'll reach out for help to the communities involved, such as the BBB Slack chat, the Bela forum, the iil.is Discord or the PyTorch forum.

While writing this proposal I have also amassed some resources that may be useful during the project:
 * MLPerf™ Tiny Deep Learning Benchmarks for Embedded Devices
 * Installing C++ Distributions of PyTorch
 * Action-Sound Latency: Are Our Tools Fast Enough?
 * Xenomai docs
 * EdgeAI TIDL tools and examples
 * hotspot and heaptrack tools
 * C++ Real-Time Audio Programming with Bela
 * DeepLearningForBela
 * "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"
 * Various papers on machine learning in musical instrument design

Benefit
This project will provide multiple benefits. Firstly, it will improve the development ecosystem surrounding the BBB/Bela by providing a new tool to measure the performance of different models. This will help those researching ML for use in embedded musical instrument design speed up their iteration cycle by providing measurements directly from the target hardware. In tandem with this benefit, the project will also give researchers and developers the ability to dive deeper into the details of their implementation and examine potential bottlenecks down to the level of CPU cycles per function. This will greatly improve the understanding of what types of model architectures could be possible on this platform, help make the most of the computational resources available on the BBB and motivate future optimization work. Finally, this project will improve access to embedded ML on the BBB/Bela and potential future platforms like the BBAI. This will benefit instrument designers, artists and makers by providing them with example projects and documented tools, enabling new explorations of the applications of embedded machine learning.

