Jetson/Computer Vision Performance


Hardware Acceleration of OpenCV

OpenCV is the de facto standard computer vision library, containing more than 2500 computer vision, image processing and machine learning algorithms. See Installing OpenCV on Jetson TK1 if you haven't done so yet. NVIDIA has significantly optimized OpenCV in two ways:

  1. OpenCV4Tegra: A free library provided by NVIDIA containing optimizations for NVIDIA's Tegra CPUs (ARM NEON SIMD optimizations, multi-core CPU optimizations and some GLSL GPU optimizations). OpenCV4Tegra is a closed-source binary replacement for the public OpenCV, so the programmer just writes regular OpenCV code, which automatically takes advantage of the OpenCV4Tegra optimizations without the developer or user necessarily knowing about it. It has been supported on Android since Tegra 2 and is also supported on Linux4Tegra, Vibrante, etc. See OpenCV4Tegra Documentation for more info. It typically provides a 2x to 5x speedup on Tegra 3, Tegra 4 or Tegra K1 compared to regular OpenCV. The exact list of functions optimized by OpenCV4Tegra is given in the documentation (currently here).
  2. OpenCV 'gpu' module: The 'gpu' module in the public OpenCV library is designed purely for CUDA GPGPU acceleration with NVIDIA's mobile and desktop GPUs. The developer must make minor changes to their code to call functions from the 'gpu' module explicitly before that code can take advantage of the GPU. This lets the developer control GPU memory allocations, choose when data is transferred between CPU and GPU, choose which functions run on the GPU vs the CPU, and control streaming or multi-GPU behaviour. It has been an important part of OpenCV on desktop since 2010/2011, and is supported by most NVIDIA GPUs available today. Some functions (such as Haar Cascade Classifiers) are not as well suited to GPUs, so they get only minor speedups or have no GPU version, while other functions (such as LBP Cascade Classifiers, HOG, stereo vision, warping, etc) are much better suited to GPUs and can get 5x to 20x speedups on Tegra K1 compared to regular OpenCV.

Presentation videos about the OpenCV4Tegra module

A free online webinar (on NVIDIA's GTC Express page), presented by the OpenCV4Tegra development team, introduces the OpenCV4Tegra module:

  1. Introduction to OpenCV for Tegra (March 2013) describes OpenCV4Tegra including the installation steps for Android. Video and Slides.

Presentation videos about the OpenCV 'gpu' module

Two free online webinars (on NVIDIA's GTC Express page), presented by the OpenCV development team, introduce OpenCV's GPU module:

  1. OpenCV - Accelerated Computer Vision using GPUs (June 2013) gives a non-technical overview of OpenCV and the GPU module, showing what is available and why you would want to use it. Video and Slides.
  2. Getting Started with GPU-accelerated Computer Vision using OpenCV and CUDA (July 2013) is more technical: it shows how to install OpenCV's GPU module, explains the memory model of the GPU module, and shows how to combine OpenCV's GPU module with your own custom CUDA kernels. Video and Slides.

Power draw during computer vision tasks

The page Typical power draw of Jetson TK1 shows that the total power draw for Jetson TK1 is around 1.6W when idle (including the 0.4W fan) and is typically under 4W even in moderate use. However, computer vision tasks can often push hardware to its limits, so this section gives detailed power measurements for various computer vision programs.

This table covers various OpenCV sample CPU programs (in the "opencv-2.4.9/samples/cpp" folder), OpenCV sample GPU programs (in the "opencv-2.4.9/samples/gpu" folder), some VisionWorks sample CPU+GPU programs (freely available from NVIDIA), and some CUDA sample computer vision programs (in the "NVIDIA_CUDA6.0_Samples/3_Imaging" folder).

These are total power measurements for the whole Jetson TK1 board when running from 12V with the default fan attached (using 0.4W), without any customizations or power savings applied. The OpenCV samples were executed remotely over ethernet, and the Jetson TK1 was using 1.6W in between each of the OpenCV tests. The VisionWorks and CUDA samples required a GPU-accelerated display, so they were executed through the Ubuntu Unity graphical desktop with an HDMI monitor and a USB hub + keyboard & mouse attached; in that setup the Jetson TK1 was using 3.4W in between each of the VisionWorks & CUDA tests.
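
Because the two groups of samples were measured against different idle levels (1.6W and 3.4W), it can help to subtract the relevant idle figure before comparing rows across libraries. The following is a minimal sketch of that arithmetic, not part of the original measurements. It assumes that idle consumption stays constant under load, which is a simplification; the names IDLE_WATTS and net_power are made up for illustration, and the sample rows are copied from the table below.

```python
# Idle board power quoted in the text, in watts.
# Assumption: idle consumption stays constant while a sample is running.
IDLE_WATTS = {
    "OpenCV": 1.6,       # samples run remotely over ethernet
    "VisionWorks": 3.4,  # samples run from the Unity desktop (HDMI + USB attached)
    "CUDA": 3.4,         # same desktop setup as VisionWorks
}


def net_power(library, total_watts):
    """Return the power above idle: total board power minus the idle figure."""
    return round(total_watts - IDLE_WATTS[library], 1)


# (sample, library, total board power in watts), copied from the table.
SAMPLES = [
    ("bgfg_segm (GPU)", "OpenCV", 2.4),
    ("hog (GPU)", "OpenCV", 4.7),
    ("bilateralFilter", "CUDA", 11.4),
    ("SobelFilter", "CUDA", 4.6),
    ("feature_tracker", "VisionWorks", 6.0),
]

for name, library, total in SAMPLES:
    print(f"{name}: {total} W total, {net_power(library, total)} W above idle")
```

Read this way, a 4.7W OpenCV sample and a 6W VisionWorks sample sit at roughly 3.1W and 2.6W above their respective idle levels, so the raw totals alone overstate the difference between the two test setups.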

| Sample code | Library | Processor | Approximate power (Watts) for whole Jetson TK1 board | Performance |
| --- | --- | --- | --- | --- |
| bgfg_segm | OpenCV | CPU | 2.8 | ~7 FPS (MOG2 algorithm) |
| bgfg_segm | OpenCV | GPU | 2.4 | ~34 FPS (MOG2 algorithm) |
| bilateralFilter | CUDA | GPU | 11.4 | ~34 FPS @ 640x480 |
| boxFilter | CUDA | GPU | 7.0 | ~23 FPS @ 1024x1024 |
| brox_optical_flow | OpenCV | GPU | 11.5 | |
| camshiftdemo | OpenCV | CPU | 3.5 | |
| car detector | VisionWorks | GPU | 10 | ~5 FPS @ 720p (performing Soft Cascade Classifier) |
| farneback_optical_flow | OpenCV | CPU | 5.0 | ~0.24 FPS |
| farneback_optical_flow | OpenCV | GPU | 10.8 | ~0.46 FPS |
| feature_tracker | VisionWorks | GPU | 6 | ~40 FPS @ 720p (performing Color Conversion, Gaussian Pyramid, Pyramidal Optical Flow, plus Feature Tracking) |
| hog (people detection) | OpenCV | CPU | 4.6 | ~1.1 FPS |
| hog (people detection) | OpenCV | GPU | 4.7 | ~5.0 FPS |
| hough_lines | VisionWorks | GPU | 7 | ~30 FPS @ 720p (performing Color Conversion, Canny, plus Probabilistic Hough Lines) |
| imageDenoising | CUDA | GPU | 6.0 | ~150 FPS @ 320x408 |
| kalman | OpenCV | CPU | 2.0 | |
| letter_recog | OpenCV | CPU | 4.9 | |
| meanshift_segmentation | OpenCV | CPU | 4.7 | |
| motion_estimation | VisionWorks | GPU | 6 | ~20 FPS @ 720p (performing Color Conversion, plus IME Motion Estimation) |
| object_detector | VisionWorks | GPU | 7 | ~5 FPS @ 720p (performing Color Conversion, plus HOG Pedestrian Detection) |
| object_tracker | VisionWorks | GPU | 6 | ~60 FPS @ 720p (performing Color Conversion, Gaussian Pyramid, Forward Optical Flow, Backward Optical Flow, plus Median Flow) |
| pedestrian detector | VisionWorks | GPU | 10 | ~2 FPS @ 720p (performing Soft Cascade Classifier) |
| peopledetect | OpenCV | CPU | 4.9 | |
| pyrlk_optical_flow | OpenCV | GPU | 11.0 | |
| segment_objects | OpenCV | CPU | 2.5 | |
| SLAM | VisionWorks | GPU | 7 | ~25 FPS @ 480p (performing SLAM) |
| SobelFilter | CUDA | GPU | 4.6 | ~150 FPS @ 512x512 |
| stereo_match | OpenCV | CPU | 2.4 | ~3 FPS (BM algorithm) |
| stereo_match | OpenCV | GPU | 3.4 | ~24 FPS (BM algorithm) |
| videostab | OpenCV | CPU | 4.9 | |
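
Four of the OpenCV samples above were measured on both CPU and GPU, which makes it possible to compare them by speedup and by energy per frame (watts divided by frames per second gives joules per frame). The following is a rough sketch of that calculation using the table's approximate figures. The helper names energy_per_frame and speedup are made up for illustration, and the power values are whole-board totals that include idle consumption.

```python
def energy_per_frame(watts, fps):
    """Joules spent per processed frame: power (J/s) divided by frame rate (frames/s)."""
    return watts / fps


def speedup(cpu_fps, gpu_fps):
    """How many times faster the GPU version runs than the CPU version."""
    return gpu_fps / cpu_fps


# sample: ((cpu_watts, cpu_fps), (gpu_watts, gpu_fps)), copied from the table.
# The FPS values are the approximate ("~") figures, so results are rough.
PAIRS = {
    "bgfg_segm": ((2.8, 7.0), (2.4, 34.0)),
    "farneback_optical_flow": ((5.0, 0.24), (10.8, 0.46)),
    "hog (people detection)": ((4.6, 1.1), (4.7, 5.0)),
    "stereo_match": ((2.4, 3.0), (3.4, 24.0)),
}

for name, ((cpu_w, cpu_fps), (gpu_w, gpu_fps)) in PAIRS.items():
    print(
        f"{name}: {speedup(cpu_fps, gpu_fps):.1f}x faster on GPU, "
        f"{energy_per_frame(cpu_w, cpu_fps):.2f} J/frame on CPU vs "
        f"{energy_per_frame(gpu_w, gpu_fps):.2f} J/frame on GPU"
    )
```

With these figures, stereo_match runs 8x faster on the GPU and drops from about 0.8 to about 0.14 joules per frame, while farneback_optical_flow is the only pair where the GPU version costs more energy per frame (about 23.5 J vs 20.8 J), because its power draw roughly doubles for a speedup of just under 2x.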