NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.

== Introduction ==

[https://developer.nvidia.com/tensorrt TensorRT Download]<br>
[https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html TensorRT Developer Guide]

== FAQ ==

===== <big>1. How to check the TensorRT version?</big> =====
There are two methods to check the TensorRT version:

* Symbols from the library
<pre>
$ nm -D /usr/lib/aarch64-linux-gnu/libnvinfer.so | grep "tensorrt"
0000000007849eb0 B tensorrt_build_svc_tensorrt_20181028_25152976
0000000007849eb4 B tensorrt_version_5_0_3_2
</pre>
NOTE: 20181028 is the build date, 25152976 is the top changelist, and 5_0_3_2 is the version information.<br>

* Macros from the header file
<pre>
$ cat /usr/include/aarch64-linux-gnu/NvInfer.h | grep "define NV_TENSORRT"
#define NV_TENSORRT_MAJOR 5 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 3 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 2 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 5 //!< Shared object library major version number.
#define NV_TENSORRT_SONAME_MINOR 0 //!< Shared object library minor version number.
#define NV_TENSORRT_SONAME_PATCH 3 //!< Shared object library patch version number.
</pre>
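The same macros can also be printed from a small C++ program; a minimal sketch (the include and build paths below assume a JetPack-style aarch64 install and are illustrative only):

<pre>
// version_check.cpp -- print the TensorRT version macros.
// Example build (adjust include paths to your install):
//   g++ version_check.cpp -I/usr/include/aarch64-linux-gnu -I/usr/local/cuda/include -o version_check
#include <cstdio>
#include <NvInfer.h>   // provides NV_TENSORRT_MAJOR / MINOR / PATCH / BUILD

int main()
{
    std::printf("TensorRT %d.%d.%d.%d\n",
                NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR,
                NV_TENSORRT_PATCH, NV_TENSORRT_BUILD);
    return 0;
}
</pre>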

===== <big>2. Is TensorRT thread-safe?</big> =====
The TensorRT runtime is thread-safe in the sense that parallel threads using different TRT execution contexts can execute in parallel without interference.
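To illustrate, here is a hedged sketch of that pattern using the TensorRT 5.x C++ API. It assumes an already-deserialized engine and per-thread device buffers (both outside the scope of the sketch); each thread owns its own execution context and CUDA stream:

<pre>
#include <thread>
#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// One IExecutionContext and one CUDA stream per thread; the shared
// ICudaEngine itself is only read, so the threads do not interfere.
void inferWorker(nvinfer1::ICudaEngine* engine, void** buffers, int batchSize)
{
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    context->enqueue(batchSize, buffers, stream, nullptr);  // asynchronous launch
    cudaStreamSynchronize(stream);                          // wait for the results

    cudaStreamDestroy(stream);
    context->destroy();   // TensorRT 5.x-style cleanup
}

void runParallel(nvinfer1::ICudaEngine* engine,
                 const std::vector<void**>& perThreadBuffers, int batchSize)
{
    std::vector<std::thread> workers;
    for (void** buffers : perThreadBuffers)
        workers.emplace_back(inferWorker, engine, buffers, batchSize);
    for (std::thread& t : workers)
        t.join();
}
</pre>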

===== <big>3. Can an INT8 calibration table be shared across different TRT versions or HW platforms?</big> =====
An INT8 calibration table is NOT compatible between different TRT versions, because the optimized network graph is likely to differ between versions. If you force TRT to use a mismatched table, it may not find the corresponding scaling factor for a given tensor.<br>
As long as the installed TensorRT version is identical across HW platforms, the INT8 calibration table is compatible. That means you can perform INT8 calibration on a faster compute platform, like V100 or P4, and then deploy the calibration table to Tegra for INT8 inferencing.
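The usual way to carry the table across machines is through the calibrator's cache interface. The sketch below is illustrative only: the class and the file name calib_table.bin are made up, while the virtual methods follow the TensorRT 5.x IInt8EntropyCalibrator2 API. On the deployment platform the cache is simply read back, so no calibration images are needed there.

<pre>
#include <fstream>
#include <iterator>
#include <vector>
#include <NvInfer.h>

class CachedCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    int getBatchSize() const override { return 1; }

    // Not used when readCalibrationCache() returns a valid cache.
    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        return false;
    }

    const void* readCalibrationCache(std::size_t& length) override
    {
        std::ifstream file("calib_table.bin", std::ios::binary);
        mCache.assign(std::istreambuf_iterator<char>(file),
                      std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* cache, std::size_t length) override
    {
        std::ofstream file("calib_table.bin", std::ios::binary);
        file.write(static_cast<const char*>(cache), length);
    }

private:
    std::vector<char> mCache;
};
</pre>
The calibrator is then passed to the builder with setInt8Calibrator() before the engine is built.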

===== <big>4. How to check GPU utilization?</big> =====
On Tegra platforms, we can use tegrastats:
<pre>
$ sudo /home/nvidia/tegrastats
</pre>
On desktop platforms, like Tesla, we can use nvidia-smi:
<pre>
$ nvidia-smi --format=csv -lms 500 --query-gpu=index,timestamp,utilization.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.video,clocks.current.memory,utilization.memory,memory.total,memory.free,memory.used,power.limit,power.draw,temperature.gpu,fan.speed,compute_mode,gpu_operation_mode.current,clocks_throttle_reasons.active,pstate,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sync_boost -i 0 | tee log.cs
</pre>

===== <big>5. What is kernel auto-tuning?</big> =====
TensorRT contains various kernel implementations, including those from cuDNN and cuBLAS, to accommodate diverse neural network configurations (batch size, input/output dimensions, filters, strides, pads, dilation rate, etc.). During network building, TensorRT profiles all suitable kernels, selects the one with the lowest latency, and marks it as the final tactic for running that layer. This process is called kernel auto-tuning.

Additionally, it is not always true that an INT8 kernel is faster than its FP16 counterpart, or FP16 faster than FP32, so:

* if you run in FP16 precision mode, TensorRT profiles all candidates in the FP16 and FP32 kernel pools;
* if you run in INT8 precision mode, TensorRT profiles all candidates in the INT8 and FP32 kernel pools;
* if both FP16 and INT8 are enabled (we call this hybrid mode), TensorRT profiles all candidates in the INT8, FP16, and FP32 kernel pools (see the builder sketch below).

If the current layer ends up in a different precision mode than its bottom (input) or top (output) layer, TensorRT inserts a reformatting layer between them to convert the tensor format, and the time spent in this reformatting layer is counted as part of the current layer's cost during auto-tuning.
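For reference, the precision mode is chosen on the builder before the engine is built. A minimal sketch using the TensorRT 5.x builder flags (the function and parameter names are illustrative):

<pre>
#include <NvInfer.h>

// Sketch: with both flags set ("hybrid mode"), INT8, FP16 and FP32 kernel
// pools are all profiled during kernel auto-tuning.
void enableHybridMode(nvinfer1::IBuilder& builder,
                      nvinfer1::IInt8Calibrator& calibrator)
{
    builder.setFp16Mode(true);                // allow FP16 tactics
    builder.setInt8Mode(true);                // allow INT8 tactics
    builder.setInt8Calibrator(&calibrator);   // INT8 needs a calibrator (see question 3)
}
</pre>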

===== <big>6. How does TensorRT behave with different batch sizes?</big> =====
For a relatively deep network, if the GPU is already fully occupied, we cannot obtain much performance gain from batching.

For a relatively simple network, the GPU is generally not fully loaded, so we can obtain a performance gain from batching.

In other words, inference time per frame improves with a bigger batch size only if the GPU is not already fully loaded.

===== <big>7. What is maxWorkspaceSize?</big> =====
maxWorkspaceSize is a threshold used to filter kernels/tactics by the workspace size they request. If the workspace size a tactic requires (for convolution, for example, cudnnGetConvolutionForwardWorkspaceSize() returns the needed size) is larger than the specified maxWorkspaceSize, that tactic is ignored during kernel auto-tuning.

NOTE: the final device memory TRT consumes has nothing to do with maxWorkspaceSize.
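The limit is set on the builder at build time; a short sketch with the TensorRT 5.x API (the 1 GiB value is only an example):

<pre>
#include <NvInfer.h>

// Sketch: allow kernel auto-tuning to consider tactics that need up to
// 1 GiB of scratch workspace; tactics needing more are skipped.
void setWorkspace(nvinfer1::IBuilder& builder)
{
    builder.setMaxWorkspaceSize(1ULL << 30);   // 1 GiB, example value
}
</pre>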

===== <big>8. What is the difference between enqueue() and execute()?</big> =====
* enqueue() requires the user to create and synchronize a cudaStream_t; when invoked, it returns immediately. With execute(), TensorRT creates/synchronizes the stream internally, so it does not return until everything is completed (a comparison sketch follows this list).
* enqueue() and execute() operate on the whole network/cudaEngine rather than a specific layer. There is no enqueue interface in a layer's implementation, so a layer only has an execute interface, including the IPlugin layer.
* Both enqueue() and execute() support profiling. The time consumed by each layer is printed after the whole execution is done, rather than as real-time profiling.
* setDebugSync is only supported by execute(). If this flag is true, the builder will synchronize (cudaDeviceSynchronize) after timing each layer and report the layer name.
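A hedged sketch of the two call styles with the TensorRT 5.x implicit-batch API; it assumes a ready IExecutionContext and device buffers already bound in the correct order:

<pre>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Synchronous style: execute() blocks until inference has finished.
void runSync(nvinfer1::IExecutionContext* context, void** buffers, int batchSize)
{
    context->execute(batchSize, buffers);
}

// Asynchronous style: enqueue() returns immediately; the caller owns the
// stream and decides when to synchronize.
void runAsync(nvinfer1::IExecutionContext* context, void** buffers, int batchSize)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->enqueue(batchSize, buffers, stream, nullptr);
    // ...other CPU or GPU work can overlap here...
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
</pre>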

=== Official FAQ ===
[https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#troubleshooting TensorRT Developer Guide#FAQs]<br>

=== Common FAQ ===
You can find answers to some common questions about using TRT here.<br>
Refer to the page [https://elinux.org/TensorRT/CommonFAQ TensorRT/CommonFAQ]<br>

=== TRT Accuracy FAQ ===
If your FP16 or INT8 results are not as expected, the page below may help you fix the accuracy issues.<br>
Refer to the page [https://elinux.org/TensorRT/AccuracyIssues TensorRT/AccuracyIssues]<br>

=== TRT Performance FAQ ===
If TRT inference performance is not as expected, the page below may help you optimize it.<br>
Refer to the page [https://elinux.org/TensorRT/PerfIssues TensorRT/PerfIssues]<br>

=== TRT Int8 Calibration FAQ ===
The page below presents some FAQs about TRT INT8 calibration.<br>
Refer to the page [https://elinux.org/TensorRT/Int8CFAQ TensorRT/Int8CFAQ]<br>

=== TRT Plugin FAQ ===
The page below presents some FAQs about TRT plugins.<br>
Refer to the page [https://elinux.org/TensorRT/PluginFAQ TensorRT/PluginFAQ]<br>

=== How to fix some Common Errors ===
If you hit errors while using TRT, look for the answer on the page below.<br>
Refer to the page [https://elinux.org/TensorRT/CommonErrorFix TensorRT/CommonErrorFix]<br>

=== How to debug or analyze ===
The page below will help you debug your inference in several ways.<br>
Refer to the page [https://elinux.org/TensorRT/How2Debug TensorRT/How2Debug]<br>

=== TRT & YoloV3 FAQ ===
Refer to the page [https://elinux.org/TensorRT/YoloV3 TensorRT/YoloV3]<br>