Difference between revisions of "TensorRT"

From eLinux.org
(FAQ)
(TRT Int8 Calibration FAQ)
 
(74 intermediate revisions by 3 users not shown)
Line 7:
<br>
== FAQ ==
+ === Official FAQ ===
+ [https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#troubleshooting TensorRT Developer Guide#FAQs]<br>
+ ----
+ === Common FAQ ===
+ You can find answers to some common questions about using TRT here.<br>
+ Refer to the page [https://elinux.org/TensorRT/CommonFAQ TensorRT/CommonFAQ]<br>
+ ----
+ === TRT Accuracy FAQ ===
+ If your FP16 or Int8 result is not as expected, the page below may help you fix the accuracy issues.<br>
+ Refer to the page [https://elinux.org/TensorRT/AccuracyIssues TensorRT/AccuracyIssues]<br>
+ ----
+ === TRT Performance FAQ ===
+ If inference performance with TRT is not as expected, the page below may help you optimize it.<br>
+ Refer to the page [https://elinux.org/TensorRT/PerfIssues TensorRT/PerfIssues]<br>
+ ----
+ === TRT Int8 Calibration FAQ ===
+ The page below presents some FAQs about TRT Int8 calibration.<br>
+ Refer to the page [https://elinux.org/TensorRT/Int8CFAQ TensorRT/Int8CFAQ]<br>
+ ----
+ === TRT Plugin FAQ ===
+ The page below presents some FAQs about TRT plugins.<br>
+ Refer to the page [https://elinux.org/TensorRT/PluginFAQ TensorRT/PluginFAQ]<br>
+ ----
+ === How to fix some Common Errors ===
+ If you encounter errors while using TRT, refer to the page below for answers.<br>
+ Refer to the page [https://elinux.org/TensorRT/CommonErrorFix TensorRT/CommonErrorFix]<br>
+ ----
+ === How to debug or analyze ===
+ The page below will help you debug your inference in several ways.<br>
+ Refer to the page [https://elinux.org/TensorRT/How2Debug TensorRT/How2Debug]<br>
+ ----
+ === TRT & YoloV3 FAQ ===
+ Refer to the page [https://elinux.org/TensorRT/YoloV3 TensorRT/YoloV3]<br>
+ ----
+ == Known issue ==
+ Here we list all the known issues that have been clarified in different TensorRT versions.
===== <big>1. How to check TensorRT version?</big> =====
There are two methods to check the TensorRT version:
* Symbols from the library
<pre>
$ nm -D /usr/lib/aarch64-linux-gnu/libnvinfer.so | grep "tensorrt"
0000000007849eb0 B tensorrt_build_svc_tensorrt_20181028_25152976
0000000007849eb4 B tensorrt_version_5_0_3_2
</pre>
NOTE: 20181028 is the build date, 25152976 is the top changelist, and 5_0_3_2 is the version information.<br>
* Macros from the header file
<pre>
$ cat /usr/include/aarch64-linux-gnu/NvInfer.h | grep "define NV_TENSORRT"
#define NV_TENSORRT_MAJOR 5 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 3 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 2 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 5 //!< Shared object library major version number.
#define NV_TENSORRT_SONAME_MINOR 0 //!< Shared object library minor version number.
#define NV_TENSORRT_SONAME_PATCH 3 //!< Shared object library patch version number.
</pre>
===== <big>2. Is TRT thread-safe?</big> =====
The TensorRT runtime is thread-safe in the sense that parallel threads using different TRT Execution Contexts can execute in parallel without interference.
===== <big>3. Is an INT8 calibration table compatible across different TRT versions or HW platforms?</big> =====
An INT8 calibration table is absolutely NOT compatible between different TRT versions, because the optimized network graph is probably different among TRT versions. If you force its use anyway, TRT may not find the corresponding scaling factor for a given tensor.<br>
As long as the installed TensorRT version is identical on the different HW platforms, the INT8 calibration table is compatible. That means you can perform INT8 calibration on a faster computation platform, like V100 or P4, and then deploy the calibration table to Tegra for INT8 inference.
===== <big>4. How to check GPU utilization?</big> =====
On the Tegra platform, tegrastats can be used:
<pre>
$ sudo /home/nvidia/tegrastats
</pre>
On a desktop platform, like Tesla, nvidia-smi can be used:
<pre>
$ nvidia-smi --format=csv -lms 500 --query-gpu=index,timestamp,utilization.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.video,clocks.current.memory,utilization.memory,memory.total,memory.free,memory.used,power.limit,power.draw,temperature.gpu,fan.speed,compute_mode,gpu_operation_mode.current,clocks_throttle_reasons.active,pstate,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sync_boost -i 0 | tee log.cs
</pre>
===== <big>5. What is kernel auto-tuning?</big> =====
TensorRT contains various kernel implementations, including those from cuDNN and cuBLAS, to accommodate diverse neural network configurations (batch, input/output dims, filters, strides, pads, dilation rate, etc.). During network building, TensorRT profiles all suitable kernels, finds the one with the smallest latency, and marks it as the final tactic for that layer. This process is called kernel auto-tuning.<br>
Additionally, an INT8 kernel is not always faster than an FP16 kernel, nor is an FP16 kernel always faster than an FP32 one, so:
* if you run FP16 precision mode, it profiles all candidates in the FP16 kernel pool and the FP32 kernel pool.
* if you run INT8 precision mode, it profiles all candidates in the INT8 kernel pool and the FP32 kernel pool.
* if both FP16 and INT8 are enabled (we call it hybrid mode), it profiles all candidates in the INT8 kernel pool, the FP16 kernel pool and the FP32 kernel pool.
If the current layer chooses a different mode from its bottom layer or top layer, TensorRT inserts a reformatting layer between them to convert the tensor format, and the time for this reformatting layer is counted as part of the cost of the current layer during auto-tuning.
===== <big>6. How does TensorRT behave with different batch sizes?</big> =====
For a relatively deep network, if the GPU is fully occupied, we can't obtain much performance gain from batching.<br>
For a relatively simple network, the GPU is generally not fully loaded, so we can obtain a performance gain from batching.<br>
In other words, inference time per frame can be improved with the bigger batch size only if GPU loading is not full.
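
The per-frame figure mentioned above is just the latency of one batched call divided by the batch size. A minimal C++ sketch of that comparison follows; the helper names perFrameLatencyMs and batchingHelps are hypothetical and not part of TensorRT, and the latencies passed in are whatever you measured yourself.

```cpp
#include <cassert>

// Hypothetical helper (not a TensorRT API): average latency per frame for
// one batched inference call that took batchLatencyMs milliseconds.
inline double perFrameLatencyMs(double batchLatencyMs, int batchSize)
{
    assert(batchSize > 0);
    return batchLatencyMs / batchSize;
}

// Batching pays off only if the per-frame latency drops as the batch grows.
inline bool batchingHelps(double smallBatchMs, int smallBatch,
                          double largeBatchMs, int largeBatch)
{
    return perFrameLatencyMs(largeBatchMs, largeBatch)
         < perFrameLatencyMs(smallBatchMs, smallBatch);
}
```

If the batched call takes batch-size times as long as a single-frame call, the GPU was already fully loaded and batchingHelps returns false.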
 
===== <big>7. What is maxWorkspaceSize?</big> =====
 
maxWorkspaceSize is the upper limit on the workspace that any single kernel/tactic may use: only tactics whose required workspace size does not exceed maxWorkspaceSize are considered. In other words, if the workspace a tactic requires (for convolution, for example, cudnnGetConvolutionForwardWorkspaceSize() returns the needed workspace size) is larger than the value given for maxWorkspaceSize, that tactic is ignored during kernel auto-tuning.<br>
 
NOTE: The final device memory that TRT consumes has nothing to do with maxWorkspaceSize.
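
The filtering described above can be pictured with a small stand-alone C++ sketch. It is only a model of the idea, not TensorRT's actual implementation: the Tactic struct and pickTactic function are hypothetical names, and the workspace and latency values are assumed to come from profiling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one candidate kernel; not a TensorRT type.
struct Tactic
{
    std::string name;
    std::size_t workspaceBytes; // scratch memory the kernel needs
    double latencyMs;           // time measured while profiling
};

// Returns the index of the fastest tactic whose workspace fits the limit,
// or -1 if no tactic fits.
inline int pickTactic(const std::vector<Tactic>& candidates,
                      std::size_t maxWorkspaceSize)
{
    int best = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i)
    {
        if (candidates[i].workspaceBytes > maxWorkspaceSize)
            continue; // needs more workspace than allowed: ignored
        if (best < 0 || candidates[i].latencyMs < candidates[best].latencyMs)
            best = static_cast<int>(i);
    }
    return best;
}
```

With a small limit, a fast tactic that needs a large workspace is skipped and a slower one is chosen, which is why a too-small maxWorkspaceSize can cost performance.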
 
===== <big>8. What is the difference between enqueue() and execute()?</big> =====
 
* enqueue() requires the user to create and synchronize a cudaStream_t, and it returns immediately when invoked. With execute(), TensorRT creates and synchronizes the stream internally, so it does not return until everything is completed.
 
* enqueue() and execute() apply to the whole network/cudaEngine rather than to a specific layer. There is no enqueue interface in a layer's implementation, so a layer only has an execute interface, including the IPlugin layer.
 
* Both enqueue() and execute() support profiling now. The time consumption of each layer is printed once the whole execution is done, rather than in real time.
 
* setDebugSync is only supported by execute.  If this flag is true, the builder will synchronize (cudaDeviceSynchronize) after timing each layer and report the layer name.
 
===== <big>9. How to dump the output of certain layer?</big> =====
 
TensorRT doesn’t store the intermediate results of your network, so you have to use the following API to mark the output tensor of the intended layer as a network output, then run inference again and save its result for further analysis:
 
<pre>
 
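// markOutput() takes the layer's output tensor (ITensor&), not its name; layer is the ILayer* of the intended layer
network->markOutput(*layer->getOutput(0));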
 
</pre>
 
NOTE:
 
* You can mark multiple layers as outputs at the same time, but marking a layer as an output may break the network optimization and impact inference performance, because TensorRT always runs output layers in FP32 mode, no matter which mode you have configured.
 
* Don’t forget to adjust the dimensions or output buffer size after you change the output layers.
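
As a reminder of what adjusting the buffer size involves, here is a minimal C++ sketch that computes the bytes needed for one binding. The helper names volume and bindingBytes are hypothetical; the default element size is that of float because, as noted above, output layers run in FP32.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements in one sample, e.g. {C, H, W} for an image-like tensor.
inline std::size_t volume(const std::vector<int>& dims)
{
    std::size_t v = 1;
    for (int d : dims)
    {
        assert(d > 0);
        v *= static_cast<std::size_t>(d);
    }
    return v;
}

// Bytes to allocate for one binding at the given batch size.
inline std::size_t bindingBytes(int batchSize, const std::vector<int>& dims,
                                std::size_t elemSize = sizeof(float))
{
    assert(batchSize > 0);
    return static_cast<std::size_t>(batchSize) * volume(dims) * elemSize;
}
```

Each newly marked output needs its own device buffer of this size, plus a matching host buffer if you copy the result back for inspection.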
 
===== <big>10. How to analyze network performance?</big> =====
 
First of all, we should be aware of the profiling command tool that TensorRT provides - [https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#trtexec trtexec].<br>
 
If all your network layers are supported by TensorRT, either natively or through plugins, you can always use this tool to profile your network very quickly. <br>
 
Second, you can add profiling metrics to your application manually, on the CPU side ([https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#cpu-timing link]) or on the GPU side ([https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html#cuda-events link]). <br>
 
NOTE:
 
* The timed region should contain only the network enqueue() or execute() call; any context set-up, memory initialization, or refill operation should be excluded.

* Use more iterations for the time collection, in order to average out the GPU warm-up effect.<br>
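
The two notes above can be sketched as a small CPU-side timing harness in C++. The function name averageLatencyMs is hypothetical, and the callable passed in is assumed to be synchronous, for example execute(), or enqueue() followed by a stream synchronization; otherwise only the launch time would be measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `infer` warmUp times untimed, then returns the mean wall-clock
// time in milliseconds over `iterations` timed calls.
inline double averageLatencyMs(const std::function<void()>& infer,
                               int warmUp, int iterations)
{
    assert(iterations > 0);
    for (int i = 0; i < warmUp; ++i)
        infer(); // warm-up runs are not timed

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        infer();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / iterations;
}
```

Set-up work such as buffer allocation and input copies stays outside the callable, so only the inference call itself is timed.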
 
Third, if you would like to measure the time consumption of each layer, you can implement IProfiler, or use the SimpleProfiler that TensorRT already provides (refer to the patch below for sampleSSD):
 
<pre>
 
--- sampleSSD.cpp.orig 2019-05-27 12:39:14.193521455 +0800
+++ sampleSSD.cpp 2019-05-27 12:38:59.393358775 +0800
@@ -428,8 +428,11 @@
    float* detectionOut = new float[N * kKEEP_TOPK * 7];
    int* keepCount = new int[N];
 
+    SimpleProfiler profiler (" layer time");
+    context->setProfiler(&profiler);
    // Run inference
    doInference(*context, data, detectionOut, keepCount, N);
+    std::cout << profiler;
 
    bool pass = true;
 
</pre>
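
If you prefer to implement the profiler yourself, the core of it is a callback that accumulates time per layer. The following C++ sketch is stand-alone so that it compiles without TensorRT: the class name LayerTimeRecorder is hypothetical, and in a real application the class would derive from nvinfer1::IProfiler, with reportLayerTime() as the overridden method.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-alone sketch. In a real application this would derive from
// nvinfer1::IProfiler and reportLayerTime() would be the override.
class LayerTimeRecorder
{
public:
    // Called once per layer per inference; accumulates the time per layer.
    void reportLayerTime(const char* layerName, float ms)
    {
        mTotals[layerName] += ms;
    }

    // Layers sorted by accumulated time, slowest first.
    std::vector<std::pair<std::string, float>> slowestFirst() const
    {
        std::vector<std::pair<std::string, float>> out(mTotals.begin(), mTotals.end());
        std::sort(out.begin(), out.end(),
                  [](const std::pair<std::string, float>& a,
                     const std::pair<std::string, float>& b) { return a.second > b.second; });
        return out;
    }

private:
    std::map<std::string, float> mTotals;
};
```

Sorting by accumulated time makes it easy to see which layers dominate the inference time.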
 
===== <big>11. How to optimize network performance?</big> =====
 
When you start optimizing your network performance, please ensure the methodology you use to profile your network is sound (refer to question #10).<br>
 
* '''Multi-batch''' <br />We strongly recommend running your network in multi-batch mode so that the GPU computation resources are fully utilized. Multi-batch inference generally performs better, unless your network is already deep or complicated enough to saturate the GPU.
 
* '''Lower precision mode'''<br />TensorRT supports inference in FP16 or INT8 mode. Generally, speed improves from FP32 to FP16 to INT8. <br />Enabling FP16 is very simple:<pre>builder->setFp16Mode(true);</pre><br />For INT8, if you don’t care about correctness or accuracy during network evaluation, you can simply use dummy dynamic ranges to get the network running in INT8:<pre>samplesCommon::setAllTensorScales(network.get());</pre><pre>builder->setInt8Mode(true);</pre><br />NOTE: If you finally choose INT8 as the deployment mode, you have to implement the ICalibrator or set a proper dynamic range for your network. <br>If the performance of INT8 or FP16 is not significantly improved, don't panic; break the issue down step by step:
 
** Dump the per-layer time and compare it between FP32 and FP16 or INT8.

** Find the layers that take the most time. If they are FC layers, you probably need to enable hybrid mode (enable both FP16 and INT8):<pre>builder->setFp16Mode(true);</pre><pre>builder->setInt8Mode(true);</pre><pre>builder->setInt8Calibrator(&calibrator); </pre><pre>// or samplesCommon::setAllTensorScales(network.get())</pre>
 
** If your network has many plugin layers, or the plugin layers are interleaved through the whole network, TensorRT will insert many reformat layers to convert the data layout between normal layers and plugin layers. These reformat layers can be eliminated in some cases, for example in a network with PReLU (which is not supported by TensorRT 5.1 or earlier versions). After the network training is done, the slope parameter of PReLU can be converted to the negative slope hyperparameter of LeakyReLU (which is natively supported by TensorRT).
 
** Generally, the speedup from lower precision modes comes mainly from the convolution layers. If the convolution layers account for only a small part of your network's inference time, lower precision (FP16 or INT8) is not expected to help performance much. In this case, you should consider how to optimize the non-convolution layers, or contact NVIDIA for advice.
 
* '''Network pruning'''<br />This is beyond what TensorRT can help with, but it should also be kept in mind for network optimization, for example the network pruning provided in NVIDIA TLT. <br />From the GPU HW perspective, there are also some tricks to use when designing or pruning networks. For example, the Tensor Cores on T4 or Xavier are friendlier to convolutions whose channel counts are multiples of 32 or 64. Keep this in mind when you design or prune your feature extraction layers.
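
For the channel-count hint in the last item, a tiny C++ sketch shows how a channel count could be rounded up to the next multiple of 32 or 64 when choosing layer widths. The helper name roundUpChannels is hypothetical.

```cpp
#include <cassert>

// Rounds a channel count up to the next multiple of `multiple`
// (for example 32 or 64); counts already aligned are returned unchanged.
inline int roundUpChannels(int channels, int multiple)
{
    assert(channels > 0 && multiple > 0);
    return ((channels + multiple - 1) / multiple) * multiple;
}
```

For instance, a layer pruned to 48 channels would be widened to 64 for a multiple of 32. Whether rounding up or pruning further down is the better trade-off depends on your accuracy budget.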
 
===== <big>12. How to fix the error “Could not find scales for tensor xxxx” for INT8 mode?</big> =====
 
Generally, after INT8 calibration is done, the Int8Calibrator saves the scaling factors into a local file (through the API writeCalibrationCache), so that subsequent runs do not need to calibrate again and can load the cached calibration table directly (through the API readCalibrationCache).<br>
 
If you change or update the network, or run it on a different GPU platform or with a different TensorRT version, you may get the error “Could not find scales for tensor xxxx”, which indicates that the builder couldn’t find the corresponding scaling factor in the locally cached calibration table. This is expected, since the network graph after fusion changes with the GPU platform, the TensorRT version, or modifications to the network itself. The solution is simple: remove the local calibration table and run calibration again.
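
The caching behaviour described above can be sketched as follows. This is a stand-alone C++ illustration that compiles without TensorRT: the class name CalibrationCacheFile and the discard() helper are hypothetical, and in a real calibrator the two cache methods would be the overrides of readCalibrationCache() and writeCalibrationCache().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Stand-alone sketch of calibration-table caching; not a TensorRT class.
class CalibrationCacheFile
{
public:
    explicit CalibrationCacheFile(std::string path) : mPath(std::move(path)) {}

    // Returns the cached table, or nullptr (length 0) when no cache exists,
    // which tells the caller that calibration has to run again.
    const void* readCalibrationCache(std::size_t& length)
    {
        mData.clear();
        std::ifstream in(mPath, std::ios::binary);
        if (in)
            mData.assign(std::istreambuf_iterator<char>(in),
                         std::istreambuf_iterator<char>());
        length = mData.size();
        return mData.empty() ? nullptr : mData.data();
    }

    // Stores the table produced by a finished calibration run.
    void writeCalibrationCache(const void* ptr, std::size_t length)
    {
        std::ofstream out(mPath, std::ios::binary);
        out.write(static_cast<const char*>(ptr), static_cast<std::streamsize>(length));
    }

    // Deletes a stale table; returns true if a file was removed.
    bool discard() { return std::remove(mPath.c_str()) == 0; }

private:
    std::string mPath;
    std::vector<char> mData;
};
```

Calling discard() after changing the network, the GPU platform, or the TensorRT version makes the next read return nullptr, so a fresh table is generated instead of a mismatched one being reused.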
 

Latest revision as of 01:49, 14 August 2019

NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.

Introduction

developer.nvidia.com/tensorrt
TensorRT Developer Guide

FAQ

Official FAQ

TensorRT Developer Guide#FAQs


Common FAQ

You can find answers to some common questions about using TRT here.
Refer to the page TensorRT/CommonFAQ


TRT Accuracy FAQ

If your FP16 or Int8 result is not as expected, the page below may help you fix the accuracy issues.
Refer to the page TensorRT/AccuracyIssues


TRT Performance FAQ

If inference performance with TRT is not as expected, the page below may help you optimize it.
Refer to the page TensorRT/PerfIssues


TRT Int8 Calibration FAQ

The page below presents some FAQs about TRT Int8 calibration.
Refer to the page TensorRT/Int8CFAQ


TRT Plugin FAQ

The page below presents some FAQs about TRT plugins.
Refer to the page TensorRT/PluginFAQ


How to fix some Common Errors

If you encounter errors while using TRT, refer to the page below for answers.
Refer to the page TensorRT/CommonErrorFix


How to debug or analyze

The page below will help you debug your inference in several ways.
Refer to the page TensorRT/How2Debug


TRT & YoloV3 FAQ

Refer to the page TensorRT/YoloV3


Known issue

Here we list all the known issues that have been clarified in different TensorRT versions.