TensorRT/CommonFAQ

===== <big> How to check TensorRT version?</big> =====

There are two methods to check the TensorRT version,

* Symbols from the library

<pre>
$ nm -D /usr/lib/aarch64-linux-gnu/libnvinfer.so | grep "tensorrt"
0000000007849eb0 B tensorrt_build_svc_tensorrt_20181028_25152976
0000000007849eb4 B tensorrt_version_5_0_3_2
</pre>

NOTE: 20181028 is the build date, 25152976 is the top changelist, and 5_0_3_2 is the version information.

* Macros from the header file

<pre>
$ cat /usr/include/aarch64-linux-gnu/NvInfer.h | grep "define NV_TENSORRT"
#define NV_TENSORRT_MAJOR 5 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 3 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 2 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 5 //!< Shared object library major version number.
#define NV_TENSORRT_SONAME_MINOR 0 //!< Shared object library minor version number.
#define NV_TENSORRT_SONAME_PATCH 3 //!< Shared object library patch version number.
</pre>
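
If you need the version inside your own application, the same macros can be read programmatically. A minimal sketch, assuming NvInfer.h (and the CUDA headers it pulls in) are on your include path; the file name is only an example:

<pre>
// version_check.cpp: print the TensorRT version this binary was compiled against,
// using the NV_TENSORRT_* macros shown above (no linking against libnvinfer needed).
#include <cstdio>
#include <NvInfer.h>

int main()
{
    std::printf("Compiled against TensorRT %d.%d.%d.%d\n",
                NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR,
                NV_TENSORRT_PATCH, NV_TENSORRT_BUILD);
    return 0;
}
</pre>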
----
  
 
===== <big> Is TensorRT thread-safe?</big> =====

The TensorRT runtime is thread-safe in the sense that parallel threads using different TRT execution contexts can execute in parallel without interference.
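
As an illustration of that rule, the sketch below gives each thread its own execution context created from one shared engine; the engine, the per-thread device buffers and the batch size are assumed to already exist, and error handling is omitted:

<pre>
// Sketch: one shared ICudaEngine, one IExecutionContext per thread.
// `engine`, the per-thread `bindings` arrays and `batchSize` are assumed to exist.
#include <thread>
#include <vector>
#include <NvInfer.h>

void inferWorker(nvinfer1::ICudaEngine* engine, void** bindings, int batchSize)
{
    // Each thread owns its own context; a single context must not be used by two threads at once.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    context->execute(batchSize, bindings);  // synchronous inference inside this thread
    context->destroy();
}

void runParallel(nvinfer1::ICudaEngine* engine,
                 const std::vector<void**>& perThreadBindings, int batchSize)
{
    std::vector<std::thread> workers;
    for (void** bindings : perThreadBindings)
        workers.emplace_back(inferWorker, engine, bindings, batchSize);
    for (std::thread& t : workers)
        t.join();
}
</pre>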
===== <big> Can an INT8 calibration table be compatible across different TRT versions or HW platforms?</big> =====

The INT8 calibration table is absolutely NOT compatible between different TRT versions, because the optimized network graph will probably differ among TRT versions. If you force TRT to use a mismatched table, it may not find the corresponding scaling factor for a given tensor.<br>
As long as the installed TensorRT version is identical across HW platforms, the INT8 calibration table is compatible. That means you can perform INT8 calibration on a faster computation platform, like V100 or T4, and then deploy the calibration table to Tegra for INT8 inferencing.
----
 
===== <big> How to check GPU utilization?</big> =====

On the Tegra platform, we can use tegrastats to achieve that,

<pre>
$ sudo /home/nvidia/tegrastats
</pre>

On a desktop platform, like Tesla, we can use nvidia-smi to achieve that,

<pre>
$ nvidia-smi --format=csv -lms 500 --query-gpu=index,timestamp,utilization.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.video,clocks.current.memory,utilization.memory,memory.total,memory.free,memory.used,power.limit,power.draw,temperature.gpu,fan.speed,compute_mode,gpu_operation_mode.current,clocks_throttle_reasons.active,pstate,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sync_boost -i 0 | tee log.csv
</pre>
----
  
 
===== <big> What is kernel auto-tuning?</big> =====

TensorRT contains various kernel implementations, including those from cuDNN and cuBLAS, to accommodate diverse neural network configurations (batch, input/output dims, filters, strides, pads, dilation rate, etc.). During network building, TensorRT profiles all suitable kernels, finds the one with the smallest latency, and marks it as the final tactic for running that layer. We call this process kernel auto-tuning.
Additionally, it is not always true that an INT8 kernel is faster than its FP16 counterpart, or FP16 faster than FP32, so
* if you run FP16 precision mode, it profiles all candidates in the FP16 kernel pool and the FP32 kernel pool.
* if you run INT8 precision mode, it profiles all candidates in the INT8 kernel pool and the FP32 kernel pool.
* if both FP16 and INT8 are enabled (we call it hybrid mode), it profiles all candidates in the INT8 kernel pool, the FP16 kernel pool and the FP32 kernel pool.
If the current layer chooses a different mode than its bottom layer or top layer, TensorRT will insert a reformatting layer between them to do the tensor format conversion, and the time for this reformatting layer is counted as part of the current layer's cost during auto-tuning.
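
The FP16/INT8 kernel pools mentioned above are opened by precision flags set on the builder before the engine is built. A minimal sketch using the TensorRT 5-style builder API (newer releases move these flags to IBuilderConfig); `builder`, `network` and `calibrator` are assumed to already exist:

<pre>
// Sketch: enabling the FP16 and INT8 kernel pools for auto-tuning (TensorRT 5-style API).
#include <NvInfer.h>

nvinfer1::ICudaEngine* buildHybridEngine(nvinfer1::IBuilder* builder,
                                         nvinfer1::INetworkDefinition* network,
                                         nvinfer1::IInt8Calibrator* calibrator)
{
    builder->setFp16Mode(true);             // allow FP16 kernels to be profiled
    builder->setInt8Mode(true);             // allow INT8 kernels to be profiled (hybrid mode)
    builder->setInt8Calibrator(calibrator); // provides the INT8 scaling factors
    // FP32 kernels are always part of the candidate pool.
    return builder->buildCudaEngine(*network);
}
</pre>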
----
 
===== <big> How does TensorRT behave when different batch sizes are used?</big> =====

For a relatively deep network, if the GPU is fully occupied, we cannot obtain much performance gain from batching.<br>
For a relatively simple network, the GPU is generally not fully loaded, so we can obtain a performance gain from batching.<br>
In other words, inference time per frame can be improved with a bigger batch size only if the GPU loading is not full.
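
For reference, the batch size is bounded at build time and chosen per call at inference time. A minimal sketch (TensorRT 5-style API); `builder`, `network` and `bindings` are assumed to exist, with buffers sized for the maximum batch:

<pre>
// Sketch: build with a maximum batch size, then run different batch sizes at inference.
#include <NvInfer.h>

void buildAndRun(nvinfer1::IBuilder* builder, nvinfer1::INetworkDefinition* network,
                 void** bindings)
{
    const int maxBatchSize = 8;
    builder->setMaxBatchSize(maxBatchSize);  // upper bound fixed at build time
    nvinfer1::ICudaEngine* engine = builder->buildCudaEngine(*network);
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    for (int batchSize : {1, 4, 8})          // any batch size <= maxBatchSize is valid
        context->execute(batchSize, bindings);

    context->destroy();
    engine->destroy();
}
</pre>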
----
 
===== <big> What is maxWorkspaceSize?</big> =====

maxWorkspaceSize is a threshold used to filter out kernels/tactics whose required workspace size exceeds it. In other words, if the workspace size a tactic requires (for convolution, for example, cudnnGetConvolutionForwardWorkspaceSize() reports the needed workspace size) is larger than the value we specify for maxWorkspaceSize, that tactic will be ignored during kernel auto-tuning.<br>
NOTE: the final device memory TRT consumes has nothing to do with maxWorkspaceSize.
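
The threshold itself is set on the builder before building the engine (on IBuilderConfig in newer releases). A minimal sketch; the 1 GiB value is only illustrative:

<pre>
// Sketch: cap the scratch memory any single tactic may request during auto-tuning.
#include <cstddef>
#include <NvInfer.h>

void setWorkspaceLimit(nvinfer1::IBuilder* builder)
{
    const std::size_t oneGiB = std::size_t(1) << 30;
    builder->setMaxWorkspaceSize(oneGiB);  // tactics needing more workspace are skipped
}
</pre>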
----
 
===== <big> What is the difference between enqueue() and execute()?</big> =====
* enqueue() requires the user to create and synchronize a CUDA stream (cudaStream_t). When invoked, it submits the work to the GPU and returns immediately on the CPU side. With execute(), TensorRT creates and synchronizes the stream internally, so execute() does not return until everything has completed on the GPU (see the sketch after this list).
* enqueue() and execute() are APIs of the whole network/execution context, not of individual layers (all layer implementations work in an asynchronous manner, including IPlugin layers).
* Both enqueue() and execute() support layer profiling. The time consumed by each layer is recorded during network execution and printed out after the whole execution is done; it is not real-time profiling.
* setDebugSync() is only supported together with execute(). If this flag is true, the builder or execution context will invoke cudaDeviceSynchronize() and report the status to the user after each layer's execution is done.
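
A side-by-side sketch of the two calls; `context` and `bindings` are assumed to exist and error checking is omitted:

<pre>
// Sketch: synchronous execute() vs. asynchronous enqueue() on a user-managed stream.
#include <cuda_runtime_api.h>
#include <NvInfer.h>

void runSync(nvinfer1::IExecutionContext* context, void** bindings, int batchSize)
{
    // execute() blocks the calling CPU thread until the GPU work has finished.
    context->execute(batchSize, bindings);
}

void runAsync(nvinfer1::IExecutionContext* context, void** bindings, int batchSize)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // enqueue() submits the work to `stream` and returns immediately on the CPU side.
    context->enqueue(batchSize, bindings, stream, nullptr);
    // The caller is responsible for synchronizing before reading the outputs.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
</pre>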
----
 
 
===== <big> How to enable VERBOSE log?</big> =====

You can enable the VERBOSE log for your sample through the following API provided by YOUR_TRT_PATH/samples/common/logging.h,

  gLogger.setReportableSeverity(Severity::kVERBOSE)

It is always recommended to provide the VERBOSE log to NVIDIA when asking for further assistance or help.
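
If you are not using the sample's gLogger, an equivalent minimal logger can be written by hand. The sketch below uses the TensorRT 5/6/7-style ILogger::log() signature (newer releases add noexcept and AsciiChar) and simply prints every message, which is what a kVERBOSE reportable severity amounts to:

<pre>
// Sketch: a minimal ILogger that reports everything down to VERBOSE severity.
#include <iostream>
#include <NvInfer.h>

class VerboseLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) override
    {
        // Print every message, including Severity::kVERBOSE ones.
        std::cout << msg << std::endl;
        (void)severity;
    }
};

VerboseLogger gVerboseLogger;  // pass to createInferBuilder(gVerboseLogger) / createInferRuntime(...)
</pre>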
----
  
 
===== <big> Can a TensorRT model be compatible across different GPU platforms or TensorRT versions?</big> =====

The answer is definitely NO. The reasons are as follows,
* Different GPU platforms have different SM counts, CUDA cores, Tensor Cores, and memory efficiency, so the same kernel may perform differently on each platform. That means the TensorRT builder may choose a different kernel/tactic implementation during network optimization, even for the same input configuration.
* Network fusion and layer kernels are evolving all the time, so when a different TensorRT version is used, the optimized network graph or the chosen best tactic might differ. It is possible that a kernel chosen for the current TensorRT engine does not exist in the other TensorRT version.
----
  
 
===== <big> What is the maximum model size TensorRT can support?</big> =====

TensorRT can support a model of at most 2 GB (INT_MAX bytes), which is also the hard limit of Google protobuf.
----
  
 
===== <big> Why does TRT define axis parameters differently from other frameworks, like Caffe or TensorFlow?</big> =====

TensorRT treats batch as an immutable dimension over the whole network, so it mostly starts indexing from channel (as axis 0). When you specify axis 1 and expect processing to start from channel, TensorRT might consider that you mean starting from H. In a word, a TensorRT layer deals with CHW rather than NCHW. The TensorRT parser code is open on GitHub, so you can check the parsing function there to understand exactly how it handles your layer parameter and fix any discrepancy accordingly (the general solution is to subtract 1 from the axis to strip out the batch dimension).
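
A tiny helper illustrating the "subtract 1" rule when mapping a framework axis defined over NCHW to a TensorRT axis defined over CHW; the function name is hypothetical:

<pre>
// Sketch: map an NCHW framework axis (batch included) to a TensorRT CHW axis (batch excluded).
#include <cassert>

int frameworkAxisToTrtAxis(int frameworkAxis)
{
    // Axis 0 is the batch dimension, which has no TensorRT equivalent.
    assert(frameworkAxis >= 1 && "the batch axis cannot be addressed inside TensorRT");
    return frameworkAxis - 1;  // e.g. Caffe/TF axis 1 (channel) -> TensorRT axis 0
}
</pre>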
