TensorRT

NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.

Introduction
developer.nvidia.com/tensorrt TensorRT Developer Guide

Official FAQ
TensorRT Developer Guide#FAQs

1. How to check TensorRT version?
There are two methods to check the TensorRT version:

 * Symbols from library

$ nm -D /usr/lib/aarch64-linux-gnu/libnvinfer.so | grep "tensorrt"
0000000007849eb0 B tensorrt_build_svc_tensorrt_20181028_25152976
0000000007849eb4 B tensorrt_version_5_0_3_2

NOTE: 20181028 is the build date, 25152976 is the top changelist, and 5_0_3_2 is the version information.

 * Macros from header file

$ cat /usr/include/aarch64-linux-gnu/NvInfer.h | grep "define NV_TENSORRT"
#define NV_TENSORRT_MAJOR 5 //!< TensorRT major version.
#define NV_TENSORRT_MINOR 0 //!< TensorRT minor version.
#define NV_TENSORRT_PATCH 3 //!< TensorRT patch version.
#define NV_TENSORRT_BUILD 2 //!< TensorRT build number.
#define NV_TENSORRT_SONAME_MAJOR 5 //!< Shared object library major version number.
#define NV_TENSORRT_SONAME_MINOR 0 //!< Shared object library minor version number.
#define NV_TENSORRT_SONAME_PATCH 3 //!< Shared object library patch version number.
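If the version has to be checked from a script or a tool, the symbol name printed by nm can be split into its four fields. The following is only a minimal sketch: parseTrtVersionSymbol is a hypothetical helper (not part of TensorRT), and it assumes the symbol follows the tensorrt_version_MAJOR_MINOR_PATCH_BUILD pattern shown above.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: extracts {major, minor, patch, build} from a symbol
// name such as "tensorrt_version_5_0_3_2". Returns all zeros when the prefix
// does not match. Assumes every field is a decimal number (std::stoi throws
// otherwise).
std::array<int, 4> parseTrtVersionSymbol(const std::string& symbol) {
    const std::string prefix = "tensorrt_version_";
    std::array<int, 4> version{{0, 0, 0, 0}};
    if (symbol.compare(0, prefix.size(), prefix) != 0) {
        return version;
    }
    std::istringstream fields(symbol.substr(prefix.size()));
    std::string field;
    for (std::size_t i = 0; i < version.size() && std::getline(fields, field, '_'); ++i) {
        version[i] = std::stoi(field);
    }
    return version;
}
```

For the symbol in the nm output above, the four fields correspond to NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH and NV_TENSORRT_BUILD from the header.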

2. Is TensorRT thread-safe?
The TensorRT runtime is thread-safe in the sense that parallel threads using different TRT execution contexts can execute in parallel without interference.

3. Is the INT8 calibration table compatible across different TRT versions or HW platforms?
The INT8 calibration table is NOT compatible between different TRT versions, because the optimized network graph is likely to differ from one TRT version to another. If you force TRT to use a table from another version, it may not find the scaling factor for a given tensor. As long as the installed TensorRT version is identical on the different HW platforms, the INT8 calibration table is compatible. That means you can perform INT8 calibration on a faster platform, such as V100 or T4, and then deploy the calibration table to Tegra for INT8 inference.

4. How to check GPU utilization?
On the Tegra platform, use tegrastats:
$ sudo /home/nvidia/tegrastats
On a desktop platform, such as Tesla, use nvidia-smi:
$ nvidia-smi --format=csv -lms 500 --query-gpu=index,timestamp,utilization.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.video,clocks.current.memory,utilization.memory,memory.total,memory.free,memory.used,power.limit,power.draw,temperature.gpu,fan.speed,compute_mode,gpu_operation_mode.current,clocks_throttle_reasons.active,pstate,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sync_boost -i 0 | tee log.csv

5. What is kernel auto-tuning?
TensorRT contains various kernel implementations, including those from cuDNN and cuBLAS, to accommodate diverse neural network configurations (batch, input/output dims, filters, strides, pads, dilation rate, etc.). During network building, TensorRT profiles all suitable kernels, finds the one with the smallest latency, and marks it as the final tactic for that layer. This process is called kernel auto-tuning.
Additionally, an INT8 kernel is not always faster than an FP16 kernel, nor is an FP16 kernel always faster than an FP32 one, so
 * if you run FP16 precision mode, it profiles all candidates in the FP16 kernel pool and the FP32 kernel pool.
 * if you run INT8 precision mode, it profiles all candidates in the INT8 kernel pool and the FP32 kernel pool.
 * if both FP16 and INT8 are enabled (we call it hybrid mode), it profiles all candidates in the INT8 kernel pool, the FP16 kernel pool and the FP32 kernel pool.
If the current layer chooses a different mode from its bottom layer or top layer, TensorRT inserts a reformatting layer between them to do the tensor format conversion, and the time of this reformatting layer is counted as part of the current layer's cost during auto-tuning.
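The selection rule described here (smallest latency, with the reformatting time charged to the layer) can be illustrated with a toy model. This is only a sketch under stated assumptions: Candidate and pickTactic are hypothetical names, the latencies are supplied by the caller rather than measured on a GPU, and TensorRT's real tuner is not exposed in this form.

```cpp
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// Hypothetical description of one candidate kernel for a layer.
struct Candidate {
    std::string name;
    std::string precision;  // "FP32", "FP16" or "INT8"
    double kernelMs;        // profiled kernel latency in milliseconds
};

// Returns the index of the candidate with the smallest total cost. A candidate
// whose precision differs from the neighbouring layer also pays reformatMs for
// the inserted reformatting layer. Returns -1 when there are no candidates.
int pickTactic(const std::vector<Candidate>& candidates,
               const std::string& neighbourPrecision, double reformatMs) {
    int best = -1;
    double bestCost = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        double cost = candidates[i].kernelMs;
        if (candidates[i].precision != neighbourPrecision) {
            cost += reformatMs;
        }
        if (cost < bestCost) {
            bestCost = cost;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With an FP32 kernel at 1.0 ms, an INT8 kernel at 0.8 ms and an FP32 neighbour, a 0.3 ms reformat makes the FP32 kernel the winner, while a 0.1 ms reformat lets the INT8 kernel win.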

6. How does TensorRT behave when different batch sizes are used?
For a relatively deep network, if the GPU is already fully occupied, batching does not bring much performance gain. For a relatively simple network, the GPU is generally not fully loaded, so batching does bring a gain. In other words, inference time per frame improves with a bigger batch size only if the GPU is not fully loaded.

7. What is maxWorkspaceSize?
maxWorkspaceSize is a threshold used to filter kernels/tactics: only tactics whose required workspace size does not exceed maxWorkspaceSize are considered. In other words, if the workspace a tactic requires (for convolution, for example, cudnnGetConvolutionForwardWorkspaceSize reports the needed workspace size) is larger than the value specified for maxWorkspaceSize, the tactic is ignored during kernel auto-tuning. NOTE: the final device memory TRT consumes has nothing to do with maxWorkspaceSize.
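The filtering step can be pictured as follows. This is a minimal sketch, not TensorRT code: TacticInfo and filterByWorkspace are hypothetical names, and the workspace requirement of each tactic is assumed to be known already.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: a tactic and the scratch memory it needs.
struct TacticInfo {
    std::string name;
    std::size_t workspaceBytes;
};

// Keeps only the tactics whose workspace requirement fits within
// maxWorkspaceSize; larger ones are dropped before auto-tuning.
std::vector<TacticInfo> filterByWorkspace(const std::vector<TacticInfo>& tactics,
                                          std::size_t maxWorkspaceSize) {
    std::vector<TacticInfo> kept;
    for (const TacticInfo& tactic : tactics) {
        if (tactic.workspaceBytes <= maxWorkspaceSize) {
            kept.push_back(tactic);
        }
    }
    return kept;
}
```

A small maxWorkspaceSize therefore shrinks the candidate pool and may exclude the fastest tactic.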

8. What is the difference between enqueue and execute?

 * Enqueue requires the user to create and synchronize a cudaStream_t; when invoked, it returns immediately. With Execute, TensorRT creates and synchronizes the stream internally, so it does not return until everything is completed.
 * Enqueue and Execute are functions of the whole network/cudaEngine rather than of a specific layer. There is no enqueue interface in a layer's implementation, so a layer only has the execute interface, including the IPlugin layer.
 * Both enqueue and execute support profiling now. The time consumption of each layer is printed when the whole execution is done, rather than in real time.
 * setDebugSync is only supported by execute. If this flag is true, the builder will synchronize (cudaDeviceSynchronize) after timing each layer and report the layer name.

9. How to dump the output of certain layer?
TensorRT doesn't store the intermediate results of your network, so you have to mark the output tensor of the intended layer as a network output with the following API, then run inference again and save its result for further analysis,
network->markOutput(*layer->getOutput(0)); // layer is the ILayer* of the intended layer
NOTE:
 * You can set multiple layers as outputs at the same time, but setting a layer as output may break the network optimization and impact inference performance, as TensorRT always runs output layers in FP32 mode, no matter which mode you have configured.
 * Don't forget to adjust the dimensions or output buffer size after you change the output layer.

10. How to analyze network performance?
First of all, be aware of trtexec, the profiling command-line tool that TensorRT provides. If every layer of your network is supported by TensorRT, either natively or through a plugin, you can always use this tool to profile your network very quickly.
Second, you can add profiling metrics to your application manually from the CPU side (link) or the GPU side (link).
NOTE:
 * Time collection should only contain the network enqueue or execute; any context set-up, memory initialization or refill operation should be excluded.
 * Add more iterations to the time collection, in order to average out the GPU warm-up effect.
Third, if you would like to see the time consumption of each layer, you can implement IProfiler, or use the SimpleProfiler that TensorRT already provides (refer to the patch below for sampleSSD),
--- sampleSSD.cpp.orig	2019-05-27 12:39:14.193521455 +0800
+++ sampleSSD.cpp	2019-05-27 12:38:59.393358775 +0800
@@ -428,8 +428,11 @@
     float* detectionOut = new float[N * kKEEP_TOPK * 7];
     int* keepCount = new int[N];
+    SimpleProfiler profiler (" layer time");
+    context->setProfiler(&profiler);
     // Run inference
     doInference(*context, data, detectionOut, keepCount, N);
+    std::cout << profiler;
     bool pass = true;
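The two notes on time collection (measure only the inference call, and use enough iterations to average out warm-up) can be sketched from the CPU side as below. averageLatencyMs is a hypothetical helper; it assumes the callable passed in is synchronous, for example a wrapper around execute, or around enqueue followed by a stream synchronization.

```cpp
#include <chrono>
#include <functional>

// Runs `infer` `warmup` times without timing, then returns the mean wall-clock
// time in milliseconds per call over `iterations` timed calls. Set-up and
// buffer refills are expected to happen outside `infer`.
double averageLatencyMs(const std::function<void()>& infer, int warmup, int iterations) {
    if (iterations <= 0) {
        return 0.0;
    }
    for (int i = 0; i < warmup; ++i) {
        infer();  // warm-up runs are executed but not measured
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        infer();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

steady_clock is used here because it is monotonic; for GPU-side timing, CUDA events would be the usual alternative.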

11. How to optimize network performance?
When you start optimizing your network performance, first make sure the methodology you use to profile your network is convincing (refer to question #10).
 * Multi-batch
We strongly recommend running your network in multi-batch mode so that the GPU computation resources are fully used. Multi-batch inference almost always performs better, unless your network is deep or complicated enough to saturate the GPU already.
 * Lower precision mode
TensorRT supports inference in FP16 or INT8 mode. Generally, speed improves from FP32 to FP16 to INT8.
For FP16, it's very simple to enable it.
For INT8, if you don't care about correctness or accuracy during network evaluation, you can simply use a dummy dynamic range to get the network running in INT8.
NOTE: if you finally choose INT8 as the deployment mode, you have to implement the ICalibrator or set a proper dynamic range for your network.
If the performance of INT8 or FP16 is not significantly improved, don't panic; break the issue down step by step:
    * Dump the per-layer time and compare it between FP32 and FP16 or INT8.
    * Figure out which layers take the most time. If they are FC layers, you probably need to enable hybrid mode (enable both FP16 and INT8).
    * If your network has many plugin layers, or those plugin layers are interleaved through the whole network, TensorRT will insert many reformat layers to convert the data layout between normal layers and plugin layers. These reformat layers can be eliminated in some cases, for example in a network with PReLU (which is not supported by TensorRT 5.1 or prior versions). After network training is done, the slope parameter of PReLU can be converted to the negative-slope hyperparameter of LeakyReLU (which is natively supported by TensorRT).
    * Generally, the speedup of lower precision modes comes mainly from convolution layers. If convolution layers account for only a small part of your network's inference time, it is expected that FP16 or INT8 cannot help performance much. In this case, consider how to optimize the non-convolution layers, or contact NVIDIA for advice.
 * Network pruning
This is beyond what TensorRT can help with, but it should also be kept in mind for network optimization, for example the network pruning provided in NVIDIA TLT. From the GPU HW perspective, there are also some tricks for designing or pruning networks. For example, Tensor Cores on T4 or Xavier are friendlier to convolutions whose channel counts are multiples of 32 or 64. This should guide you when you design or prune your feature extraction layers.
12. How to fix the error “Could not find scales for tensor xxxx” for INT8 mode?
Generally, after INT8 calibration is done, the Int8Calibrator saves the scaling factors into a local file (through the API writeCalibrationCache), so that subsequent runs can load the cached calibration table directly (through the API readCalibrationCache) instead of calibrating again. If you change or update the network, or run it on a different GPU platform or with a different TensorRT version, you may get the error "Could not find scales for tensor xxxx", which indicates that the builder couldn't find the corresponding scaling factor in the locally cached calibration table. This is expected, since the network graph after fusion changes with the GPU platform, the TensorRT version, or modifications to the network itself. The solution is simple: remove the local calibration table and run calibration again.
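The cache handling behind this fix can be sketched with a small file-backed helper. CalibrationCacheFile is a hypothetical class and does not derive from the TensorRT calibrator interface; its read and write methods only mirror the shape of readCalibrationCache and writeCalibrationCache. The sketch assumes that a calibrator returning no cached data causes calibration to run again.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical file-backed calibration cache.
class CalibrationCacheFile {
public:
    explicit CalibrationCacheFile(std::string path) : mPath(std::move(path)) {}

    // Persists the table produced by calibration.
    void write(const void* ptr, std::size_t length) const {
        std::ofstream out(mPath, std::ios::binary);
        out.write(static_cast<const char*>(ptr), static_cast<std::streamsize>(length));
    }

    // Returns the cached table, or nullptr with length 0 when no cache exists.
    const void* read(std::size_t& length) {
        mBuffer.clear();
        std::ifstream in(mPath, std::ios::binary);
        if (in) {
            mBuffer.assign(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
        }
        length = mBuffer.size();
        return mBuffer.empty() ? nullptr : mBuffer.data();
    }

    // Deletes a stale table so that the next build has to calibrate again.
    bool invalidate() const { return std::remove(mPath.c_str()) == 0; }

private:
    std::string mPath;
    std::vector<char> mBuffer;
};
```

Calling invalidate() after changing the network, the GPU platform or the TensorRT version corresponds to the "remove the local calibration table" step above.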

13. How to enable VERBOSE log?
You can enable the VERBOSE log for your sample through the following API,
gLogger.setReportableSeverity(Severity::kVERBOSE);
It's always recommended to provide the VERBOSE log to NVIDIA when asking for further assistance.

14. How to fix INT8 accuracy issue?
Basically, you should be able to get an absolutely correct result in FP32 mode and a roughly correct result in INT8 mode after calibration. If the FP32 result is as expected while the INT8 result is totally wrong, the cause is probably incorrect calibration. The IInt8Calibrator contains four virtual methods that need to be implemented, as shown below; the most important and problematic one is getBatch,
virtual int getBatchSize() const = 0;
virtual bool getBatch(void* bindings[], const char* names[], int nbBindings) = 0;
virtual const void* readCalibrationCache(std::size_t& length) = 0;
virtual void writeCalibrationCache(const void* ptr, std::size_t length) = 0;
After you get a roughly correct result in INT8 mode, you can start evaluating its accuracy against the whole test dataset. If you get poor classification or detection accuracy compared to FP32 mode, we suggest the checks listed below. (Q: what counts as a 'poor' result? For example, we are seeing within 1% INT8 accuracy loss for popular classification CNNs, like AlexNet, VGG19, Resnet50/101/152, and detection networks, like VGG16_FasterRCNN_500x375 and VGG16_SSD_300x300. If your accuracy loss is much larger than 1%, it might be the 'poor' case.)
 * Is the calibration input after preprocessing identical to the preprocessed input of FP32 inference? If you are not sure, just compare the buffers before feeding them into TensorRT.
 * Is the calibration dataset large enough? Ensure the calibration dataset is diverse and representative.
 * Is a cached and incorrect calibration table being used?
 * Can your network run in INT8 mode at all? Most training frameworks can also run the network in INT8 mode, so you should validate this outside of TensorRT. Only once you are sure your model can be a golden model for INT8 should you consider how to deploy it appropriately through TensorRT INT8; otherwise, TensorRT cannot help.
 * TensorRT does provide an internal quantization method for customers to use, but it is a post-training quantization method and exposes little control to users, so it can't work for all networks. If your model is unluckily such a case, consider an external quantization methodology and insert the dynamic range into TensorRT through the following API,
virtual bool setDynamicRange(float min, float max) = 0;
Additionally, someone might directly compare the value distributions of INT8 and FP32 for a certain middle layer and be surprised to find a big discrepancy between them. This is somewhat expected, since TensorRT uses saturating quantization and some values do overflow after INT8 computation. The relative entropy methodology for INT8 aims to minimize the loss of information and to preserve the final detection or classification accuracy. Even if you see a big deviation in middle-layer activations, after propagation through the whole network the final accuracy would probably decrease just a bit.
Hence, it's not convincing to use the value deviation between INT8 and FP32 to evaluate INT8 accuracy.
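To see why middle-layer values can deviate a lot under saturating quantization, consider this minimal sketch of symmetric INT8 quantization with a clipping threshold. quantizeInt8 and dequantizeInt8 are hypothetical helpers, the threshold stands in for a dynamic range chosen by calibration, and this is not TensorRT's internal implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Maps [-threshold, threshold] onto [-127, 127]; values outside that range
// saturate at the ends instead of wrapping. `threshold` must be positive.
std::int8_t quantizeInt8(float value, float threshold) {
    const float scale = threshold / 127.0f;
    const float q = std::round(value / scale);
    return static_cast<std::int8_t>(std::max(-127.0f, std::min(127.0f, q)));
}

// Maps a quantized value back to floating point with the same scale.
float dequantizeInt8(std::int8_t q, float threshold) {
    return static_cast<float>(q) * (threshold / 127.0f);
}
```

With a threshold of 2.0, an activation of 10.0 is stored as 127 and reads back as 2.0: a large per-value deviation that the calibration accepts in exchange for finer resolution inside the range.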

15. Is a TensorRT model compatible across different GPU platforms or TensorRT versions?
The answer is NO, for the following reasons:
 * Different GPU platforms have different SM counts, CUDA cores or Tensor Cores, and memory efficiency, so the same kernel may perform differently on them. That means the TensorRT builder may choose a different kernel/tactic implementation during network optimization even for the same input configuration.
 * Network fusion and layer kernels are evolving all the time, so with different TensorRT versions the optimized network graph or the chosen best tactic might be different. It's possible that a kernel chosen for the current TensorRT engine does not exist in another TensorRT version.

16. Why do we need clone for the Plugin interface?
The TensorRT API documentation provides an explanation of this. In short, clone is intended to share immutable resources, such as input/output dimensions, workspace pointer or weights, among different execution contexts. These immutable resources or parameters are created once per engine and copied on each invocation of clone.
 * The following shows the callback flow of all key IPlugin APIs from network parsing, to engine building, to inference (taken from TensorRT 5.1.5).
NOTE: the value '0x12458e00' is the plugin object pointer and can be used to track when the object gets destroyed.

parsing:
CropAndResizePluginCreator::createPlugin --> 0x12458e00
CropAndResizePlugin::CropAndResizePlugin
CropAndResizePlugin::setPluginNamespace
CropAndResizePlugin::getNbOutputs
CropAndResizePlugin::clone --> 0x12459300
CropAndResizePlugin::CropAndResizePlugin
CropAndResizePlugin::getNbOutputs
CropAndResizePlugin::getOutputDataType
CropAndResizePlugin::isOutputBroadcastAcrossBatch
CropAndResizePlugin::getOutputDimensions
parsing done:

engine building:
  --> graph optimization, tactic selection and decide format
CropAndResizePlugin::clone --> 0xf9e8e0
CropAndResizePlugin::CropAndResizePlugin
CropAndResizePlugin::supportsFormat
CropAndResizePlugin::getOutputDataType

CropAndResizePlugin::clone --> 0x6380f750
CropAndResizePlugin::CropAndResizePlugin
CropAndResizePlugin::supportsFormat
CropAndResizePlugin::configurePlugin
CropAndResizePlugin::getWorkspaceSize
CropAndResizePlugin::initialize
CropAndResizePlugin::getSerializationSize
CropAndResizePlugin::serialize
CropAndResizePlugin::destroy --> 0xf9e8e0
CropAndResizePlugin::~CropAndResizePlugin
CropAndResizePlugin::getSerializationSize
CropAndResizePlugin::serialize
CropAndResizePlugin::destroy --> 0x12458e00
CropAndResizePlugin::~CropAndResizePlugin
CropAndResizePlugin::destroy --> 0x12459300
CropAndResizePlugin::~CropAndResizePlugin
engine building done:

createExecutionContext and infer
CropAndResizePlugin::clone --> 0x166b2840
CropAndResizePlugin::CropAndResizePlugin
CropAndResizePlugin::attachToContext
CropAndResizePlugin::enqueue
CropAndResizePlugin::detachFromContext
CropAndResizePlugin::terminate
CropAndResizePlugin::destroy --> 0x166b2840
CropAndResizePlugin::~CropAndResizePlugin
CropAndResizePlugin::terminate
CropAndResizePlugin::destroy --> 0x6380f750
CropAndResizePlugin::~CropAndResizePlugin
Finished

 * The following shows the callback flow from engine/plan deserializing to inference (taken from TensorRT 5.1.5).

Deserializing:
CropAndResizePluginCreator::deserializePlugin --> 0xd10c5b0
CropAndResizePlugin::CropAndResizePlugin
CropAndResizePlugin::initialize
Deserializing done
createExecutionContext and infer:
CropAndResizePlugin::enqueue
CropAndResizePlugin::terminate
CropAndResizePlugin::destroy
CropAndResizePlugin::~CropAndResizePlugin
Finished

NOTE: TensorRT 5.1 has a known issue that initialize and terminate don't appear in pairs (terminate is invoked twice, by different objects). If you want to allocate an immutable resource for your plugin in initialize, you had better hold it in a std::shared_ptr so that it won't get released twice from terminate.
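The std::shared_ptr advice can be sketched with a toy class. ToyPlugin is a hypothetical stand-in written for this page and does not implement the TensorRT plugin interface; it only shows how clones can share one immutable resource and how terminate can be called on several objects, or twice on one, without a double release.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a plugin that owns immutable weights.
class ToyPlugin {
public:
    explicit ToyPlugin(std::vector<float> weights)
        : mWeights(std::make_shared<const std::vector<float>>(std::move(weights))) {}

    // clone() copies the shared handle, not the weight data itself.
    std::unique_ptr<ToyPlugin> clone() const {
        return std::unique_ptr<ToyPlugin>(new ToyPlugin(*this));
    }

    // Drops only this object's reference; the data is freed when the last
    // holder lets go, so repeated or unpaired calls are harmless.
    void terminate() { mWeights.reset(); }

    const std::vector<float>* weights() const { return mWeights.get(); }
    long useCount() const { return mWeights.use_count(); }

private:
    std::shared_ptr<const std::vector<float>> mWeights;
};
```

In the flows above, each clone --> 0x... line would correspond to one more holder of the same resource.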

17. How to fix FP16 accuracy issue?
The data ranges of FP32, FP16 and INT8 are as follows:
 * FP32: about -3.4e38 to 3.4e38
 * FP16: -65504 to 65504
 * INT8: -128 to 127
Unlike with INT8, we generally wouldn't see overflow (an activation or weight larger than 65504 or less than -65504) in FP16 computation, but underflow (magnitude less than 5.96e-8) can still appear compared to FP32 values.
To debug FP16 accuracy, we can dump the results of middle layers to check whether the FP16 activation values deviate a lot from the FP32 ones. According to our experience, batch normalization and activation (Relu) can effectively decrease the information loss of FP16, as in the statistics we collected from a UNet semantic segmentation network.
NOTE: If we want to dump the FP16 result of the first layer, we have to set it as an output layer, but setting a layer as output probably causes the TensorRT builder to run this layer in FP32 rather than FP16. (This is probably because the input and output are both FP32: running the layer in FP16 would need reformatting before and after, and this reformat overhead might be larger than the benefit of running in FP16.) In this case, use the following API to make the network run strictly in FP16 mode without considering any performance optimization,
builder->setStrictTypeConstraints(true);
From the UNet statistics, we can see that
 * Convolution in FP16 does have a 0.9% loss compared to the FP32 result.
 * Batch normalization helps decrease the loss significantly, from 0.9% to 0.014%.
 * Activation/Relu can also help (possibly because negative values get clipped to zero for both FP16 and FP32, so the loss is roughly halved).
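When scanning dumped FP32 activations for values that are at risk in FP16, the two limits quoted above (65504 and 5.96e-8) can be applied directly. classifyForFp16 is a hypothetical helper; it is a simplified sketch that only compares magnitudes against those limits and ignores FP16 rounding behaviour near them.

```cpp
#include <cmath>

enum class Fp16Risk { kNone, kOverflow, kUnderflow };

// Classifies an FP32 value by whether its magnitude falls outside the range
// that FP16 can represent.
Fp16Risk classifyForFp16(float value) {
    const float kMaxFp16 = 65504.0f;  // largest finite FP16 magnitude
    const float kMinFp16 = 5.96e-8f;  // smallest positive FP16 magnitude
    const float magnitude = std::fabs(value);
    if (magnitude > kMaxFp16) {
        return Fp16Risk::kOverflow;
    }
    if (magnitude != 0.0f && magnitude < kMinFp16) {
        return Fp16Risk::kUnderflow;
    }
    return Fp16Risk::kNone;
}
```

Counting the kUnderflow results per layer gives a rough picture of where FP16 loses information relative to FP32.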

18. What is maximum model size TensorRT can support?
TensorRT can support a model of at most 2 GB (INT_MAX bytes), which is also the hard limit of Google protobuf.

19. Why does TRT define axis parameters differently from other frameworks, like Caffe or TensorFlow?
TensorRT treats the batch as an immutable dimension across the whole network, so it mostly starts indexing from the channel (as axis 0). When you specify axis 1 expecting processing to start from the channel, TensorRT may interpret it as starting from H. In short, TensorRT layers deal with CHW rather than NCHW. The TensorRT parser code is open on GitHub, so you can check the parsing function there to understand exactly how it handles your layer parameter and fix the discrepancy accordingly (the general solution is to subtract 1 from the axis to strip out the batch dimension).
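The "subtract 1" rule can be written down as a one-line conversion. toTrtAxis is a hypothetical helper for illustration; it assumes an NCHW-style axis from the framework and a CHW-style axis on the TensorRT side, as described above.

```cpp
// Converts an NCHW-style axis (0 = N, 1 = C, 2 = H, 3 = W) to the CHW-style
// axis described above (0 = C, 1 = H, 2 = W). Returns -1 for the batch axis,
// which has no CHW counterpart.
int toTrtAxis(int frameworkAxis) {
    return frameworkAxis - 1;
}
```

A Caffe layer that operates on axis 1 (channel) would therefore be configured with axis 0.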

TRT & YoloV3 FAQ
Refer to the page TensorRT/YoloV3