TensorRT/PerfIssues

How to optimize network performance?
When you start optimizing your network performance, first make sure the methodology you use to profile the network is sound (refer to question #12).
 * Multi-batch: We strongly recommend running your network in multi-batch mode so that the GPU's compute resources are fully utilized. Multi-batch inference almost always performs better, unless the network is deep or complex enough to saturate the GPU on its own.
 * Lower precision mode: TensorRT supports inference in FP16 and INT8 modes. Generally, speed improves going from FP32 to FP16 to INT8. FP16 is very simple to enable (in the TensorRT 5.x C++ API, a single call to builder->setFp16Mode(true)).
 * For INT8, if you don't care about correctness or accuracy during network evaluation, you can simply set a dummy dynamic range to get the network running in INT8.
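The dummy dynamic range idea can be sketched in Python as below. This is only an illustration under assumptions: the helper name `set_dummy_dynamic_ranges` and the `bound` value are made up here, and the `network` object is assumed to expose the accessors found on network definitions in the TensorRT Python API (`num_inputs`, `get_input`, `num_layers`, `get_layer`, and per-tensor `dynamic_range`). Check them against the TensorRT version you use.

```python
def set_dummy_dynamic_ranges(network, bound=127.0):
    """Assign the same symmetric range (-bound, bound) to every tensor.

    Hypothetical helper for timing runs only: the ranges are not derived
    from real data, so INT8 results will not be accurate.
    `network` is assumed to look like a TensorRT network definition.
    Returns the number of tensors that were given a range.
    """
    touched = 0

    # Network inputs need a range as well as layer outputs.
    for i in range(network.num_inputs):
        network.get_input(i).dynamic_range = (-bound, bound)
        touched += 1

    # Every output tensor of every layer gets the same dummy range.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        for j in range(layer.num_outputs):
            tensor = layer.get_output(j)
            if tensor is not None:
                tensor.dynamic_range = (-bound, bound)
                touched += 1

    return touched
```

Call it on the populated network before building the engine. How INT8 mode itself is switched on depends on the TensorRT version and is not shown here.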


 * NOTE: If you eventually choose INT8 as the deployment mode, you must implement the ICalibrator or set a proper dynamic range for your network.
 * If INT8 or FP16 does not improve performance significantly, don't panic. Break the issue down step by step:
 * Dump the per-layer times and compare them between FP32 and FP16 or INT8.
 * Find the layers that take the most time. If they are FC layers, you probably need to enable hybrid mode (both FP16 and INT8 enabled).
 * If your network has many plugin layers, or plugin layers are interleaved throughout the network, TensorRT inserts many reformat layers to convert the data layout between native layers and plugin layers. These reformat layers can sometimes be eliminated. For example, in a network with PReLU (which TensorRT 5.1 and earlier versions do not support natively), you could replace PReLU with Leaky ReLU, which is a native layer, provided this does not reduce accuracy much.
 * Generally, the speedup from lower precision comes mainly from convolution layers. If convolution layers account for only a small part of total inference time, FP16 or INT8 cannot be expected to help much. In that case, consider how to optimize the non-convolution layers, or ask NVIDIA for advice.
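The per-layer comparison described above can be sketched as follows. It assumes you have already collected per-layer times (in milliseconds) for each precision into plain dictionaries. The function names, the name-based matching of convolution layers, and the sample numbers are all hypothetical and only illustrate the bookkeeping. Layer names may also differ between precisions because of fusion, so real profiles may need some manual matching.

```python
def compare_layer_times(fp32_ms, low_ms):
    """Return (layer, fp32_ms, low_ms, speedup) rows, slowest low-precision layer first.

    Layers that appear in only one profile (for example reformat layers
    inserted in one precision only) get None for the missing time and speedup.
    """
    rows = []
    for name in sorted(set(fp32_ms) | set(low_ms)):
        a = fp32_ms.get(name)
        b = low_ms.get(name)
        speedup = a / b if a is not None and b else None
        rows.append((name, a, b, speedup))
    # Put the layers that dominate the low-precision run at the top.
    rows.sort(key=lambda r: r[2] if r[2] is not None else 0.0, reverse=True)
    return rows


def time_share(times_ms, keyword):
    """Fraction of total time spent in layers whose name contains `keyword`."""
    total = sum(times_ms.values())
    if total == 0:
        return 0.0
    matched = sum(t for name, t in times_ms.items() if keyword in name.lower())
    return matched / total


# Hypothetical numbers, for illustration only (not measurements).
fp32_example = {"conv1": 4.0, "conv2": 6.0, "fc": 5.0}
fp16_example = {"conv1": 2.0, "conv2": 2.0, "fc": 5.0, "reformat_0": 0.5}

for row in compare_layer_times(fp32_example, fp16_example):
    print(row)
print("conv share of FP16 time:", time_share(fp16_example, "conv"))
```

Layers with a speedup close to 1 that sit near the top of the list are the ones worth investigating first. A small convolution share suggests that lower precision alone will not help much.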

 * Network pruning: This is beyond what TensorRT itself can help with, but it should also be kept in mind as an approach to network optimization, for example the network pruning provided in NVIDIA TLT. From the GPU hardware perspective, there are also some tricks for designing or pruning networks. For example, the Tensor Cores on T4 or Xavier are friendlier to convolutions whose channel counts are multiples of 32 or 64. Keep this in mind when you design or prune your feature extraction layers.
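As a small illustration of the channel-count hint, the helpers below snap a channel count to a multiple of 32 or 64. The function names are hypothetical and the choice between rounding up and rounding to the nearest multiple is an assumption. Which one suits your network depends on the accuracy you can give up when pruning.

```python
def round_up_channels(channels, multiple=32):
    """Smallest multiple of `multiple` that is >= channels."""
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    return ((channels + multiple - 1) // multiple) * multiple


def nearest_channels(channels, multiple=32):
    """Multiple of `multiple` closest to channels (ties round up, minimum one multiple)."""
    if channels <= 0 or multiple <= 0:
        raise ValueError("channels and multiple must be positive")
    lower = (channels // multiple) * multiple
    upper = lower + multiple
    if lower == 0:
        return upper
    return lower if channels - lower < upper - channels else upper
```

For instance, when a pruning pass suggests keeping 50 channels in a layer, `round_up_channels(50, 32)` keeps a few extra filters, while `nearest_channels(40, 32)` prunes slightly harder to land on a multiple of 32.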