TensorRT/How2Debug

Layer Dump and Analyze

Refer to the old page or new page


How to dump the output of a certain layer?

TensorRT doesn't store the intermediate results of your network, so you have to use the following API to mark the intended layer's output tensor as a network output, then run inference again and save its result for further analysis,

network->markOutput(*tensor)   // tensor is the ITensor* produced by the layer you want to dump

NOTE:

  • You can set multiple layers as outputs at the same time, but marking a layer as an output may break the network optimization and impact inference performance, since TensorRT always runs output layers in FP32 mode, no matter which mode you have configured.
  • Don't forget to adjust the dimensions and the output buffer size after you change the output layers (see the sketch below).
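
For example, a minimal sketch of marking an intermediate tensor as an additional output with the C++ network API (the layer index i and the binding name "debug_output" are illustrative placeholders, not from any TensorRT sample):

    // Mark the first output tensor of an intermediate layer as a network output.
    nvinfer1::ILayer*  layer  = network->getLayer(i);   // the layer whose result you want to dump
    nvinfer1::ITensor* tensor = layer->getOutput(0);
    tensor->setName("debug_output");                    // optional: give the new binding a readable name
    network->markOutput(*tensor);

    // After rebuilding the engine, allocate an extra device buffer for the new binding;
    // its size follows from engine->getBindingDimensions(engine->getBindingIndex("debug_output")).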

How to debug an ONNX model by setting an extra output layer?

Sometimes we need to debug a model by dumping the output of a middle layer. This FAQ shows a way to set a middle layer as an output when debugging an ONNX model.
The steps below set one middle layer of the mnist.onnx model as an output, using the patch shown at the bottom.

  1. Download onnx-tensorrt and mnist.onnx
  2. Get all nodes' info: apply the first section of the patch ("dump all nodes' output"), rebuild onnx2trt, then run the following command to list all nodes:
    $ ./onnx2trt mnist.onnx -o mnist.engine
  3. Set one layer as output: pick a node name from the output of step 2, set it as the output with the second section of the patch ("set one layer as output"), rebuild onnx2trt, and run the command below to regenerate the engine:
    $ ./onnx2trt mnist.onnx -o mnist.engine
  4. Dump output with the engine file:
    $ ./trtexec  --engine=mnist.engine --input=Input3 --output=Plus214_Output_0 --output=Convolution110_Output_0 --dumpOutput

Here is the patch based on onnx-tensorrt

diff --git a/ModelImporter.cpp b/ModelImporter.cpp
index ac4749c..8638add 100644
--- a/ModelImporter.cpp
+++ b/ModelImporter.cpp
@@ -524,6 +524,19 @@ ModelImporter::importModel(::ONNX_NAMESPACE::ModelProto const &model,
     output_names.push_back(model.graph().output(i).name());
   }
+  // ======= dump all nodes' output ============
+  int node_size = graph.node_size();
+  cout << "ModelImporter::importModel : graph.node_size() = " << node_size << " *******" << endl;
+  for (int i = 0; i < graph.node_size(); i++) {
+         ::ONNX_NAMESPACE::NodeProto const& node = graph.node(i);
+         if( node.output().size() > 0 ) {
+                 cout << "node[" << i << "] = "
+                        << node.output(0) << ":"
+                        << node.op_type() << endl;
+         }
+  }
+  // =========================================
+
   string_map<TensorOrWeights> tensors;
   TRT_CHECK(importInputs(&_importer_ctx, graph, &tensors, weight_count,
                          weight_descriptors));
@@ -559,10 +572,17 @@ ModelImporter::importModel(::ONNX_NAMESPACE::ModelProto const &model,
     }
   }
   _current_node = -1;
+
+  // =========== set one layer as output; "Convolution110_Output_0" is taken from the above dump ==
+  nvinfer1::ITensor* new_output_tensor_ptr = &tensors.at("Convolution110_Output_0").tensor();
+  _importer_ctx.network()->markOutput(*new_output_tensor_ptr);
+  // ==========================================================================================
+

How to analyze network performance?

First of all, be aware of the profiling command-line tool that TensorRT provides, trtexec.
If all of your network layers are supported by TensorRT, either natively or through plugins, you can use this tool to profile your network very quickly.
Second, you can add profiling metrics to your application manually, on either the CPU side or the GPU side.
NOTE:

  • Time collection should cover only the network enqueue() or execute() call; any context set-up, memory initialization, or buffer refill operations should be excluded.
  • Run more iterations and average the timings, in order to amortize the GPU warm-up effect (see the sketch below).
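
For instance, a minimal CPU-side timing sketch that respects both notes might look like the following (the function name and iteration counts are illustrative, not from any TensorRT sample; it assumes the implicit-batch execute() API of TensorRT 6):

    #include <chrono>
    #include "NvInfer.h"

    // Measure the average latency of synchronous inference, excluding warm-up runs.
    double averageLatencyMs(nvinfer1::IExecutionContext& context, void** buffers, int batchSize)
    {
        constexpr int kWarmUp = 10;   // untimed warm-up iterations
        constexpr int kIters  = 100;  // timed iterations to average over

        for (int i = 0; i < kWarmUp; ++i)
            context.execute(batchSize, buffers);

        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < kIters; ++i)
            context.execute(batchSize, buffers);  // synchronous call, so CPU timestamps are valid
        auto end = std::chrono::high_resolution_clock::now();

        return std::chrono::duration<double, std::milli>(end - start).count() / kIters;
    }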

Third, if you would like to scope the time consumption of each layer, you can implement IProfiler yourself, or utilize the SimpleProfiler that the TensorRT samples already provide (refer to the patch for sampleSSD below, and the IProfiler sketch after it),

--- sampleSSD.cpp.orig	2019-05-27 12:39:14.193521455 +0800
+++ sampleSSD.cpp	2019-05-27 12:38:59.393358775 +0800
@@ -428,8 +428,11 @@
     float* detectionOut = new float[N * kKEEP_TOPK * 7];
     int* keepCount = new int[N];
 
+    SimpleProfiler profiler (" layer time");
+    context->setProfiler(&profiler);
     // Run inference
     doInference(*context, data, detectionOut, keepCount, N);
+    gLogInfo << profiler;
 
     bool pass = true;
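
If SimpleProfiler is not available in your samples tree, a minimal custom IProfiler along the same lines might look like this sketch (the class and member names are illustrative; the reportLayerTime() signature is as in TensorRT 6, later releases add noexcept):

    #include <iostream>
    #include <map>
    #include <string>
    #include "NvInfer.h"

    // Accumulates per-layer execution time across inference runs.
    struct LayerTimeProfiler : public nvinfer1::IProfiler
    {
        std::map<std::string, float> mTimesMs;

        void reportLayerTime(const char* layerName, float ms) override
        {
            mTimesMs[layerName] += ms;
        }

        void print() const
        {
            for (const auto& kv : mTimesMs)
                std::cout << kv.first << ": " << kv.second << " ms" << std::endl;
        }
    };

    // Usage sketch:
    //   LayerTimeProfiler profiler;
    //   context->setProfiler(&profiler);
    //   context->execute(batchSize, buffers);   // older releases report per-layer time for execute() only
    //   profiler.print();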

How to know which layers are running on what precision (INT8/FP16/FP32)?

TensorRT will determine the final precision mode of each layer according to the following aspects,

  • Whether the layer has any kernel implementation for the user-specified mode. For example, the Softmax layer doesn't support INT8, so TensorRT has to fall back to FP32 even when the user configures INT8.
  • If there are several implementations, which one is the fastest from a latency perspective? For INT8, for example, there are CUDA-core-based INT8 kernels and Tensor-core-based INT8 kernels, but the Tensor-core-based kernels require the channel count to be aligned to 32. Take the typical first convolution, whose input channel count is 3: it has to be padded to 32 channels, so much of the computation is wasted (29/32). As a result, the CUDA-core-based INT8 kernel (which only requires alignment to 4) might be faster.
  • Whether any reformatting is required before or after the layer. For example, if the current layer is the output layer and it requires FP32 output, running it in INT8 mode may give some performance benefit, but an extra reformatting cost is needed to convert the INT8 result to FP32.
  • Whether the user specifies strict type constraints. If so, TensorRT ignores the performance evaluation and forces the layer to run in the specified mode, as long as a corresponding implementation exists (see the sketch below).

As a result, it is hard to foresee in advance which precision each layer will finally run in.
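
If you do need a particular precision, a rough sketch of requesting strict type constraints at build time looks like the following (the builder config and network objects come from your own build code; flag names are as in TensorRT 6/7, where later releases replace kSTRICT_TYPES with dedicated precision-constraint flags):

    // Request INT8 and ask TensorRT to obey per-layer precision settings strictly.
    config->setFlag(nvinfer1::BuilderFlag::kINT8);
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);

    // Pin one layer (index i is a placeholder) to FP32, e.g. to debug INT8 accuracy.
    nvinfer1::ILayer* layer = network->getLayer(i);
    layer->setPrecision(nvinfer1::DataType::kFLOAT);
    layer->setOutputType(0, nvinfer1::DataType::kFLOAT);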

Although TensorRT doesn't provide a straightforward API to query such information, you can get insight from the verbose build log (TensorRT 6.0).
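
To capture this log, verbose messages have to be enabled, e.g. by running trtexec --verbose, or, in your own application, by passing a logger that lets Severity::kVERBOSE messages through. A minimal sketch (the class name is illustrative; the log() signature is as in TensorRT 6, later releases add noexcept):

    #include <iostream>
    #include "NvInfer.h"

    // Lets verbose builder messages through, so tactic and precision selection is printed.
    class VerboseLogger : public nvinfer1::ILogger
    {
    public:
        void log(Severity severity, const char* msg) override
        {
            // kVERBOSE is the least severe level, so this prints everything.
            if (severity <= Severity::kVERBOSE)
                std::cout << msg << std::endl;
        }
    };

With verbose logging enabled, the builder prints output like the following,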

[09/15/2019-13:28:03] [V] [TRT] Formats and tactics selection completed in 6.49572 seconds.
[09/15/2019-13:28:03] [V] [TRT] After reformat layers: 10 layers
[09/15/2019-13:28:03] [V] [TRT] Block size 16777216
[09/15/2019-13:28:03] [V] [TRT] Block size 2359296
[09/15/2019-13:28:03] [V] [TRT] Block size 589824
[09/15/2019-13:28:03] [V] [TRT] Total Activation Memory: 19726336
[09/15/2019-13:28:03] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[09/15/2019-13:28:03] [V] [TRT] conv1 (icudnn) Set Tactic Name: volta_int8x4_icudnn_int8x4_128x32_relu_interior_c32_nn_v1
[09/15/2019-13:28:03] [V] [TRT] conv2 (i8816cudnn) Set Tactic Name: turing_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_interior_nt_v1
[09/15/2019-13:28:03] [V] [TRT] Engine generation completed in 8.50567 seconds.
[09/15/2019-13:28:03] [V] [TRT] Engine Layer Information:
[09/15/2019-13:28:03] [V] [TRT] Layer: scale (Scale), Tactic: 0, data[Float(1,28,28)] -> scale[Float(1,28,28)]
[09/15/2019-13:28:03] [V] [TRT] Layer: conv1 input reformatter 0 (Reformat), Tactic: 0, scale[Float(1,28,28)] -> conv1 reformatted input 0[Int8(1,28,28)]
[09/15/2019-13:28:03] [V] [TRT] Layer: conv1 (icudnn), Tactic: -4208188808979933945, conv1 reformatted input 0[Int8(1,28,28)] -> conv1[Int8(20,24,24)]
[09/15/2019-13:28:03] [V] [TRT] Layer: pool1 (Pooling), Tactic: -4, conv1[Int8(20,24,24)] -> pool1[Int8(20,12,12)]
[09/15/2019-13:28:03] [V] [TRT] Layer: conv2 (i8816cudnn), Tactic: -545024964146453661, pool1[Int8(20,12,12)] -> conv2[Int8(50,8,8)]
[09/15/2019-13:28:03] [V] [TRT] Layer: pool2 (Pooling), Tactic: -4, conv2[Int8(50,8,8)] -> pool2[Int8(50,4,4)]
[09/15/2019-13:28:03] [V] [TRT] Layer: ip1 + relu1 input reformatter 0 (Reformat), Tactic: 0, pool2[Int8(50,4,4)] -> ip1 + relu1 reformatted input 0[Float(50,4,4)]
[09/15/2019-13:28:03] [V] [TRT] Layer: ip1 + relu1 (FullyConnected), Tactic: 1, ip1 + relu1 reformatted input 0[Float(50,4,4)] -> ip1[Float(500,1,1)]
[09/15/2019-13:28:03] [V] [TRT] Layer: ip2 (FullyConnected), Tactic: 1, ip1[Float(500,1,1)] -> ip2[Float(10,1,1)]
[09/15/2019-13:28:03] [V] [TRT] Layer: prob (SoftMax), Tactic: 1001, ip2[Float(10,1,1)] -> prob[Float(10,1,1)]

Take the above mnist log (TensorRT 6.0) as an example:

  • Formats and tactics selection completed in xxx means the TensorRT builder has finished the precision and kernel selection.
  • conv1 (icudnn) Set Tactic Name: volta_int8x4_icudnn_int8x4_128x32_relu_interior_c32_nn_v1 means the TensorRT builder chose a CUDA-core-based INT8 kernel (indicated by the kernel-name substring 'int8x4') as the final tactic for conv1.
  • conv2 (i8816cudnn) Set Tactic Name: turing_int8_i8816cudnn_int8_256x64_ldg16_relu_singleBuffer_interior_nt_v1 means the TensorRT builder chose a Tensor-core-based INT8 kernel (indicated by the kernel-name substring 'i8816') as the final tactic for conv2.
  • The lines after Engine Layer Information: list the detailed parameters of every layer, such as the tactic index, the input and output formats, and the input and output shapes.
  • conv1 input reformatter is the reformatting layer inserted by the TensorRT builder to convert the FP32 input to INT8 for conv1.