TensorRT/How2Debug

How to dump the output of a certain layer?
TensorRT doesn’t store the intermediate results of your network, so you have to mark the intended layer as an output layer with the following API, then run inference again and save its result for further analysis:
network->markOutput(*tensor)
Here tensor is the ITensor* of the layer's output; markOutput takes a tensor rather than a name, so look the tensor up by name first (for example, with the Caffe parser, blobNameToTensor->find("layer_name")).
NOTE:
 * You can set multiple layers as outputs at the same time, but setting a layer as an output may break the network optimization and impact the inference performance, as TensorRT always runs output layers in FP32 mode, no matter which mode you have configured.
 * Don’t forget to adjust the dimension or output buffer size after you change the output layer.
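
As a rough illustration of the buffer-size note above, the sketch below computes how many bytes a host or device buffer needs for one output. The function name outputBufferBytes and its parameters are hypothetical (not part of TensorRT), and it assumes FP32 elements by default, in line with the note that output layers run in FP32.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a TensorRT API): number of bytes needed for one
// output binding, given its per-sample dimensions, the batch size, and the
// size of one element (FP32 by default).
std::size_t outputBufferBytes(const std::vector<int>& dims,
                              int batchSize,
                              std::size_t elementSize = sizeof(float))
{
    assert(batchSize > 0 && elementSize > 0);
    std::size_t count = static_cast<std::size_t>(batchSize);
    for (int d : dims)
    {
        assert(d > 0); // unresolved or empty dimensions are not handled here
        count *= static_cast<std::size_t>(d);
    }
    return count * elementSize;
}
```

For instance, a newly marked output with per-sample dimensions {8, 24, 24} (made-up values) and batch size 1 would need outputBufferBytes({8, 24, 24}, 1) bytes; the real dimensions should be read from the engine after it is rebuilt.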

How to debug an ONNX model by setting an extra output layer?
Sometimes we need to debug a model by dumping the output of an intermediate layer. This FAQ shows one way to mark an intermediate layer of an ONNX model as an output for debugging. The steps below mark one intermediate layer of the mnist.onnx model as an output, using the patch shown at the bottom.
 * 1) Download onnx-tensorrt and mnist.onnx
 * 2) Get all nodes info: Apply the first section of the patch ("dump all nodes' output") and build onnx2trt, then run the following command to list all nodes: $ ./onnx2trt mnist.onnx -o mnist.engine
 * 3) Set one layer as output: Pick a node name from the output of step 2, set it as an output using the second section of the patch ("set one layer as output"), rebuild onnx2trt, and run the command below to regenerate the engine: $ ./onnx2trt mnist.onnx -o mnist.engine
 * 4) Dump output with the engine file: $ ./trtexec  --engine=mnist.engine --input=Input3 --output=Plus214_Output_0 --output=Convolution110_Output_0 --dumpOutput

Here is the patch based on onnx-tensorrt:
diff --git a/ModelImporter.cpp b/ModelImporter.cpp
index ac4749c..8638add 100644
--- a/ModelImporter.cpp
+++ b/ModelImporter.cpp
@@ -524,6 +524,19 @@ ModelImporter::importModel(::ONNX_NAMESPACE::ModelProto const &model,
     output_names.push_back(model.graph().output(i).name());
   }
 
+  // ======= dump all nodes' output ============
+  int node_size = graph.node_size();
+  cout << "ModelImporter::importModel : graph.node_size = " << node_size << " *******" << endl;
+  for (int i = 0; i < graph.node_size(); i++) {
+         ::ONNX_NAMESPACE::NodeProto const& node = graph.node(i);
+         if( node.output().size() > 0 ) {
+                 cout << "node[" << i << "] = "
+                        << node.output(0) << ":"
+                        << node.op_type() << endl;
+         }
+  }
+  // =========================================
+
   string_map<TensorOrWeights> tensors;
   TRT_CHECK(importInputs(&_importer_ctx, graph, &tensors, weight_count,
                          weight_descriptors));
@@ -559,10 +572,17 @@ ModelImporter::importModel(::ONNX_NAMESPACE::ModelProto const &model,
     }
   }
   _current_node = -1;
+
+  // =========== set one layer as output, below "Convolution110_Output_0" is from above dump ==
+  nvinfer1::ITensor* new_output_tensor_ptr = &tensors.at("Convolution110_Output_0").tensor();
+  _importer_ctx.network()->markOutput(*new_output_tensor_ptr);
+  // ==========================================================================================
+

How to analyze network performance?
First, be aware of trtexec, the profiling command-line tool that TensorRT provides. If all of your network's layers are supported by TensorRT, either natively or through plugins, you can use this tool to profile your network quickly.
Second, you can add timing to your application manually, from the CPU side (link) or the GPU side (link). NOTE:
 * Time collection should only cover the network enqueue or execute call; any context set-up, memory initialization, or refill operation should be excluded.
 * Run more iterations during time collection, in order to average out the GPU warm-up effect.
Third, if you would like to measure the time consumption of each layer, you can implement IProfiler, or use the SimpleProfiler that TensorRT already provides (refer to the patch below for sampleSSD):
--- sampleSSD.cpp.orig	2019-05-27 12:39:14.193521455 +0800
+++ sampleSSD.cpp	2019-05-27 12:38:59.393358775 +0800
@@ -428,8 +428,11 @@
     float* detectionOut = new float[N * kKEEP_TOPK * 7];
     int* keepCount = new int[N];
 
+    SimpleProfiler profiler (" layer time");
+    context->setProfiler(&profiler);
     // Run inference
     doInference(*context, data, detectionOut, keepCount, N);
+    std::cout << profiler;
 
     bool pass = true;
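
The two timing notes above (time only the enqueue/execute call, and iterate enough to average out warm-up) can be sketched from the CPU side roughly as follows. The helper averageMillis and the callable run are hypothetical names used only for illustration; run stands in for whatever wraps the execute call in your application. If the call is asynchronous (enqueue), the callable is assumed to synchronize the stream before returning, otherwise only the launch overhead is measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: average wall-clock milliseconds per call of `run`,
// measured after `warmup` untimed calls. `run` should contain only the
// enqueue/execute step; context set-up and buffer refills belong outside it.
double averageMillis(const std::function<void()>& run, int warmup, int iterations)
{
    assert(warmup >= 0 && iterations > 0);

    // Untimed warm-up calls, so start-up effects do not skew the average.
    for (int i = 0; i < warmup; ++i)
    {
        run();
    }

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        run();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

A call such as averageMillis([&] { /* execute + synchronize */ }, 10, 100) would then report the mean time per inference; the warm-up and iteration counts here are arbitrary and should be tuned for the network at hand.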

Print the fused layer time in order
1. Apply below change to make sure the layer time to be printed in the order
2. Apply below change to profile and print the layer time
3. Profile
TensorRT-5.1.6.0/bin$ ./trtexec --deploy=ResNet50_N2.prototxt --output=prob --int8 --batch=128
&&&& RUNNING TensorRT.trtexec # ./trtexec --deploy=ResNet50_N2.prototxt --output=prob --int8 --batch=128
..
[I] Average over 10 runs is 25.8291 ms (host walltime is 25.8599 ms, 99% percentile time is 26.0829).
========== layertime profile ==========
TensorRT layer name                     Runtime, %  Invocations  Runtime, ms
conv1 + conv1_relu input reformatter 0        1.6%          100        40.94
conv1 + conv1_relu                            9.9%          100       256.82
pool1                                         2.8%          100        72.19
res2a_branch2a + res2a_branch2a_relu          1.2%          100        31.29
res2a_branch2b + res2a_branch2b_relu          2.1%          100        53.49
res2a_branch2c                                3.2%          100        81.77
res2a_branch1 + res2a + res2a_relu            3.7%          100        96.50
res2b_branch2a + res2b_branch2a_relu          2.0%          100        51.35
res2b_branch2b + res2b_branch2b_relu          2.1%          100        53.62
res2b_branch2c + res2b + res2b_relu           3.7%          100        96.38
res2c_branch2a + res2c_branch2a_relu          2.0%          100        51.28
res2c_branch2b + res2c_branch2b_relu          2.1%          100        53.81
......  .........
res5c_branch2a + res5c_branch2a_relu          0.8%          100        21.43
res5c_branch2b + res5c_branch2b_relu          1.7%          100        43.53
res5c_branch2c + res5c + res5c_relu           1.0%          100        26.37
pool5                                         0.4%          100        10.40
fc1000 input reformatter 0                    0.1%          100         1.62
fc1000                                        0.6%          100        16.06
prob                                          0.1%          100         1.31
========== layertime total runtime = 2585.28 ms ==========
In the log above, you can also find out which layers are fused. For example, "res5c_branch2c + res5c + res5c_relu" indicates that layers res5c_branch2c, res5c and res5c_relu are fused into one layer.
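
To make the two ideas in this section concrete (keeping layer times in execution order, and reading fused layers off a name), here is a minimal standalone sketch. OrderedLayerTimes and splitFusedName are hypothetical names, not TensorRT classes; in a real application the body of report() would presumably sit in the reportLayerTime() callback of a class derived from nvinfer1::IProfiler. The splitting is naive: it only cuts on " + " and does not interpret suffixes such as "input reformatter 0".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical accumulator: sums the time reported for each layer name and
// keeps names in the order they were first reported (a vector instead of a
// sorted map, so the print order follows execution order).
class OrderedLayerTimes
{
public:
    void report(const std::string& layerName, float ms)
    {
        assert(ms >= 0.0f);
        for (auto& entry : mEntries)
        {
            if (entry.first == layerName)
            {
                entry.second += ms;
                return;
            }
        }
        mEntries.emplace_back(layerName, ms);
    }

    const std::vector<std::pair<std::string, float>>& entries() const
    {
        return mEntries;
    }

private:
    std::vector<std::pair<std::string, float>> mEntries;
};

// Splits a fused layer name such as "a + b + c" into its component names.
std::vector<std::string> splitFusedName(const std::string& fused)
{
    const std::string sep = " + ";
    std::vector<std::string> parts;
    std::size_t begin = 0;
    for (;;)
    {
        const std::size_t end = fused.find(sep, begin);
        if (end == std::string::npos)
        {
            parts.push_back(fused.substr(begin));
            return parts;
        }
        parts.push_back(fused.substr(begin, end - begin));
        begin = end + sep.size();
    }
}
```

Calling splitFusedName("res5c_branch2c + res5c + res5c_relu") yields the three original layer names, while an unfused name such as "pool5" comes back as a single element.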